UM-Text: A Unified Multimodal Model for Image Understanding

Lichen Ma, Xiaolong Fu, Gaojing Zhou, Zipeng Guo, Ting Zhu, Yichun Liu, Yu Shi, Jason Li, Junshi Huang

2026-01-14

Summary

This paper introduces a new method, UM-Text, for editing text within images using natural language commands, like telling a computer to 'make the font bigger' or 'change the color to blue'. It focuses on making sure the edited text looks like it naturally belongs in the image, matching the existing style.

What's the problem?

Currently, editing text in images based on what you *tell* the computer to do is difficult. Existing methods are complicated, requiring you to specify many details about the text – like font, size, and placement – instead of just describing what you want. They also often struggle to make the new text blend in with the image's overall look and feel, appearing unnatural or out of place.

What's the solution?

The researchers developed UM-Text, a system that understands both the instruction you give and the image itself. A Visual Language Model (VLM) reads the instruction and the reference image to decide what the text should say and how it should look within the scene. A component called the UM-Encoder then combines the different pieces of condition information about the image and text, with the combination configured automatically by the VLM based on your instruction. They also designed a special training process, including a loss that focuses supervision on the edited text region and a three-stage training strategy, to help the model learn effectively. To support training, they even built a new, large dataset of images with text in them, called UM-DATA-200K.
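To make the "combine different pieces of information" idea concrete, here is a minimal sketch of how condition embeddings might be fused using mixing weights derived from the instruction. All names, shapes, and the weighting scheme here are illustrative assumptions, not the paper's actual UM-Encoder implementation.

```python
import torch
import torch.nn as nn

class ConditionCombiner(nn.Module):
    """Hypothetical sketch of the UM-Encoder idea: fuse several condition
    embeddings (e.g. instruction, reference image, glyph/layout) using
    mixing weights predicted from the instruction embedding.

    Everything here (class name, shapes, softmax weighting) is an
    illustrative assumption, not the paper's implementation.
    """

    def __init__(self, dim: int, num_conditions: int):
        super().__init__()
        # Stand-in for the VLM deciding how strongly each condition
        # should influence generation for this particular instruction.
        self.weight_head = nn.Linear(dim, num_conditions)

    def forward(self, instruction_emb: torch.Tensor,
                condition_embs: torch.Tensor) -> torch.Tensor:
        # instruction_emb: (batch, dim)
        # condition_embs:  (batch, num_conditions, dim)
        weights = torch.softmax(self.weight_head(instruction_emb), dim=-1)
        # Weighted sum over the condition axis -> one fused embedding
        # that a downstream image generator could condition on.
        fused = (weights.unsqueeze(-1) * condition_embs).sum(dim=1)
        return fused
```

The point of the sketch is the design choice: instead of the user hand-specifying how much each attribute matters, the instruction itself determines the mixture.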

Why it matters?

This work is important because it makes image editing much more intuitive. Instead of needing technical skills to change text in an image, you can simply tell the computer what you want, and it will try to do it in a way that looks natural and professional. This has potential applications in graphic design, social media editing, and many other areas where visual communication is key.

Abstract

With the rapid advancement of image generation, visual text editing using natural language instructions has received increasing attention. The main challenge of this task is to fully understand the instruction and reference image, and thus generate visual text that is style-consistent with the image. Previous methods often involve complex steps of specifying the text content and attributes, such as font size, color, and layout, without considering the stylistic consistency with the reference image. To address this, we propose UM-Text, a unified multimodal model for context understanding and visual text editing by natural language instructions. Specifically, we introduce a Visual Language Model (VLM) to process the instruction and reference image, so that the text content and layout can be elaborately designed according to the context information. To generate an accurate and harmonious visual text image, we further propose the UM-Encoder to combine the embeddings of various condition information, where the combination is automatically configured by VLM according to the input instruction. During training, we propose a regional consistency loss to offer more effective supervision for glyph generation on both latent and RGB space, and design a tailored three-stage training strategy to further enhance model performance. In addition, we contribute the UM-DATA-200K, a large-scale visual text image dataset on diverse scenes for model training. Extensive qualitative and quantitative results on multiple public benchmarks demonstrate that our method achieves state-of-the-art performance.
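The abstract's "regional consistency loss" supervises glyph generation in both latent and RGB space. A plausible minimal form is a reconstruction loss restricted to the edited-text region; the masked MSE below is an assumption in that spirit, not the paper's exact formulation.

```python
import torch

def regional_consistency_loss(pred: torch.Tensor,
                              target: torch.Tensor,
                              region_mask: torch.Tensor,
                              eps: float = 1e-6) -> torch.Tensor:
    """Illustrative region-restricted reconstruction loss (masked MSE).

    This is a hedged sketch of the idea behind UM-Text's regional
    consistency loss, not its actual formulation.

    pred, target: (B, C, H, W) tensors -- latent features or RGB images.
    region_mask:  (B, 1, H, W) binary mask marking the text region.
    """
    # Squared error, zeroed outside the text region via broadcasting.
    diff = (pred - target) ** 2 * region_mask
    # Normalize by the number of supervised elements so the loss does
    # not depend on how large the edited region happens to be.
    return diff.sum() / (region_mask.sum() * pred.shape[1] + eps)
```

Per the abstract, such a term would be applied to both the latent representation and the decoded RGB image, so the glyphs get supervision at both levels.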