Visual Text Generation in the Wild

Yuanzhi Zhu, Jiawei Liu, Feiyu Gao, Wenyu Liu, Xinggang Wang, Peng Wang, Fei Huang, Cong Yao, Zhibo Yang

2024-07-22

Summary

This paper introduces SceneVTG, a new method for generating high-quality text images in real-world scenes. It focuses on producing text that is photo-realistic, fits naturally into its background, and can be reliably detected and recognized by machines.

What's the problem?

Generating text images that look realistic and fit naturally into various scenes is challenging. Existing methods often struggle to meet three key criteria: the images must be photo-realistic (fidelity), the text must make sense in the context of the scene (reasonability), and the generated images should be useful for tasks like detecting and recognizing text (utility). Many current techniques either focus on one of these aspects or fail to combine them effectively, limiting their practical applications.

What's the solution?

The authors propose SceneVTG, a visual text generator that uses a two-stage process to improve text image generation. First, it employs a Multimodal Large Language Model to identify suitable areas for text and suggest appropriate content for those areas. Then, it uses a conditional diffusion model to create the actual text images based on this information. This approach allows SceneVTG to produce text images that are not only visually appealing but also contextually relevant and useful for further analysis.
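The two-stage paradigm can be illustrated with a minimal sketch. This is not the authors' actual implementation; all class and method names below are hypothetical stand-ins for the MLLM-based region recommender (stage 1) and the conditional diffusion renderer (stage 2).

```python
from dataclasses import dataclass

# Hypothetical sketch of SceneVTG's two-stage paradigm.
# All names here are illustrative, not the paper's actual API.

@dataclass
class TextRegion:
    bbox: tuple   # (x, y, w, h) of a region recommended for text
    content: str  # text content suggested for that region

class RegionRecommender:
    """Stage 1 stand-in: the Multimodal LLM that proposes reasonable
    text regions and contents for a given scene."""
    def propose(self, scene_description: str) -> list:
        # A real system would query an MLLM with the scene image;
        # here we return a fixed placeholder proposal.
        return [TextRegion(bbox=(40, 20, 200, 50),
                           content=f"SALE at {scene_description}")]

class ConditionalRenderer:
    """Stage 2 stand-in: the conditional diffusion model that renders
    text images conditioned on the recommended regions and contents."""
    def render(self, scene_description: str, regions: list) -> dict:
        # Real rendering would run diffusion conditioned on region
        # masks and text; here we just package the conditions.
        return {"scene": scene_description,
                "overlays": [(r.bbox, r.content) for r in regions]}

def generate_text_image(scene_description: str) -> dict:
    """Chain the two stages: recommend regions, then render on them."""
    regions = RegionRecommender().propose(scene_description)
    return ConditionalRenderer().render(scene_description, regions)
```

The key design point the sketch captures is the separation of concerns: stage 1 decides *where* and *what* to write (reasonability), while stage 2 decides *how* it should look (fidelity), with stage 1's output serving as the conditioning signal for stage 2.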

Why it matters?

This research is important because it advances the field of visual text generation, making it possible to create more realistic and functional text images. This can have various applications, such as improving automated systems for reading signs, enhancing augmented reality experiences, and facilitating better interaction between humans and machines in visual contexts.

Abstract

Recently, with the rapid advancements of generative models, the field of visual text generation has witnessed significant progress. However, it is still challenging to render high-quality text images in real-world scenarios, as three critical criteria should be satisfied: (1) Fidelity: the generated text images should be photo-realistic and the contents are expected to be the same as specified in the given conditions; (2) Reasonability: the regions and contents of the generated text should cohere with the scene; (3) Utility: the generated text images can facilitate related tasks (e.g., text detection and recognition). Upon investigation, we find that existing methods, either rendering-based or diffusion-based, can hardly meet all these aspects simultaneously, limiting their application range. Therefore, we propose in this paper a visual text generator (termed SceneVTG), which can produce high-quality text images in the wild. Following a two-stage paradigm, SceneVTG leverages a Multimodal Large Language Model to recommend reasonable text regions and contents across multiple scales and levels, which are used by a conditional diffusion model as conditions to generate text images. Extensive experiments demonstrate that the proposed SceneVTG significantly outperforms traditional rendering-based methods and recent diffusion-based methods in terms of fidelity and reasonability. Besides, the generated images provide superior utility for tasks involving text detection and text recognition. Code and datasets are available at AdvancedLiterateMachinery.