
Glyph-ByT5-v2: A Strong Aesthetic Baseline for Accurate Multilingual Visual Text Rendering

Zeyu Liu, Weicong Liang, Yiming Zhao, Bohan Chen, Ji Li, Yuhui Yuan

2024-06-17


Summary

This paper presents Glyph-ByT5-v2, a new model designed to improve how text is visually rendered in multiple languages. It aims to produce accurate and visually appealing text in generated graphic design images and other applications.

What's the problem?

The original Glyph-ByT5 model could render text accurately, but it only worked for English and its results were not very aesthetically pleasing. This limited its usefulness in a world where many languages are used and where visual appeal is important for things like graphic design and marketing.

What's the solution?

To solve these issues, the authors developed Glyph-ByT5-v2 along with a companion generation model called Glyph-SDXL-v2. They created a large dataset with more than 1 million glyph-text pairs (glyphs are the visual forms of characters and words) and about 10 million graphic design image-text pairs, covering ten different languages. They also built a benchmark of 1,000 prompts, 100 per language, to measure how accurately the models spell text visually in each language. Finally, they applied a step-aware preference learning approach to improve the aesthetic quality of the rendered designs.
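To make the benchmark idea concrete, here is a minimal sketch of how per-language visual spelling accuracy could be scored. The benchmark structure (1,000 prompts, 100 per language) comes from the paper, but the scoring rule, the assumption of an OCR pass over the rendered images, and all function names below are illustrative guesses, not the authors' released evaluation code.

```python
# Sketch: score OCR output against target text, aggregated per language.
# The word-level matching rule is an assumption for illustration only.
from collections import defaultdict

def word_accuracy(target: str, recognized: str) -> float:
    """Fraction of target words that an OCR pass recovered exactly, in order."""
    target_words = target.split()
    recognized_words = recognized.split()
    if not target_words:
        return 1.0
    hits = sum(1 for t, r in zip(target_words, recognized_words) if t == r)
    return hits / len(target_words)

def benchmark_accuracy(samples):
    """samples: iterable of (language, target_text, ocr_text) triples,
    e.g. one triple per rendered prompt after running OCR on the image."""
    per_language = defaultdict(list)
    for language, target, ocr_text in samples:
        per_language[language].append(word_accuracy(target, ocr_text))
    return {lang: sum(s) / len(s) for lang, s in per_language.items()}

# Toy usage with made-up OCR output (not real benchmark data):
samples = [
    ("en", "summer sale today", "summer sale today"),
    ("en", "grand opening", "grand openlng"),  # one misread word
    ("ja", "本日限定セール", "本日限定セール"),   # treated as one unit; real
                                               # evaluation would need a
                                               # language-aware tokenizer
]
print(benchmark_accuracy(samples))  # {'en': 0.75, 'ja': 1.0}
```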

Why it matters?

This research is significant because it enhances the ability of AI models to generate beautiful and accurate text in multiple languages. This improvement is crucial for various applications, such as digital art, educational tools, and translation services. By addressing the limitations of previous models, Glyph-ByT5-v2 sets a new standard for multilingual visual text rendering, helping to make AI more effective in diverse linguistic contexts.

Abstract

Recently, Glyph-ByT5 has achieved highly accurate visual text rendering performance in graphic design images. However, it still focuses solely on English and performs relatively poorly in terms of visual appeal. In this work, we address these two fundamental limitations by presenting Glyph-ByT5-v2 and Glyph-SDXL-v2, which not only support accurate visual text rendering for 10 different languages but also achieve much better aesthetic quality. To achieve this, we make the following contributions: (i) creating a high-quality multilingual glyph-text and graphic design dataset consisting of more than 1 million glyph-text pairs and 10 million graphic design image-text pairs covering nine other languages, (ii) building a multilingual visual paragraph benchmark consisting of 1,000 prompts, with 100 for each language, to assess multilingual visual spelling accuracy, and (iii) leveraging the latest step-aware preference learning approach to enhance the visual aesthetic quality. With the combination of these techniques, we deliver a powerful customized multilingual text encoder, Glyph-ByT5-v2, and a strong aesthetic graphic generation model, Glyph-SDXL-v2, that can support accurate spelling in 10 different languages. We perceive our work as a significant advancement, considering that the latest DALL-E3 and Ideogram 1.0 still struggle with the multilingual visual text rendering task.
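The abstract's third contribution, step-aware preference learning, can be pictured with a short sketch. The code below shows a generic DPO-style diffusion preference loss with a step-dependent weight; it is an assumption-level illustration of the general technique, not the paper's exact objective, and the weighting schedule, tensor names, and hyperparameters are hypothetical.

```python
# Sketch of a DPO-style diffusion preference loss with step-dependent weighting.
# Illustrative only; not the Glyph-SDXL-v2 training objective.
import torch
import torch.nn.functional as F

def step_aware_preference_loss(err_win, err_win_ref, err_lose, err_lose_ref,
                               timesteps, num_train_steps=1000, beta=2000.0):
    """
    err_*: per-sample denoising errors ||eps - eps_theta(x_t, t)||^2 for the
    preferred ("win") and dispreferred ("lose") images, under the trained
    model and a frozen reference model.
    timesteps: diffusion step t at which each preference pair was compared.
    """
    # How much better the trained model explains the preferred sample than
    # the dispreferred one, relative to the reference model.
    inner = (err_win - err_win_ref) - (err_lose - err_lose_ref)

    # Hypothetical step-aware weighting: emphasize later (low-noise) steps,
    # where details like layout and typography are decided.
    w = 1.0 - timesteps.float() / num_train_steps

    return (w * -F.logsigmoid(-beta * inner)).mean()

# Toy usage with random stand-in errors for a batch of 4 preference pairs.
b = 4
loss = step_aware_preference_loss(
    torch.rand(b), torch.rand(b), torch.rand(b), torch.rand(b),
    timesteps=torch.randint(0, 1000, (b,)),
)
print(loss)
```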