LongCat-Image Technical Report

Meituan LongCat Team, Hanghang Ma, Haoxian Tan, Jiale Huang, Junqiang Wu, Jun-Yan He, Lishuai Gao, Songlin Xiao, Xiaoming Wei, Xiaoqi Ma, Xunliang Cai, Yayong Guan, Jie Hu

2025-12-09

Summary

This paper introduces LongCat-Image, a new open-source image generation model designed to understand prompts in both Chinese and English and turn them into realistic images, while remaining small and open enough for others to use and build upon.

What's the problem?

Current image generation models struggle with a few key things: accurately rendering text within images, producing truly photorealistic pictures, running efficiently without very expensive hardware, and being openly available for anyone to experiment with and improve. In particular, existing models often fail when rendering Chinese characters, especially complex or rare ones, and many are so large that they require a lot of computing power.

What's the solution?

The creators of LongCat-Image tackled these problems by carefully selecting and organizing the training data, focusing on quality at every stage, from pre-training through mid-training and supervised fine-tuning to reinforcement learning with curated reward models. They also chose a smaller, more efficient design: the core diffusion model has only 6 billion parameters, compared to the roughly 20 billion or more used by others, which makes it faster and cheaper to run. Finally, they made the model exceptionally good at rendering Chinese characters, even rare ones, and released the entire training process and the model itself publicly.
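
Because the team emphasizes easy deployment, here is a minimal sketch of what running a released text-to-image checkpoint could look like with Hugging Face diffusers. The repository id and diffusers compatibility below are assumptions for illustration, not details confirmed by the report; check the official release for the actual identifiers.

    # Minimal sketch, assuming the checkpoint is published in a
    # diffusers-compatible format. The repo id below is hypothetical.
    import torch
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained(
        "meituan-longcat/LongCat-Image",  # hypothetical repo id
        torch_dtype=torch.bfloat16,       # 6B weights fit comfortably in bf16
    )
    pipe.to("cuda")

    # Bilingual prompting is the headline feature, so a Chinese prompt
    # makes a natural smoke test.
    image = pipe(prompt="一只戴着眼镜的橘猫在书店看书, photorealistic").images[0]
    image.save("longcat_sample.png")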

Why it matters?

LongCat-Image is important because it sets a new standard for generating images from text, especially when it comes to Chinese language support. Its smaller size makes it more accessible to researchers and developers who don't have access to massive computing resources, and the open-source nature of the project encourages collaboration and further innovation in the field of image generation and editing.

Abstract

We introduce LongCat-Image, a pioneering open-source and bilingual (Chinese-English) foundation model for image generation, designed to address core challenges in multilingual text rendering, photorealism, deployment efficiency, and developer accessibility prevalent in current leading models. 1) We achieve this through rigorous data curation strategies across the pre-training, mid-training, and SFT stages, complemented by the coordinated use of curated reward models during the RL phase. This strategy establishes the model as a new state-of-the-art (SOTA), delivering superior text-rendering capabilities and remarkable photorealism while significantly enhancing aesthetic quality. 2) Notably, it sets a new industry standard for Chinese character rendering. By supporting even complex and rare characters, it outperforms both major open-source and commercial solutions in coverage, while also achieving superior accuracy. 3) The model achieves remarkable efficiency through its compact design. With a core diffusion model of only 6B parameters, it is significantly smaller than the nearly 20B-parameter or larger Mixture-of-Experts (MoE) architectures common in the field. This ensures minimal VRAM usage and rapid inference, significantly reducing deployment costs. Beyond generation, LongCat-Image also excels in image editing, achieving SOTA results on standard benchmarks with superior editing consistency compared to other open-source works. 4) To fully empower the community, we have established the most comprehensive open-source ecosystem to date. We are releasing not only multiple model versions for text-to-image and image editing, including checkpoints after the mid-training and post-training stages, but also the entire toolchain for the training procedure. We believe that the openness of LongCat-Image will provide robust support for developers and researchers, pushing the frontiers of visual content creation.
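
To make the efficiency claim concrete, a rough back-of-envelope estimate of weight memory is sketched below. The numbers are illustrative assumptions (bf16 weights, parameter storage only, ignoring activations, the text encoder, and framework overhead), not figures reported by the paper.

    # Back-of-envelope VRAM estimate for model weights alone, assuming
    # bf16 storage (2 bytes per parameter). Activations, the text
    # encoder/VAE, and framework overhead are deliberately ignored.
    def weight_vram_gib(n_params: float, bytes_per_param: int = 2) -> float:
        return n_params * bytes_per_param / 2**30

    for name, n in [("LongCat-Image core (6B)", 6e9),
                    ("typical ~20B MoE", 20e9)]:
        print(f"{name}: ~{weight_vram_gib(n):.1f} GiB in bf16")
    # -> ~11.2 GiB vs ~37.3 GiB, which is roughly why a 6B diffusion core
    #    fits on a single consumer GPU while 20B-class models often don't.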