HunyuanOCR Technical Report
Hunyuan Vision Team, Pengyuan Lyu, Xingyu Wan, Gengluo Li, Shangpin Peng, Weinong Wang, Liang Wu, Huawen Shen, Yu Zhou, Canhui Tang, Qi Yang, Qiming Peng, Bin Luo, Hower Yang, Houwen Peng, Hongming Yang, Senhao Xie, Binghong Wu, Mana Yang, Sergey Wang, Raccoon Liu, Dick Zhu
2025-11-26
Summary
This paper introduces HunyuanOCR, a new and efficient artificial intelligence model designed specifically for Optical Character Recognition (OCR) – that is, turning images containing text into machine-readable text a computer can process and work with.
What's the problem?
Traditional OCR systems often rely on a series of separate steps, such as first analyzing the layout of the page and *then* recognizing the text, so errors made early on compound further down the pipeline. Existing models are also either narrowly specialized for OCR, lacking broader language understanding, or so large that they demand substantial computing power, making them impractical for many uses. What was missing was a model that could do everything well – recognize text, understand its meaning, and translate it – while remaining relatively small and fast.
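The compounding of errors across pipeline stages can be made concrete with a back-of-the-envelope calculation. The three stages and their 95% per-stage accuracies below are illustrative assumptions, not figures from the report:

```python
# Hypothetical per-stage accuracies for a three-stage OCR pipeline
# (layout analysis -> text detection -> text recognition).
stages = {"layout": 0.95, "detection": 0.95, "recognition": 0.95}

pipeline_accuracy = 1.0
for name, acc in stages.items():
    pipeline_accuracy *= acc  # an input must survive every stage

# Even with strong individual stages, end-to-end accuracy drops:
print(f"{pipeline_accuracy:.3f}")  # 0.857
```

A single end-to-end model avoids this multiplicative penalty because there are no intermediate hand-offs where a mistake becomes unrecoverable.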
What's the solution?
The researchers created HunyuanOCR, which connects a visual processing component (a Vision Transformer, or ViT) to a lightweight language model (LLM) through an MLP adapter. This lets the model read an image and extract or interpret the text within it in a single pass, without the separate pre-processing steps of traditional pipelines. They also relied heavily on high-quality training data and applied a technique called Reinforcement Learning to further improve accuracy. Finally, they made the model freely available for others to use and provided tools to make it run efficiently.
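The ViT–adapter–LLM wiring described above can be sketched schematically. This is a minimal illustration only: the layer sizes, the two-layer GELU MLP, and the patch count are assumptions for the sketch, not HunyuanOCR's actual configuration.

```python
import numpy as np

# Illustrative dimensions (assumptions, not HunyuanOCR's real config)
VIT_DIM, HIDDEN, LLM_DIM = 1024, 2048, 2048
rng = np.random.default_rng(0)

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

class MLPAdapter:
    """Projects ViT patch features into the LLM's embedding space."""
    def __init__(self, d_in, d_hidden, d_out):
        self.w1 = rng.standard_normal((d_in, d_hidden)) * 0.02
        self.w2 = rng.standard_normal((d_hidden, d_out)) * 0.02

    def __call__(self, patches):
        return gelu(patches @ self.w1) @ self.w2

# A (num_patches, VIT_DIM) feature map, standing in for real ViT output
vit_features = rng.standard_normal((256, VIT_DIM))
adapter = MLPAdapter(VIT_DIM, HIDDEN, LLM_DIM)
visual_tokens = adapter(vit_features)  # shape: (256, LLM_DIM)

# In a VLM of this shape, these visual tokens are concatenated with the
# text-prompt embeddings and fed to the LLM as one sequence: the
# "single pass" described above, with no separate layout-analysis stage.
print(visual_tokens.shape)
```

The adapter is the only glue between the two pretrained components, which is part of why such models stay lightweight: it adds just one small projection on top of the ViT and LLM.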
Why it matters?
HunyuanOCR is important because it sets a new standard for OCR performance, even beating commercial services and larger models, while being much smaller and faster. This means it can be used in more places, like on phones or in situations where powerful computers aren't available. By combining different capabilities into one model and making it open-source, the researchers hope to encourage further innovation in both research and real-world applications of OCR technology.
Abstract
This paper presents HunyuanOCR, a commercial-grade, open-source, and lightweight (1B parameters) Vision-Language Model (VLM) dedicated to OCR tasks. The architecture comprises a Native Vision Transformer (ViT) and a lightweight LLM connected via an MLP adapter. HunyuanOCR demonstrates superior performance, outperforming commercial APIs, traditional pipelines, and larger models (e.g., Qwen3-VL-4B). Specifically, it surpasses current public solutions in perception tasks (Text Spotting, Parsing) and excels in semantic tasks (IE, Text Image Translation), securing first place in the ICDAR 2025 DIMT Challenge (Small Model Track). Furthermore, it achieves state-of-the-art (SOTA) results on OCRBench among VLMs with fewer than 3B parameters. HunyuanOCR achieves breakthroughs in three key aspects: 1) Unifying Versatility and Efficiency: We implement comprehensive support for core capabilities including spotting, parsing, IE, VQA, and translation within a lightweight framework. This addresses the limitations of narrow "OCR expert models" and inefficient "General VLMs". 2) Streamlined End-to-End Architecture: Adopting a pure end-to-end paradigm eliminates dependencies on pre-processing modules (e.g., layout analysis). This fundamentally resolves error propagation common in traditional pipelines and simplifies system deployment. 3) Data-Driven and RL Strategies: We confirm the critical role of high-quality data and, for the first time in the industry, demonstrate that Reinforcement Learning (RL) strategies yield significant performance gains in OCR tasks. HunyuanOCR is officially open-sourced on HuggingFace. We also provide a high-performance deployment solution based on vLLM, placing its production efficiency in the top tier. We hope this model will advance frontier research and provide a solid foundation for industrial applications.