
DeepSeek-OCR: Contexts Optical Compression

Haoran Wei, Yaofeng Sun, Yukun Li

2025-10-22

Summary

This paper introduces DeepSeek-OCR, a new system designed to efficiently compress images of text, like scanned documents, into a format that can be easily processed by large language models (LLMs). It's a first step in exploring whether very long contexts can be handled more cheaply by representing them visually.

What's the problem?

Large language models are getting better at understanding text, but they struggle with very long inputs, especially when that information comes from images of text. Processing high-resolution images directly requires a lot of computing power and memory. The challenge is to represent these images in a compact way without losing too much important information, allowing LLMs to 'read' them effectively.

What's the solution?

The researchers created DeepSeek-OCR, which has two main parts: a 'DeepEncoder' that compresses the image into a smaller set of 'vision tokens', and a decoder that reconstructs the text from those tokens. The DeepEncoder is specifically designed to keep memory use low on high-resolution inputs while still preserving the important details. They tested it at different compression ratios, measured as the number of original text tokens divided by the number of vision tokens: at a 10x ratio (ten text tokens per vision token) the system recognized text with about 97% precision, and even at 20x it still reached roughly 60%.
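As a rough sketch of the trade-off described above (the precision numbers come from the paper's abstract; the function and variable names here are illustrative, not taken from the released code):

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Ratio of original text tokens to the vision tokens that replace them."""
    return text_tokens / vision_tokens

# Approximate decoding precision the paper reports at two ratios
# (ratio -> precision); intermediate values are not given here.
REPORTED_PRECISION = {10: 0.97, 20: 0.60}

# Example: a page of ~1000 text tokens encoded as 100 vision tokens
# gives a 10x ratio, where the paper reports ~97% OCR precision.
ratio = compression_ratio(1000, 100)
print(ratio)
```

The key idea is that the decoder's workload scales with vision tokens, not with the original text length, so a higher ratio directly reduces the cost of feeding the page to the model, at the price of recognition accuracy.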

Why it matters?

This work is important because it shows a promising way to handle long-context visual information. It could be useful for things like digitizing and searching through historical documents, or for improving how LLMs understand images and text together. Plus, it's a practical tool that can quickly generate large amounts of training data for other AI models, processing over 200,000 pages a day on a single A100-40G GPU.

Abstract

We present DeepSeek-OCR as an initial investigation into the feasibility of compressing long contexts via optical 2D mapping. DeepSeek-OCR consists of two components: DeepEncoder and DeepSeek3B-MoE-A570M as the decoder. Specifically, DeepEncoder serves as the core engine, designed to maintain low activations under high-resolution input while achieving high compression ratios to ensure an optimal and manageable number of vision tokens. Experiments show that when the number of text tokens is within 10 times that of vision tokens (i.e., a compression ratio < 10x), the model can achieve decoding (OCR) precision of 97%. Even at a compression ratio of 20x, the OCR accuracy still remains at about 60%. This shows considerable promise for research areas such as historical long-context compression and memory forgetting mechanisms in LLMs. Beyond this, DeepSeek-OCR also demonstrates high practical value. On OmniDocBench, it surpasses GOT-OCR2.0 (256 tokens/page) using only 100 vision tokens, and outperforms MinerU2.0 (6000+ tokens per page on average) while utilizing fewer than 800 vision tokens. In production, DeepSeek-OCR can generate training data for LLMs/VLMs at a scale of 200k+ pages per day (a single A100-40G). Codes and model weights are publicly accessible at http://github.com/deepseek-ai/DeepSeek-OCR.