
DODO: Discrete OCR Diffusion Models

Sean Man, Roy Ganz, Roi Ronen, Shahar Tsiper, Shai Mazor, Niv Nayman

2026-02-24


Summary

This paper focuses on making Optical Character Recognition (OCR), the task of converting images of text into machine-readable text, much faster. Current methods are accurate but slow, especially on long documents.

What's the problem?

Existing systems use a method called autoregressive decoding, which generates text one token at a time. This works well, but it is slow because each token depends on the one before it, so every token requires its own forward pass through the model. The paper points out that OCR is different from open-ended text generation: there is usually only *one* correct answer, fully determined by the image. Prior attempts to use diffusion models, which can in principle generate many tokens at once, introduced errors and inconsistencies that clash with OCR's strict exact-match requirements.
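To make the bottleneck concrete, here is a minimal sketch (not the paper's code) of why autoregressive decoding is slow: the number of model calls grows linearly with the output length, since each token needs a fresh forward pass conditioned on everything generated so far. The `predict_next` "model" here is a hypothetical stand-in that deterministically emits the next character of a fixed string, mimicking OCR where the image dictates a unique output.

```python
def autoregressive_decode(predict_next, length):
    """Generate `length` tokens one at a time; returns (tokens, n_model_calls)."""
    tokens = []
    calls = 0
    for _ in range(length):
        tokens.append(predict_next(tokens))  # one forward pass per token
        calls += 1
    return tokens, calls

# Toy stand-in for the model: the "image" fully determines the output,
# so the predictor just reads off the next character of the target text.
TARGET = "hello world"
predict = lambda prefix: TARGET[len(prefix)]

out, calls = autoregressive_decode(predict, len(TARGET))
print("".join(out), calls)  # → hello world 11
```

For a real VLM each of those 11 calls is a full network forward pass, which is where the latency on long documents comes from.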

What's the solution?

The researchers developed a new system called DODO, built on a type of diffusion model called 'block discrete diffusion'. Instead of trying to generate the entire text in one global diffusion pass, it splits the text into smaller blocks and denoises the tokens within each block in parallel, moving through the blocks in order. This avoids the synchronization errors seen in global diffusion models while still allowing much faster processing than token-by-token decoding.
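The idea can be sketched as follows (an illustrative toy, not DODO's actual model): the sequence is split into fixed-size blocks, and within each block all masked positions are filled in parallel over a small number of refinement steps. Model calls then scale with the number of blocks times the steps per block, rather than with the sequence length. The one-step `denoise_step` "denoiser" here is a hypothetical stand-in that exploits OCR's determinism.

```python
MASK = "_"
TARGET = "hello world"

def denoise_step(prefix, block):
    """Toy denoiser: fills every masked slot in the block in one parallel call."""
    start = len(prefix)
    return [TARGET[start + i] if tok == MASK else tok
            for i, tok in enumerate(block)]

def block_diffusion_decode(length, block_size, steps_per_block=1):
    tokens, calls = [], 0
    for start in range(0, length, block_size):
        block = [MASK] * min(block_size, length - start)
        for _ in range(steps_per_block):
            block = denoise_step(tokens, block)  # parallel over the block
            calls += 1
        tokens.extend(block)  # commit the block, then move to the next one
    return tokens, calls

out, calls = block_diffusion_decode(len(TARGET), block_size=4)
print("".join(out), calls)  # → hello world 3
```

With blocks of 4 tokens, the 11-character output takes 3 model calls instead of 11, which is the kind of speedup the block decomposition is after; committing each block before starting the next is what keeps neighboring text consistent.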

Why it matters?

This research is important because it significantly speeds up OCR without sacrificing accuracy. Being able to quickly and accurately digitize documents has huge implications for archiving, searching through old records, and making information more accessible. DODO runs up to three times faster than current autoregressive methods, which is a substantial improvement.

Abstract

Optical Character Recognition (OCR) is a fundamental task for digitizing information, serving as a critical bridge between visual data and textual understanding. While modern Vision-Language Models (VLMs) have achieved high accuracy in this domain, they predominantly rely on autoregressive decoding, which becomes computationally expensive and slow for long documents as it requires a sequential forward pass for every generated token. We identify a key opportunity to overcome this bottleneck: unlike open-ended generation, OCR is a highly deterministic task where the visual input strictly dictates a unique output sequence, theoretically enabling efficient, parallel decoding via diffusion models. However, we show that existing masked diffusion models fail to harness this potential; they introduce structural instabilities that are benign in flexible tasks, like captioning, but catastrophic for the rigid, exact-match requirements of OCR. To bridge this gap, we introduce DODO, the first VLM to utilize block discrete diffusion and unlock its speedup potential for OCR. By decomposing generation into blocks, DODO mitigates the synchronization errors of global diffusion. Empirically, our method achieves near state-of-the-art accuracy while enabling up to 3x faster inference compared to autoregressive baselines.