
NVIDIA Nemotron Parse 1.1

Kateryna Chumachenko, Amala Sanjay Deshmukh, Jarno Seppanen, Ilia Karmanov, Chia-Chih Chen, Lukas Voegtle, Philipp Fischer, Marek Wawrzos, Saeid Motiian, Roman Ageev, Kedi Wu, Alexandre Milesi, Maryam Moosaei, Krzysztof Pawelec, Padmavathy Subramanian, Mehrzad Samadi, Xin Yu, Celina Dear, Sarah Stoddard, Jenna Diamond, Jesse Oliver, Leanna Chraghchian

2025-11-27


Summary

This paper introduces Nemotron-Parse-1.1, a new and improved AI model designed to understand and extract information from documents, including the text inside images, charts, and tables.

What's the problem?

Currently, getting computers to accurately 'read' documents – not just the text, but also things like tables, pictures, and the overall layout – is a difficult task. Existing models are often either too large and slow, or not accurate enough when dealing with complex documents packed with visual information.

What's the solution?

The researchers created Nemotron-Parse-1.1, which uses a type of artificial intelligence architecture called an encoder-decoder. It is relatively small in terms of the number of parameters it uses (885 million), making it efficient, but still very capable. It can recognize text within images, produce output with markdown formatting, accurately extract data from tables and charts, and report where each piece of text sits on the page along with its semantic class. They also created a faster version, Nemotron-Parse-1.1-TC, which processes images with fewer vision tokens and works almost as well while being about 20% quicker.

Why it matters?

This model is important because it provides a strong, lightweight solution for document understanding. By releasing the model weights and a subset of its training data publicly, other researchers and developers can build upon this work to create even better tools for tasks like automated data entry, document analysis, and making information more accessible.

Abstract

We introduce Nemotron-Parse-1.1, a lightweight document parsing and OCR model that advances the capabilities of its predecessor, Nemoretriever-Parse-1.0. Nemotron-Parse-1.1 delivers improved capabilities across general OCR, markdown formatting, structured table parsing, and text extraction from pictures, charts, and diagrams. It also supports a longer output sequence length for visually dense documents. As with its predecessor, it extracts bounding boxes of text segments, as well as corresponding semantic classes. Nemotron-Parse-1.1 follows an encoder-decoder architecture with 885M parameters, including a compact 256M-parameter language decoder. It achieves competitive accuracy on public benchmarks, making it a strong lightweight OCR solution. We release the model weights publicly on Huggingface, as well as an optimized NIM container, along with a subset of the training data as part of the broader Nemotron-VLM-v2 dataset. Additionally, we release Nemotron-Parse-1.1-TC which operates on a reduced vision token length, offering a 20% speed improvement with minimal quality degradation.
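
For readers who want to experiment with the released checkpoint, below is a minimal sketch of how document parsing might be invoked through the Hugging Face transformers library. The repository id, the "<ocr>" task prompt, and the exact processor interface are illustrative assumptions, not details confirmed by the paper; consult the model card for the actual usage.

# Minimal usage sketch (model id, prompt, and interface are assumptions for illustration).
from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "nvidia/nemotron-parse-1.1"  # hypothetical Hugging Face repo id

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval()

# Load a document page image and build the model inputs.
image = Image.open("page.png").convert("RGB")
inputs = processor(images=image, text="<ocr>", return_tensors="pt")  # "<ocr>" is an assumed task prompt

# Generate the structured transcription (markdown text, tables,
# and bounding boxes with semantic classes for each text segment).
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=4096)

print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])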