
TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition

Junyuan Zhang, Bin Wang, Qintong Zhang, Fan Wu, Zichen Wen, Jialin Lu, Junjie Shan, Ziqi Zhao, Shuya Yang, Ziling Wang, Ziyang Miao, Huaping Zhong, Yuhang Zang, Xiaoyi Dong, Ka-Ho Chow, Conghui He

2025-12-03

Summary

This paper introduces a new method called TRivia for recognizing tables in images and converting them into semi-structured formats such as HTML or Markdown. It focuses on improving open-source table recognition models, which often trail the proprietary systems built by large companies because they are trained with far less labeled data.

What's the problem?

Recognizing tables in images is a key part of understanding documents, but it usually requires a lot of labeled data – images where someone has already identified the table structure. Getting this labeled data is expensive and time-consuming. Because of this, open-source table recognition models, which many people need to use due to privacy concerns or cost, aren't as good as the models created by companies with lots of resources.

What's the solution?

The researchers developed TRivia, a way to fine-tune table recognition models without any labeled data. An attention-guided module has the model generate diverse questions about each table image, and the model then checks whether its own recognition of the table lets it answer those questions correctly. This self-testing process supplies the reward signal for fine-tuning with Group Relative Policy Optimization, a reinforcement learning technique that scores each candidate output relative to a group of candidates for the same image; the same pipeline automatically identifies which unlabeled images are most helpful for learning.
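The loop above can be sketched in a few lines. This is a hypothetical illustration of the two core ideas, a question-answering reward and group-relative advantages, not the released TRivia code; `qa_reward`, `answer_from_table`, and the toy dict-based tables are stand-ins for the paper's VLM components.

```python
# Minimal sketch of a QA-based reward with GRPO-style group-relative
# advantages. All names here are illustrative, not the TRivia API.
from statistics import mean, pstdev


def answer_from_table(table, question):
    # Stand-in: answer a question from a parsed table. In TRivia, the
    # model itself reads its own HTML/Markdown recognition output.
    return table.get(question)


def qa_reward(recognized_table, qa_pairs):
    """Reward = fraction of auto-generated questions that can be
    answered correctly from the model's own recognition result."""
    if not qa_pairs:
        return 0.0
    correct = sum(
        1 for question, expected in qa_pairs
        if answer_from_table(recognized_table, question) == expected
    )
    return correct / len(qa_pairs)


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO normalizes each candidate's reward against its group:
    advantage_i = (r_i - mean(rewards)) / (std(rewards) + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy usage: three candidate recognitions of the same table image.
qa_pairs = [("revenue 2023?", "42"), ("row count?", "5")]
candidates = [
    {"revenue 2023?": "42", "row count?": "5"},  # both answers recoverable
    {"revenue 2023?": "42"},                     # one answer recoverable
    {},                                          # recognition too garbled
]
rewards = [qa_reward(c, qa_pairs) for c in candidates]
advantages = group_relative_advantages(rewards)
# rewards → [1.0, 0.5, 0.0]; the best candidate gets a positive
# advantage, the worst a negative one, with no labels ever used.
```

The key property is that the reward never references a ground-truth annotation: only the model's ability to answer its own questions from its own output is scored, which is what makes the pipeline self-supervised.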

Why it matters?

TRivia allows for the creation of powerful, open-source table recognition models that don't rely on expensive labeled data. This is important because it makes table recognition technology more accessible to everyone, especially those who can't use proprietary models due to privacy or budget limitations. The resulting model, TRivia-3B, actually performs better than some of the leading commercial systems on standard tests.

Abstract

Table recognition (TR) aims to transform table images into semi-structured representations such as HTML or Markdown. As a core component of document parsing, TR has long relied on supervised learning, with recent efforts dominated by fine-tuning vision-language models (VLMs) using labeled data. While VLMs have brought TR to the next level, pushing performance further demands large-scale labeled data that is costly to obtain. Consequently, although proprietary models have continuously pushed the performance boundary, open-source models, often trained with limited resources and, in practice, the only viable option for many due to privacy regulations, still lag far behind. To bridge this gap, we introduce TRivia, a self-supervised fine-tuning method that enables pretrained VLMs to learn TR directly from unlabeled table images in the wild. Built upon Group Relative Policy Optimization, TRivia automatically identifies unlabeled samples that most effectively facilitate learning and eliminates the need for human annotations through a question-answering-based reward mechanism. An attention-guided module generates diverse questions for each table image, and the ability to interpret the recognition results and answer them correctly provides feedback to optimize the TR model. This closed-loop process allows the TR model to autonomously learn to recognize, structure, and reason over tables without labeled data. Leveraging this pipeline, we present TRivia-3B, an open-sourced, compact, and state-of-the-art TR model that surpasses existing systems (e.g., Gemini 2.5 Pro, MinerU2.5) on three popular benchmarks. Model and code are released at: https://github.com/opendatalab/TRivia