
OfficeQA Pro: An Enterprise Benchmark for End-to-End Grounded Reasoning

Krista Opsahl-Ong, Arnav Singhvi, Jasmine Collins, Ivan Zhou, Cindy Wang, Ashutosh Baheti, Owen Oertell, Jacob Portes, Sam Havens, Erich Elsen, Michael Bendersky, Matei Zaharia, Xing Chen

2026-03-10


Summary

This paper introduces a challenging new benchmark, called OfficeQA Pro, designed to see how well AI systems can understand and reason over a huge collection of real-world documents – specifically, nearly a century's worth of U.S. Treasury Bulletins, amounting to roughly 89,000 pages.

What's the problem?

Current AI models, even the most advanced ones like Claude, GPT, and Gemini, struggle with tasks that require them to deeply understand complex documents and pull out specific information. They often rely on what they’ve already been trained on, which isn’t enough when dealing with specialized or constantly updated information: on this benchmark, frontier models answer fewer than 5% of the questions correctly from memory alone, and fewer than 12% even when given access to the web. They also have a hard time working with both the text *and* the tables within the documents.

What's the solution?

The researchers created OfficeQA Pro, a set of 133 challenging questions based on the Treasury Bulletins. They then tested different AI agents on these questions, both with and without access to the documents themselves. They also experimented with giving the agents a more organized, structured version of the documents, produced by Databricks' ai_parse_document tool, which improved average performance by about 16% in relative terms. Finally, they ran ablations to see how the choice of model, the way tables are represented, the retrieval strategy, and the amount of test-time compute affected the results.
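To make the setup concrete, here is a minimal sketch of how such an evaluation could be wired up. Nothing below comes from the paper's actual harness: the sample question, the run_agent stub, and the three context providers are all hypothetical placeholders, meant only to show the comparison between answering from memory, from the raw corpus, and from a parsed, structured version of the corpus.

```python
# Illustrative sketch only; every name here (Question, run_agent, accuracy, the
# sample question) is hypothetical and not taken from the paper or its harness.
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Question:
    text: str
    expected_answer: str


# Hypothetical stand-in for the benchmark's 133 questions.
QUESTIONS = [
    Question("Example question about a figure in a Treasury Bulletin table", "example answer"),
]


def run_agent(question: str, context: Optional[str] = None) -> str:
    """Placeholder agent call. A real harness would prompt an LLM, optionally
    including retrieved pages or parsed tables in `context`."""
    return "model output goes here"


def accuracy(label: str, get_context: Callable[[Question], Optional[str]]) -> float:
    """Score the same questions under one document-access condition."""
    correct = sum(
        q.expected_answer in run_agent(q.text, context=get_context(q))
        for q in QUESTIONS
    )
    score = correct / len(QUESTIONS)
    print(f"{label}: {score:.1%}")
    return score


# The three access conditions described above.
accuracy("parametric only", lambda q: None)                     # no documents at all
accuracy("raw corpus", lambda q: "retrieved page text")         # unprocessed pages
accuracy("parsed corpus", lambda q: "structured table output")  # e.g. ai_parse_document output
```

In a real harness each condition would differ only in what context the agent sees, so any accuracy gap can be attributed to document access rather than to the model itself.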

Why it matters?

This work is important because it shows that even the best AI systems aren’t yet reliable enough to handle complex, real-world reasoning tasks that require understanding large amounts of detailed information. It highlights the need for better ways to help AI systems process and understand documents, which is crucial for things like financial analysis, legal research, and other important applications where accuracy is essential.

Abstract

We introduce OfficeQA Pro, a benchmark for evaluating AI agents on grounded, multi-document reasoning over a large and heterogeneous document corpus. The corpus consists of U.S. Treasury Bulletins spanning nearly 100 years, comprising 89,000 pages and over 26 million numerical values. OfficeQA Pro consists of 133 questions that require precise document parsing, retrieval, and analytical reasoning across both unstructured text and tabular data. Frontier LLMs including Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro Preview achieve less than 5% accuracy on OfficeQA Pro when relying on parametric knowledge, and less than 12% with additional access to the web. When provided directly with the document corpus, frontier agents still struggle on over half of questions, scoring 34.1% on average. We find that providing agents with a structured document representation produced by Databricks' ai_parse_document yields a 16.1% average relative performance gain across agents. We conduct additional ablations to study the effects of model selection, table representation, retrieval strategy, and test-time scaling on performance. Despite these improvements, significant headroom remains before agents can be considered reliable at enterprise-grade grounded reasoning.
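As a rough illustration of how the reported numbers relate (assuming the 16.1% relative gain applies on top of the 34.1% average score; the paper's exact per-agent aggregation may differ):

```python
# Back-of-the-envelope arithmetic, not the paper's evaluation code.
baseline = 34.1        # average accuracy (%) of frontier agents on the raw corpus
relative_gain = 0.161  # reported relative gain from the ai_parse_document representation

improved = baseline * (1 + relative_gain)
print(f"{baseline:.1f}% -> {improved:.1f}%")  # about 39.6% under this assumption
```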