FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs

Yan Wang, Keyi Wang, Shanshan Yang, Jaisal Patel, Jeff Zhao, Fengran Mo, Xueqing Peng, Lingfei Qian, Jimin Huang, Guojun Xiong, Xiao-Yang Liu, Jian-Yun Nie

2025-10-14

Summary

This paper investigates how well current artificial intelligence, specifically large language models (LLMs), can handle the complex task of financial auditing. It points out that while these models are strong at ordinary unstructured text, they struggle with the specific structure and rules found in financial documents, such as filings that follow GAAP accounting standards and the XBRL reporting format.

What's the problem?

Financial auditing is hard to automate because the accounting rules (GAAP) are complicated and financial reports (XBRL filings) are organized in a very specific, layered way. Current AI models aren't designed to understand this structure or how different parts of a financial report relate to one another, which makes it difficult for them to verify the accuracy of financial information.

What's the solution?

The researchers created a new benchmark called FinAuditing. Built from real financial filings, it tests AI models on three key auditing skills: understanding the meaning of financial terms (FinSM, semantic consistency), checking whether related parts of a report agree with each other (FinRE, relational consistency), and verifying that reported numbers add up correctly across the report (FinMR, numerical consistency). They then evaluated 13 different AI models on this benchmark to see how well they performed.
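To make the numerical-consistency idea concrete, here is a minimal sketch of the kind of check FinMR targets: in XBRL, a calculation relationship says a parent fact (e.g., total assets) should equal the weighted sum of its child facts. The fact names, values, and the `check_consistency` helper below are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative facts from a hypothetical filing (values in millions)
facts = {
    "Assets": 150.0,
    "AssetsCurrent": 60.0,
    "AssetsNoncurrent": 90.0,
}

# (parent, [(child, weight), ...]) pairs, as in an XBRL calculation linkbase
calc_rules = [
    ("Assets", [("AssetsCurrent", 1.0), ("AssetsNoncurrent", 1.0)]),
]

def check_consistency(facts, rules, tol=1e-6):
    """Return the parents whose reported value disagrees with the
    weighted sum of their children."""
    violations = []
    for parent, children in rules:
        expected = sum(facts[child] * weight for child, weight in children)
        if abs(facts[parent] - expected) > tol:
            violations.append(parent)
    return violations

print(check_consistency(facts, calc_rules))  # → [] (the totals agree)
```

An auditing model must effectively perform this kind of arithmetic across many interlinked documents, which is where the paper reports the largest accuracy drops.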

Why it matters?

The findings show that even the best AI models falter at financial reasoning once the specific rules and hierarchical structure of financial reports come into play. This research highlights the need for AI systems designed specifically for finance that can audit financial information reliably and accurately, ultimately leading to more trustworthy financial systems.

Abstract

The complexity of the Generally Accepted Accounting Principles (GAAP) and the hierarchical structure of eXtensible Business Reporting Language (XBRL) filings make financial auditing increasingly difficult to automate and verify. While large language models (LLMs) have demonstrated strong capabilities in unstructured text understanding, their ability to reason over structured, interdependent, and taxonomy-driven financial documents remains largely unexplored. To fill this gap, we introduce FinAuditing, the first taxonomy-aligned, structure-aware, multi-document benchmark for evaluating LLMs on financial auditing tasks. Built from real US-GAAP-compliant XBRL filings, FinAuditing defines three complementary subtasks, FinSM for semantic consistency, FinRE for relational consistency, and FinMR for numerical consistency, each targeting a distinct aspect of structured auditing reasoning. We further propose a unified evaluation framework integrating retrieval, classification, and reasoning metrics across these subtasks. Extensive zero-shot experiments on 13 state-of-the-art LLMs reveal that current models perform inconsistently across semantic, relational, and mathematical dimensions, with accuracy drops of up to 60-90% when reasoning over hierarchical multi-document structures. Our findings expose the systematic limitations of modern LLMs in taxonomy-grounded financial reasoning and establish FinAuditing as a foundation for developing trustworthy, structure-aware, and regulation-aligned financial intelligence systems. The benchmark dataset is available at Hugging Face.
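The abstract mentions a unified evaluation framework combining retrieval, classification, and reasoning metrics. As a hedged sketch of the retrieval side, a standard metric like recall@k measures how many of the truly relevant items a model surfaces in its top-k candidates; the function and example inputs below are generic illustrations, not the paper's exact metric definitions.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant set that appears in the top-k results."""
    if not relevant:
        return 0.0
    relevant = set(relevant)
    hits = sum(1 for item in retrieved[:k] if item in relevant)
    return hits / len(relevant)

# Illustrative: a model retrieves candidate taxonomy elements for a query fact
retrieved = ["us-gaap:Assets", "us-gaap:Liabilities", "us-gaap:AssetsCurrent"]
relevant = {"us-gaap:Assets", "us-gaap:AssetsCurrent"}
print(recall_at_k(retrieved, relevant, 2))  # → 0.5
```

Scores like this for retrieval, combined with classification accuracy and reasoning correctness, give a single framework for comparing models across the three subtasks.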