BizFinBench.v2: A Unified Dual-Mode Bilingual Benchmark for Expert-Level Financial Capability Alignment
Xin Guo, Rongjunchen Zhang, Guilong Lu, Xuntao Guo, Shuai Jia, Zhi Yang, Liwen Zhang
2026-01-12
Summary
This paper introduces a new way to test how well large language models, like ChatGPT, can handle real-world financial tasks. It highlights that current tests aren't realistic enough to predict how these models will actually perform when used by financial professionals.
What's the problem?
Existing tests for large language models in finance rely too heavily on simulated or general-purpose data and only cover static situations that don't change over time. This means a model might score well on a test but still struggle with the fast-paced, authentic demands of actual financial work, like responding to investor questions or analyzing market trends as they happen. There's a disconnect between how well these models *seem* to do and how well they *actually* do.
What's the solution?
The researchers created a benchmark called BizFinBench.v2, built on real financial data from both the Chinese and U.S. stock markets and including tests that simulate live, online interactions. They clustered a large volume of actual questions asked by users on financial platforms, grouping them into eight main tasks and two online tasks across four core business scenarios, for a total of 29,578 question-and-answer pairs. They then tested several language models, including ChatGPT-5 and DeepSeek-R1, on this benchmark to see how they performed.
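To make the evaluation setup concrete, here is a minimal sketch of how accuracy on a Q&A benchmark like this might be computed. The data format, the `evaluate` function, and the dummy model are illustrative assumptions, not the benchmark's actual schema or the authors' code.

```python
# Hedged sketch (not the authors' code): scoring a model on Q&A pairs
# by exact-match accuracy. Real benchmarks typically use richer scoring,
# but the structure is the same: predict, compare, aggregate.

def evaluate(qa_pairs, model):
    """Return the fraction of questions the model answers correctly."""
    correct = 0
    for item in qa_pairs:
        prediction = model(item["question"])
        if prediction.strip().lower() == item["answer"].strip().lower():
            correct += 1
    return correct / len(qa_pairs)

# Toy data and a dummy model that always answers "up" (both hypothetical).
sample = [
    {"question": "Did AAPL close higher today?", "answer": "up"},
    {"question": "Did TSLA close higher today?", "answer": "down"},
]
dummy_model = lambda q: "up"
print(evaluate(sample, dummy_model))  # → 0.5
```

A reported figure like ChatGPT-5's 61.5% main-task accuracy is the output of this kind of aggregation over the full set of expert-level Q&A pairs.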
Why it matters?
BizFinBench.v2 is important because it provides a much more accurate way to evaluate large language models for use in the financial industry. It goes beyond simple testing and looks at how these models handle the complexities of real-world financial scenarios. This will help developers improve these models and ensure they are truly useful for financial professionals, ultimately leading to better and more reliable financial services.
Abstract
Large language models have undergone rapid evolution, emerging as a pivotal technology for intelligent financial operations. However, existing benchmarks are often undermined by pitfalls such as reliance on simulated or general-purpose samples and a focus on singular, static offline scenarios. Consequently, they fail to align with the requirements for authenticity and real-time responsiveness in financial services, leading to a significant discrepancy between benchmark performance and actual operational efficacy. To address this, we introduce BizFinBench.v2, the first large-scale evaluation benchmark grounded in authentic business data from both Chinese and U.S. equity markets, integrating online assessment. We performed clustering analysis on authentic user queries from financial platforms, yielding eight fundamental tasks and two online tasks across four core business scenarios, totaling 29,578 expert-level Q&A pairs. Experimental results demonstrate that ChatGPT-5 achieves a prominent 61.5% accuracy on the main tasks, though a substantial gap relative to financial experts persists; on the online tasks, DeepSeek-R1 outperforms all other commercial LLMs. Error analysis further identifies the specific capability deficiencies of existing models within practical financial business contexts. BizFinBench.v2 transcends the limitations of current benchmarks, achieving a business-level deconstruction of LLM financial capabilities and providing a precise basis for evaluating efficacy in the widespread deployment of LLMs within the financial domain. The data and code are available at https://github.com/HiThink-Research/BizFinBench.v2.
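The clustering analysis the abstract describes can be sketched in miniature. This is an assumption about the general approach, not the paper's actual pipeline: it groups toy user queries with TF-IDF features and k-means via scikit-learn, whereas the real analysis would operate on far larger query logs and likely different features.

```python
# Hedged sketch (assumed pipeline, not the paper's): clustering user
# queries into task-like groups with TF-IDF + k-means.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical queries standing in for real financial-platform logs.
queries = [
    "What is the price target for AAPL?",
    "Price target for MSFT next quarter?",
    "Explain the latest Fed rate decision.",
    "How does a rate hike affect bond yields?",
]

vectors = TfidfVectorizer().fit_transform(queries)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)  # one cluster id per query
```

In the paper's setting, the resulting clusters were distilled into eight fundamental tasks and two online tasks spanning four core business scenarios.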