FinForge: Semi-Synthetic Financial Benchmark Generation
Glenn Matlin, Akhil Theerthala, Anant Gupta, Anirudh JM, Rayan Castilla, Yi Mei Ng, Sudheer Chava
2026-01-13
Summary
This paper introduces FinForge, a pipeline for creating benchmarks that test how well language models understand and reason about finance. It addresses the scarcity of high-quality, finance-specific datasets for evaluating these models.
What's the problem?
Evaluating language models in specialized areas like finance is hard because there aren't many publicly available, high-quality datasets focused on financial topics. Existing tests are too general and don't accurately measure a model's ability to handle the complex reasoning and calculations needed for real-world financial tasks. In short, current benchmarks don't really test whether a model *actually* understands finance.
What's the solution?
The researchers built FinForge, a system that pairs human experts with a large language model (Gemini 2.5 Flash) to create financial questions and answers at scale. They started with a corpus of reliable financial documents, used the language model to generate candidate questions and answers from those documents, and then had human reviewers verify the accuracy of each pair. The result is FinForge-5k, a dataset of over 5,000 validated question-answer pairs covering 11 areas of finance.
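The generate-then-validate loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `build_benchmark`, the stub generator (standing in for the Gemini 2.5 Flash call), and the stub validator (standing in for human review) are all hypothetical names invented here.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class QAPair:
    question: str
    answer: str
    source_doc: str

def build_benchmark(
    documents: List[str],
    generate: Callable[[str], List[QAPair]],
    human_validate: Callable[[QAPair], bool],
) -> List[QAPair]:
    """Generate candidate QA pairs from each source document,
    then keep only those a reviewer accepts."""
    benchmark = []
    for doc in documents:
        for pair in generate(doc):
            if human_validate(pair):
                benchmark.append(pair)
    return benchmark

# Stub generator standing in for the LM call (Gemini 2.5 Flash in the paper).
def stub_generate(doc: str) -> List[QAPair]:
    return [QAPair(question="What does this filing state about revenue?",
                   answer="(model-proposed answer)", source_doc=doc)]

# Stub validator standing in for human review: reject empty fields.
def stub_validate(pair: QAPair) -> bool:
    return bool(pair.question and pair.answer)

pairs = build_benchmark(["doc-001"], stub_generate, stub_validate)
```

The key design point the sketch captures is that the LM only proposes pairs; nothing enters the benchmark without passing the validation gate.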
Why does it matter?
This work is important because it provides a better way to test and improve language models for financial applications. By revealing the strengths and weaknesses of current models on this new benchmark, it helps guide developers in building more reliable and capable AI systems for the finance industry. It’s a step towards ensuring these models can be trusted with important financial decisions.
Abstract
Evaluating Language Models (LMs) in specialized, high-stakes domains such as finance remains a significant challenge due to the scarcity of open, high-quality, and domain-specific datasets. Existing general-purpose benchmarks provide broad coverage but lack the depth and domain fidelity needed to assess LMs' capabilities for real-world financial reasoning, which requires both conceptual understanding and quantitative rigor. To address this gap, we introduce FinForge, a scalable, semi-synthetic pipeline for constructing finance-specific evaluation benchmarks through a hybrid of expert-guided data curation and controlled LM-based synthesis. FinForge combines manual and programmatic corpus construction from authoritative financial sources with structured question generation and validation using Gemini 2.5 Flash. To demonstrate the pipeline's efficacy, we produce FinForge-5k, a snapshot benchmark comprising over 5,000 human-validated question-answer pairs across 11 finance subdomains, derived from a curated corpus of 100,000 verified documents totaling 143M tokens. Evaluation of state-of-the-art open-source and closed-source models on FinForge-5k reveals significant differences in financial reasoning, with leading models achieving accuracy levels near 80%. These findings underscore the framework's utility for diagnosing current model limitations and guiding future improvements in financial domain competence. All code and data are available at https://github.com/gtfintechlab/FinForge.