FinLFQA: Evaluating Attributed Text Generation of LLMs in Financial Long-Form Question Answering
Yitao Long, Tiansheng Hu, Yilun Zhao, Arman Cohan, Chen Zhao
2025-10-09
Summary
This paper investigates how well large language models (LLMs) can answer complex, long-form questions in the field of finance, and, just as importantly, how well they can *back up* their answers with evidence and reasoning.
What's the problem?
LLMs are known to sometimes 'hallucinate,' meaning they confidently state things that aren't true. While simply providing sources for answers helps, this paper argues that in areas like finance, it's not enough to just point to a document. You need to show the *reasoning* behind the answer, including the specific data used, the calculations made, and the financial knowledge applied. Existing tests don't adequately assess this more complex type of 'attribution'.
What's the solution?
The researchers created a new benchmark called FinLFQA specifically for financial questions. It doesn't just check whether the answer is right, but also whether the LLM can clearly show where the supporting evidence comes from in financial reports, explain the steps of any calculations, and demonstrate understanding of relevant financial concepts. They also built an automatic framework that scores both the quality of the answer and the quality of its attribution, and used it to test eight different LLMs across several ways of generating answers and attributions.
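As a rough illustration (not the paper's actual data format), a FinLFQA-style example might bundle an answer with the three attribution types described above. The field names, numbers, and the `attribution_recall` helper below are hypothetical, a minimal sketch of what such a schema and metric could look like:

```python
# Hypothetical sketch of a FinLFQA-style example; field names are
# illustrative, not the benchmark's actual schema.
from dataclasses import dataclass


@dataclass
class AttributedAnswer:
    question: str                   # long-form financial question
    answer: str                     # the model's long-form answer
    evidence: list[str]             # supporting spans from the financial report
    reasoning_steps: list[str]      # intermediate numerical calculations
    financial_knowledge: list[str]  # domain concepts used in the reasoning


def attribution_recall(predicted: list[str], gold: list[str]) -> float:
    """Toy metric: fraction of gold attributions the model recovered (exact match)."""
    if not gold:
        return 1.0
    return sum(g in predicted for g in gold) / len(gold)


# Invented example values, purely for illustration.
example = AttributedAnswer(
    question="How did operating margin change from 2022 to 2023?",
    answer="Operating margin improved from 18.0% to 20.0%, driven by lower costs.",
    evidence=["Operating income was $900M on $5,000M of revenue in 2022 ..."],
    reasoning_steps=["2022: 900 / 5000 = 18.0%", "2023: 1100 / 5500 = 20.0%"],
    financial_knowledge=["Operating margin = operating income / revenue"],
)
print(attribution_recall(example.reasoning_steps, ["2022: 900 / 5000 = 18.0%"]))
```

A real evaluation framework would compare model outputs against human annotations for all three attribution types rather than this exact-match toy metric.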
Why does it matter?
This work is important because relying on incorrect information in finance can have serious consequences. Simply getting an answer isn't enough; you need to trust that the answer is based on solid evidence and sound reasoning. This research helps us understand how well LLMs can provide that level of trustworthiness and points the way towards building more reliable AI systems for financial applications.
Abstract
Large Language Models (LLMs) frequently hallucinate when answering long-form questions, producing plausible yet factually incorrect answers. A common mitigation strategy is to provide attribution to LLM outputs. However, existing benchmarks primarily focus on simple attribution that retrieves supporting textual evidence as references. We argue that in real-world scenarios such as financial applications, attribution goes beyond reference retrieval. We introduce FinLFQA, a benchmark designed to evaluate the ability of LLMs to generate long-form answers to complex financial questions with reliable and nuanced attributions. FinLFQA evaluates three critical aspects of attribution through human annotations: (1) supporting evidence extracted from financial reports, (2) intermediate numerical reasoning steps, and (3) domain-specific financial knowledge that informs the reasoning process. We further provide an automatic evaluation framework covering both answer quality and attribution quality. Through extensive experiments on eight LLMs across multiple attribution-generation paradigms, we find that fine-grained metrics are important to distinguish model capabilities, that end-to-end generation achieves comparable performance to post-hoc approaches, and that iterative refinement only helps when guided by external feedback.