Expect the Unexpected: FailSafe Long Context QA for Finance
Kiran Kamble, Melisa Russak, Dmytro Mozolevskyi, Muayad Ali, Mateusz Russak, Waseem AlShikh
2025-02-12
Summary
This paper introduces FailSafeQA, a new benchmark for testing how well AI language models handle tricky financial questions and messy information. It's like a tough exam for AI that checks whether the model can still give accurate answers when the questions or the supporting documents are unclear, incomplete, or degraded.
What's the problem?
AI models are getting really good at answering questions, but in the real world, people might ask confusing questions or provide incomplete information, especially about complex financial topics. We need to make sure AI can handle these situations without making up false information or giving wrong answers that could lead to bad financial decisions.
What's the solution?
The researchers created FailSafeQA, which tests AI models in two main ways. First, it gives the AI unclear or incomplete questions to see how it responds. Second, it provides the AI with low-quality or irrelevant information to work with. They then used another AI to grade how well 24 different AI models performed on these tests, looking at things like how accurate and reliable their answers were.
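The two failure modes and the grading step can be sketched in code. The following is a minimal, illustrative Python sketch, not the paper's actual implementation: the perturbation rules, the 1-6 rating scale, and the acceptance threshold are assumptions made here for clarity.

```python
def perturb_query(query: str, mode: str) -> str:
    """Query Failure case: return a degraded variant of the question.

    Modes are illustrative stand-ins for the paper's perturbations
    (varying completeness and linguistic accuracy).
    """
    if mode == "incomplete":
        words = query.split()
        return " ".join(words[: max(1, len(words) // 2)])  # drop the second half
    if mode == "misspelled":
        return query.replace("e", "3")  # crude typo injection
    return query  # baseline: unchanged


def perturb_context(context: str, mode: str) -> str:
    """Context Failure case: return a degraded variant of the document."""
    if mode == "empty":
        return ""  # simulate an empty upload
    if mode == "irrelevant":
        return "Lorem ipsum dolor sit amet."  # unrelated filler document
    return context  # baseline: original document


def robustness_score(judge_ratings: list[int], threshold: int = 4) -> float:
    """Aggregate LLM-as-a-Judge ratings into a single score.

    Hypothetical rule: the fraction of perturbed test cases whose rating
    meets an acceptance threshold on an assumed 1-6 scale.
    """
    return sum(r >= threshold for r in judge_ratings) / len(judge_ratings)
```

In use, each original (query, context) pair would be expanded into its perturbed variants, answered by the model under test, rated by a judge model such as Qwen2.5-72B-Instruct, and then aggregated, e.g. `robustness_score([6, 5, 3, 6])` yields `0.75`.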
Why does it matter?
This matters because as we start to rely more on AI for financial advice and information, we need to make sure it's trustworthy. The study found that even the best AI models sometimes make things up or give wrong answers, which could be dangerous in real-world financial situations. By creating this tough test, the researchers are helping make AI safer and more reliable for use in finance, which could prevent costly mistakes and help people make better financial decisions in the future.
Abstract
We propose a new long-context financial benchmark, FailSafeQA, designed to test the robustness and context-awareness of LLMs against six variations in human-interface interactions in LLM-based query-answer systems within finance. We concentrate on two case studies: Query Failure and Context Failure. In the Query Failure scenario, we perturb the original query to vary in domain expertise, completeness, and linguistic accuracy. In the Context Failure case, we simulate the uploads of degraded, irrelevant, and empty documents. We employ the LLM-as-a-Judge methodology with Qwen2.5-72B-Instruct and use fine-grained rating criteria to define and calculate Robustness, Context Grounding, and Compliance scores for 24 off-the-shelf models. The results suggest that although some models excel at mitigating input perturbations, they must balance robust answering with the ability to refrain from hallucinating. Notably, Palmyra-Fin-128k-Instruct, recognized as the most compliant model, maintained strong baseline performance but encountered challenges in sustaining robust predictions in 17% of test cases. On the other hand, the most robust model, OpenAI o3-mini, fabricated information in 41% of tested cases. The results demonstrate that even high-performing models have significant room for improvement and highlight the role of FailSafeQA as a tool for developing LLMs optimized for dependability in financial applications. The dataset is available at: https://huggingface.co/datasets/Writer/FailSafeQA