FinTrust: A Comprehensive Benchmark of Trustworthiness Evaluation in Finance Domain

Tiansheng Hu, Tongyan Hu, Liuyang Bai, Yilun Zhao, Arman Cohan, Chen Zhao

2025-10-20

Summary

This paper introduces a new way to test how reliable large language models, or LLMs, are when used for financial tasks.

What's the problem?

While LLMs are getting good at *sounding* like they know finance, using them in the real world is risky because mistakes could have serious consequences for people's money. Until now, there was no thorough test of whether these models are actually trustworthy and aligned with financial rules and ethics.

What's the solution?

The researchers created a benchmark called FinTrust. This benchmark isn't just one big test, but a collection of smaller tests that look at different aspects of trustworthiness, like safety, fairness, and whether the model understands its legal responsibilities. They then tested eleven different LLMs, both proprietary and open-source, using FinTrust.

Why it matters?

FinTrust provides a standard way to measure how trustworthy LLMs are in finance, helping developers improve these models and check that they're safe and reliable before being used with real money. The results showed that all of the tested models, even the best ones, fall short on hard tasks like fiduciary alignment and disclosure, which points to a significant gap in legal awareness and shows where further development is needed.

Abstract

Recent LLMs have demonstrated promising ability in solving finance-related problems. However, applying LLMs in real-world finance applications remains challenging because of the high-risk, high-stakes nature of the domain. This paper introduces FinTrust, a comprehensive benchmark specifically designed for evaluating the trustworthiness of LLMs in finance applications. Our benchmark focuses on a wide range of alignment issues grounded in practical contexts and features fine-grained tasks for each dimension of trustworthiness evaluation. We assess eleven LLMs on FinTrust and find that proprietary models like o4-mini outperform in most tasks, such as safety, while open-source models like DeepSeek-V3 have an advantage in specific areas like industry-level fairness. For challenging tasks like fiduciary alignment and disclosure, all LLMs fall short, showing a significant gap in legal awareness. We believe that FinTrust can be a valuable benchmark for evaluating LLMs' trustworthiness in the finance domain.
