Same Claim, Different Judgment: Benchmarking Scenario-Induced Bias in Multilingual Financial Misinformation Detection
Zhiwei Liu, Yupen Cao, Yuechen Jiang, Mohsinul Kabir, Polydoros Giannouris, Chen Xu, Ziyang Xu, Tianlei Zhu, Tariquzzaman Faisal, Triantafillos Papadopoulos, Yan Wang, Lingfei Qian, Xueqing Peng, Zhuohan Xie, Ye Yuan, Saeed Almheiri, Abdulrazzaq Alnajjar, Mingbin Chen, Harry Stuart, Paul Thompson, Prayag Tiwari, Alejandro Lopez-Lira
2026-01-12
Summary
This paper investigates whether large language models, which are becoming popular in finance, exhibit the same kinds of biases that humans do when making financial decisions. It focuses specifically on how these models handle misinformation in different languages and cultural contexts.
What's the problem?
Large language models learn from text written by people, so they can pick up human biases – systematic errors in thinking that can lead to bad choices. While some research has looked at bias in these models, it hasn't tested them in realistic, complex financial situations, especially when dealing with misinformation in multiple languages or considering how cultural factors might influence judgments. Existing tests are too simple and don't reflect the real world.
What's the solution?
The researchers created a new testing framework called FMDscen. This framework presents language models with complex financial scenarios designed to trigger specific behavioral biases. These scenarios involve different roles, personalities, regions, ethnicities, and religious beliefs. The researchers also built a dataset of financial misinformation in four languages: English, Chinese, Greek, and Bengali. They then tested 22 different language models using this framework to see how they performed.
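The core idea of combining scenario attributes with misinformation claims can be illustrated with a small sketch. The attribute values, prompt wording, and function names below are hypothetical, not the benchmark's actual templates; the sketch only shows how crossing roles, personalities, and languages with a single claim yields the full set of scenario-conditioned test prompts.

```python
from itertools import product

# Illustrative scenario attributes, loosely mirroring FMDscen's
# role- and personality-based scenario type. The actual benchmark
# uses expert-designed scenarios; these values are placeholders.
ROLES = ["retail investor", "financial analyst"]
PERSONALITIES = ["risk-averse", "overconfident"]
LANGUAGES = ["English", "Chinese", "Greek", "Bengali"]

def build_prompt(role: str, personality: str, language: str, claim: str) -> str:
    """Compose one scenario-conditioned misinformation-detection prompt."""
    return (
        f"You are a {personality} {role}. "
        f"The following claim is written in {language}. "
        f"Decide whether it is misinformation.\nClaim: {claim}"
    )

def generate_prompts(claim: str) -> list[str]:
    """Cross every role/personality/language combination for one claim."""
    return [
        build_prompt(role, personality, language, claim)
        for role, personality, language in product(ROLES, PERSONALITIES, LANGUAGES)
    ]

prompts = generate_prompts("Company X's stock is guaranteed to double next week.")
print(len(prompts))  # 2 roles x 2 personalities x 4 languages = 16
```

Comparing a model's verdicts on the same claim across these prompts is what exposes scenario-induced bias: an unbiased detector should give the same judgment regardless of the assigned role or persona.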
Why it matters?
This research is important because if language models used in finance are biased, they could make inaccurate or unfair decisions, potentially leading to financial instability or losses. By identifying and measuring these biases, the researchers hope to improve the reliability and trustworthiness of these models, making them safer to use in real-world financial applications. The project is also publicly available, allowing others to build upon their work.
Abstract
Large language models (LLMs) have been widely applied across various domains of finance. Since their training data are largely derived from human-authored corpora, LLMs may inherit a range of human biases. Behavioral biases can lead to instability and uncertainty in decision-making, particularly when processing financial information. However, existing research on LLM bias has mainly focused on direct questioning or simplified, general-purpose settings, with limited consideration of complex real-world financial environments and high-risk, context-sensitive, multilingual financial misinformation detection (MFMD) tasks. In this work, we propose FMDscen, a comprehensive benchmark for evaluating behavioral biases of LLMs in MFMD across diverse economic scenarios. In collaboration with financial experts, we construct three types of complex financial scenarios: (i) role- and personality-based, (ii) role- and region-based, and (iii) role-based scenarios incorporating ethnicity and religious beliefs. We further develop a multilingual financial misinformation dataset covering English, Chinese, Greek, and Bengali. By integrating these scenarios with misinformation claims, FMDscen enables a systematic evaluation of 22 mainstream LLMs. Our findings reveal that pronounced behavioral biases persist across both commercial and open-source models. This project will be available at https://github.com/lzw108/FMD.