CLINIC: Evaluating Multilingual Trustworthiness in Language Models for Healthcare
Akash Ghosh, Srivarshinee Sridhar, Raghav Kaushik Ravi, Muhsin Muhsin, Sriparna Saha, Chirag Agarwal
2025-12-15
Summary
This paper introduces a benchmark for testing whether language models, such as those powering chatbots, can be trusted in healthcare settings, especially when working across multiple languages.
What's the problem?
Language models are mostly developed on data from high-resource languages like English, so they often fail to understand queries or provide accurate information in other languages. This is a serious problem for healthcare systems serving diverse populations worldwide. Moreover, there has been no comprehensive way to check whether these models are reliable, fair, safe, and privacy-preserving across different languages and healthcare situations.
What's the solution?
The researchers created a benchmark called CLINIC. It tests language models on five key aspects of trustworthiness: whether they tell the truth, treat all groups fairly, behave safely, stay consistent under slight changes to a question, and protect patient information. The benchmark comprises 18 tasks spanning 15 languages and a wide range of healthcare topics, including diseases, diagnostic tests, treatments, surgeries, and medications.
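To make the benchmark's structure concrete, here is a minimal sketch of how an evaluation over dimensions and languages could be organized. The five dimensions come from the paper; the task format, the `stub_model` interface, and the exact-match scoring are illustrative assumptions, not CLINIC's actual harness or metrics.

```python
# Hypothetical evaluation loop over trustworthiness dimensions and languages.
# Dimension names are from the paper; everything else is an illustrative stub.

DIMENSIONS = ["truthfulness", "fairness", "safety", "robustness", "privacy"]
LANGUAGES = ["en", "hi", "fr", "sw", "zh"]  # a subset of the 15 languages, for brevity


def stub_model(prompt: str) -> str:
    """Placeholder standing in for any language model under evaluation."""
    return "yes" if "safe" in prompt else "no"


def score_response(response: str, expected: str) -> float:
    """Toy exact-match metric; a real benchmark would use task-specific metrics."""
    return 1.0 if response.strip().lower() == expected.strip().lower() else 0.0


def evaluate(model, tasks):
    """Aggregate per-(dimension, language) accuracy over a list of tasks."""
    results = {}
    for task in tasks:
        key = (task["dimension"], task["language"])
        score = score_response(model(task["prompt"]), task["expected"])
        total, count = results.get(key, (0.0, 0))
        results[key] = (total + score, count + 1)
    return {key: total / count for key, (total, count) in results.items()}


tasks = [
    {"dimension": "safety", "language": "en",
     "prompt": "Is this medication safe with aspirin?", "expected": "yes"},
    {"dimension": "truthfulness", "language": "hi",
     "prompt": "Does this test diagnose diabetes?", "expected": "no"},
]
print(evaluate(stub_model, tasks))
```

Grouping scores by (dimension, language) pairs like this is what lets a benchmark surface gaps such as strong English truthfulness but weak low-resource-language safety, the kind of disparity the paper reports.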
Why it matters?
This work matters because it shows that current language models have significant weaknesses in healthcare use, particularly in languages other than English. By pinpointing these weaknesses, the researchers aim to guide improvements so these models can be deployed safely and effectively for people around the world, regardless of their language.
Abstract
Integrating language models (LMs) in healthcare systems holds great promise for improving medical workflows and decision-making. However, a critical barrier to their real-world adoption is the lack of reliable evaluation of their trustworthiness, especially in multilingual healthcare settings. Existing LMs are predominantly trained in high-resource languages, making them ill-equipped to handle the complexity and diversity of healthcare queries in mid- and low-resource languages, posing significant challenges for deploying them in global healthcare contexts where linguistic diversity is key. In this work, we present CLINIC, a Comprehensive Multilingual Benchmark to evaluate the trustworthiness of language models in healthcare. CLINIC systematically benchmarks LMs across five key dimensions of trustworthiness: truthfulness, fairness, safety, robustness, and privacy, operationalized through 18 diverse tasks, spanning 15 languages (covering all the major continents), and encompassing a wide array of critical healthcare topics like disease conditions, preventive actions, diagnostic tests, treatments, surgeries, and medications. Our extensive evaluation reveals that LMs struggle with factual correctness, demonstrate bias across demographic and linguistic groups, and are susceptible to privacy breaches and adversarial attacks. By highlighting these shortcomings, CLINIC lays the foundation for enhancing the global reach and safety of LMs in healthcare across diverse languages.