CoBia: Constructed Conversations Can Trigger Otherwise Concealed Societal Biases in LLMs

Nafiseh Nikeghbal, Amir Hossein Kargaran, Jana Diesner

2025-10-14

CoBia: Constructed Conversations Can Trigger Otherwise Concealed Societal Biases in LLMs

Summary

This paper investigates how easily large language models (LLMs) can be tricked into expressing biased or harmful opinions, even though they're designed with safety features. It introduces a new method for testing these models by creating specific conversation scenarios.

What's the problem?

While LLMs are getting better at passing standard safety tests, they can still unexpectedly reveal prejudiced viewpoints during normal conversations. It's hard to figure out *when* and *why* these biases appear, and current safety checks don't always catch them. The core issue is that LLMs might seem safe in isolation, but become biased when engaged in a back-and-forth dialogue.

What's the solution?

The researchers developed a tool called CoBia, which creates artificial conversations where the LLM initially makes a biased statement about a particular group of people. Then, CoBia tests if the model can recognize and correct its own bias, and if it will reject further questions that build on that bias. They tested 11 different LLMs, both publicly available and those from companies, looking at biases related to gender, race, religion, and other sensitive categories. They compared the LLM's responses to what people would consider fair and unbiased.

Why it matters?

This research is important because it shows that LLMs can still harbor hidden biases that are revealed through interactive conversations. It highlights the need for more robust 'stress-testing' methods to uncover these biases and improve the safety and fairness of these powerful AI systems. By finding these weaknesses, developers can build better safeguards to prevent LLMs from spreading harmful stereotypes or discriminatory ideas.

Abstract

Improvements in model construction, including fortified safety guardrails, allow Large language models (LLMs) to increasingly pass standard safety checks. However, LLMs sometimes slip into revealing harmful behavior, such as expressing racist viewpoints, during conversations. To analyze this systematically, we introduce CoBia, a suite of lightweight adversarial attacks that allow us to refine the scope of conditions under which LLMs depart from normative or ethical behavior in conversations. CoBia creates a constructed conversation where the model utters a biased claim about a social group. We then evaluate whether the model can recover from the fabricated bias claim and reject biased follow-up questions. We evaluate 11 open-source as well as proprietary LLMs for their outputs related to six socio-demographic categories that are relevant to individual safety and fair treatment, i.e., gender, race, religion, nationality, sex orientation, and others. Our evaluation is based on established LLM-based bias metrics, and we compare the results against human judgments to scope out the LLMs' reliability and alignment. The results suggest that purposefully constructed conversations reliably reveal bias amplification and that LLMs often fail to reject biased follow-up questions during dialogue. This form of stress-testing highlights deeply embedded biases that can be surfaced through interaction. Code and artifacts are available at https://github.com/nafisenik/CoBia.

View Paper