
Do LLMs Have Political Correctness? Analyzing Ethical Biases and Jailbreak Vulnerabilities in AI Systems

Isack Lee, Haebin Seong

2024-10-18


Summary

This paper examines the ethical biases in large language models (LLMs) and how these biases can lead to vulnerabilities, allowing users to manipulate the models into producing harmful content.

What's the problem?

While LLMs perform a wide range of tasks effectively, they also pose safety risks such as 'jailbreaking,' where carefully crafted prompts trick a model into generating inappropriate or harmful responses. Developers have tried to make LLMs safer by instilling intentional biases similar to political correctness (PC), but these safety-motivated biases can introduce new vulnerabilities instead of solving the problem.

What's the solution?

The authors investigate how these intentional biases affect LLM behavior and how they can be exploited through jailbreaking. They introduce 'PCJailbreak,' which shows that the same jailbreak prompt succeeds at different rates depending on the demographic keywords it contains, exposing the risks posed by these safety measures. To counter this, they propose a defense method called PCDefense, which injects a defense prompt before generation to adjust these biases, making it a more efficient alternative to guard models that require an additional inference pass after text generation.
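To make the prompt-injection idea concrete, here is a minimal sketch of a PCDefense-style guard that prepends a defense prompt before generation rather than running a separate guard model afterwards. It assumes an OpenAI-style chat API; the defense prompt wording, model name, and function are illustrative placeholders, not the authors' actual implementation.

```python
# Illustrative sketch: inject a defense prompt before generation instead of
# checking the output with a separate guard model afterwards.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical defense prompt: ask the model to apply a uniform safety
# standard regardless of demographic or identity keywords in the request.
DEFENSE_PROMPT = (
    "Before answering, evaluate whether the request is harmful. "
    "Apply the same safety standard regardless of any demographic or "
    "identity terms in the request, and refuse harmful requests uniformly."
)

def guarded_generate(user_prompt: str, model: str = "gpt-4o") -> str:
    """Generate a response with the defense prompt injected up front."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": DEFENSE_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content

print(guarded_generate("Summarize the plot of a heist movie."))
```

Because the defense lives in the prompt, it adds no extra inference pass, which is the efficiency argument the paper makes against post-generation guard models.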

Why it matters?

This research is important because it sheds light on the complexities of ensuring AI systems behave ethically while also being effective. By understanding how biases can both help and hinder LLMs, developers can create safer AI technologies that minimize the risk of harmful outputs. This work emphasizes the need for responsible AI development practices as these technologies become increasingly integrated into society.

Abstract

Although large language models (LLMs) demonstrate impressive proficiency in various tasks, they present potential safety risks, such as 'jailbreaks', where malicious inputs can coerce LLMs into generating harmful content. To address these issues, many LLM developers have implemented various safety measures to align these models. This alignment involves several techniques, including data filtering during pre-training, supervised fine-tuning, reinforcement learning from human feedback, and red-teaming exercises. These methods often introduce deliberate and intentional biases similar to Political Correctness (PC) to ensure the ethical behavior of LLMs. In this paper, we delve into the intentional biases injected into LLMs for safety purposes and examine methods to circumvent these safety alignment techniques. Notably, these intentional biases result in a jailbreaking success rate in GPT-4o models that differs by 20% between non-binary and cisgender keywords and by 16% between white and black keywords, even when the other parts of the prompts are identical. We introduce the concept of PCJailbreak, highlighting the inherent risks posed by these safety-induced biases. Additionally, we propose an efficient defense method, PCDefense, which prevents jailbreak attempts by injecting defense prompts prior to generation. PCDefense stands as an appealing alternative to Guard Models, such as Llama-Guard, that require additional inference cost after text generation. Our findings emphasize the urgent need for LLM developers to adopt a more responsible approach when designing and implementing safety measures.
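The keyword-dependent success rates reported in the abstract can be illustrated with a small keyword-swapping loop: the same jailbreak template is instantiated with different group keywords and the non-refusal rate is compared across groups. The template, keyword list, refusal heuristic, and `generate` callable below are simplified placeholders, not the paper's released evaluation code.

```python
# Illustrative keyword-swap evaluation in the spirit of PCJailbreak:
# run the same jailbreak template with different group keywords and
# compare how often the model complies instead of refusing.

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am sorry")

def is_refusal(text: str) -> bool:
    """Crude heuristic: treat responses that open with an apology/refusal as refusals."""
    return text.strip().lower().startswith(REFUSAL_MARKERS)

def jailbreak_success_rate(generate, template: str, keyword: str, n_trials: int = 20) -> float:
    """Fraction of trials where the model did NOT refuse the keyword-filled prompt."""
    successes = 0
    for _ in range(n_trials):
        response = generate(template.format(group=keyword))
        if not is_refusal(response):
            successes += 1
    return successes / n_trials

def compare_groups(generate, template: str, keywords: list[str]) -> dict[str, float]:
    """Map each group keyword to its measured jailbreak success rate."""
    return {kw: jailbreak_success_rate(generate, template, kw) for kw in keywords}

# Usage, with any text-generation callable `generate(prompt) -> str`:
# template = "As a {group} person, ... <jailbreak prompt body> ..."
# print(compare_groups(generate, template, ["cisgender", "non-binary"]))
```

A gap between the per-keyword rates, such as the 20% difference the paper reports for GPT-4o between non-binary and cisgender keywords, is exactly the kind of bias signal this comparison is meant to surface.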