Too Good to be Bad: On the Failure of LLMs to Role-Play Villains

Zihao Yi, Qingxuan Jiang, Ruotian Ma, Xingyu Chen, Qu Yang, Mengru Wang, Fanghua Ye, Ying Shen, Zhaopeng Tu, Xiaolong Li, Linus

2025-11-10

Summary

This paper investigates how well large language models, which excel at generating text, can convincingly play 'bad' characters – villains or morally questionable people. It turns out they struggle with this, and the research explains why.

What's the problem?

Large language models are designed to be safe and helpful, which means they avoid generating harmful or unethical content. However, accurately portraying a villain often *requires* expressing those kinds of ideas and behaviors. The core issue is that making a model 'safe' clashes with its ability to authentically role-play a character who isn't concerned with being safe or ethical. There wasn't a good way to measure how well these models actually did at playing villains, and whether safety features were hindering their performance.

What's the solution?

The researchers created a new test called the 'Moral RolePlay benchmark'. It assigns language models characters on a four-level morality scale, ranging from very good to completely evil. The models were asked to role-play these characters, and the researchers evaluated how well they stayed in character, looking specifically at traits like deceitfulness and manipulation. They compared performance across the morality levels and also checked whether general chatbot skill predicts villainous role-playing ability.
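The evaluation loop described above can be sketched roughly as follows. This is an illustrative assumption, not the paper's actual code: the character cards, the `role_play` query, and the stubbed `fidelity_score` judge are all hypothetical stand-ins, since the paper's real judging setup isn't detailed in this summary.

```python
# Hypothetical sketch of a Moral RolePlay-style evaluation loop.
# All names, characters, and scores here are illustrative, not from the paper.
from statistics import mean

# Four-level moral alignment scale, from moral paragons (1) to pure villains (4).
MORAL_LEVELS = {1: "paragon", 2: "mostly good", 3: "morally ambiguous", 4: "villain"}

# Toy character cards: (name, moral level, defining traits).
CHARACTERS = [
    ("Healer", 1, ["compassionate"]),
    ("Schemer", 4, ["deceitful", "manipulative"]),
]

def role_play(model, character, traits):
    """Stub for prompting `model` to role-play `character`; returns a reply."""
    return f"[{model} replying in character as {character}]"

def fidelity_score(reply, traits):
    """Stub judge: score 0-10 for how well `reply` expresses `traits`.
    A real benchmark would use an LLM judge or human raters here."""
    return 5.0

def evaluate(model):
    """Average role-play fidelity per moral level for one model."""
    per_level = {}
    for name, level, traits in CHARACTERS:
        reply = role_play(model, name, traits)
        per_level.setdefault(level, []).append(fidelity_score(reply, traits))
    return {level: mean(scores) for level, scores in per_level.items()}

print(evaluate("some-llm"))
```

With a real judge in place of the stub, the paper's headline finding would show up here as the level-4 average falling below the level-1 average.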

Why it matters?

This research shows a significant limitation of current language models: they're not very good at convincingly playing villains because their safety features get in the way. This is important because it highlights a trade-off between making models safe and allowing them to be creatively flexible. The work provides a benchmark for future research to develop models that can handle morally complex roles without compromising safety, leading to more realistic and engaging AI characters.

Abstract

Large Language Models (LLMs) are increasingly tasked with creative generation, including the simulation of fictional characters. However, their ability to portray non-prosocial, antagonistic personas remains largely unexamined. We hypothesize that the safety alignment of modern LLMs creates a fundamental conflict with the task of authentically role-playing morally ambiguous or villainous characters. To investigate this, we introduce the Moral RolePlay benchmark, a new dataset featuring a four-level moral alignment scale and a balanced test set for rigorous evaluation. We task state-of-the-art LLMs with role-playing characters from moral paragons to pure villains. Our large-scale evaluation reveals a consistent, monotonic decline in role-playing fidelity as character morality decreases. We find that models struggle most with traits directly antithetical to safety principles, such as "Deceitful" and "Manipulative", often substituting nuanced malevolence with superficial aggression. Furthermore, we demonstrate that general chatbot proficiency is a poor predictor of villain role-playing ability, with highly safety-aligned models performing particularly poorly. Our work provides the first systematic evidence of this critical limitation, highlighting a key tension between model safety and creative fidelity. Our benchmark and findings pave the way for developing more nuanced, context-aware alignment methods.