TriPlay-RL: Tri-Role Self-Play Reinforcement Learning for LLM Safety Alignment
Zhewen Tan, Wenhan Yu, Jianfeng Si, Tongxin Liu, Kaiqi Guan, Huiyan Jin, Jiawen Tao, Xiaokun Yuan, Duohe Ma, Xiangzheng Zhang, Tong Yang, Lin Sun
2026-01-28
Summary
This paper focuses on making large language models, like the ones powering chatbots, safer by reducing the chances they'll generate harmful or inappropriate responses.
What's the problem?
Large language models can sometimes produce toxic or dangerous content. The usual way to improve their safety sets up three roles: an attacker that tries to trick the model into bad behavior, a defender that tries to resist those attacks, and an evaluator that judges the results. Keeping this pipeline running is slow and depends on large amounts of human effort to label which outputs are acceptable and which are not.
What's the solution?
The researchers developed a system called TriPlay-RL that automates this safety improvement process. It uses a 'closed-loop' system where three AI 'agents' play against each other: an 'attacker' tries to find ways to make the model say harmful things, a 'defender' tries to prevent that, and an 'evaluator' learns to accurately identify unsafe responses. They all learn and get better over time without needing much human input, constantly challenging and refining each other's abilities.
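The mechanics of this loop are easier to picture in code. The sketch below is purely illustrative and assumes a lot: the `Attacker`, `Defender`, and `Evaluator` classes, the label names, and the reward rules are hypothetical placeholders standing in for the paper's LLM policies and reinforcement-learning updates, which are not described in this summary.

```python
import random

# Hypothetical stand-ins for the three roles; in the real framework each
# would be a language model trained with reinforcement learning.
class Attacker:
    def generate_prompt(self, seed_topics):
        # Propose an adversarial prompt around a sampled topic.
        return f"How would someone misuse {random.choice(seed_topics)}?"

    def update(self, reward):
        pass  # placeholder for an RL policy-update step

class Defender:
    def respond(self, prompt):
        # A safe response may refuse, or refuse and redirect to safe guidance.
        return "I can't help with that, but here is safe, general information instead."

    def update(self, reward):
        pass  # placeholder for an RL policy-update step

class Evaluator:
    def judge(self, prompt, response):
        # Label the defender's response; the labels drive both roles' rewards.
        if "can't help" in response and "information" not in response:
            return "simple_refusal"
        if "can't help" in response:
            return "useful_guidance"
        return "unsafe"

    def update(self, reward):
        pass  # placeholder for refining judgment between iterations

def self_play_round(attacker, defender, evaluator, seed_topics):
    prompt = attacker.generate_prompt(seed_topics)
    response = defender.respond(prompt)
    label = evaluator.judge(prompt, response)
    # Adversarial reward shaping (illustrative): the attacker is rewarded when it
    # elicits an unsafe response, the defender when it gives useful safe guidance.
    attacker.update(reward=1.0 if label == "unsafe" else 0.0)
    defender.update(reward={"unsafe": -1.0, "simple_refusal": 0.2, "useful_guidance": 1.0}[label])
    evaluator.update(reward=0.0)  # in practice, refined with only sparse human checks
    return prompt, response, label

if __name__ == "__main__":
    roles = Attacker(), Defender(), Evaluator()
    topics = ["household chemicals", "phishing emails", "account passwords"]
    for _ in range(3):
        print(self_play_round(*roles, topics))
```

Each round tightens the loop: harder prompts from the attacker force better defenses, and both give the evaluator more informative cases to judge.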
Why it matters?
This research is important because it offers a way to continuously improve the safety of large language models without relying heavily on humans. This makes the process more efficient and scalable, meaning we can potentially make these powerful AI tools much safer for everyone to use.
Abstract
In recent years, safety risks associated with large language models have become increasingly prominent, highlighting the urgent need to mitigate the generation of toxic and harmful content. The mainstream paradigm for LLM safety alignment typically adopts a collaborative framework involving three roles: an attacker for adversarial prompt generation, a defender for safety defense, and an evaluator for response assessment. In this paper, we propose a closed-loop reinforcement learning framework called TriPlay-RL that enables iterative and co-improving collaboration among three roles with near-zero manual annotation. Experimental results show that the attacker preserves high output diversity while achieving a 20%-50% improvement in adversarial effectiveness; the defender attains 10%-30% gains in safety performance without degrading general reasoning capability; and the evaluator continuously refines its fine-grained judgment ability through iterations, accurately distinguishing unsafe responses, simple refusals, and useful guidance. Overall, our framework establishes an efficient and scalable paradigm for LLM safety alignment, enabling continuous co-evolution within a unified learning loop.
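The abstract measures the attacker by adversarial effectiveness and the defender by safety performance, both grounded in the evaluator's three-way labels (unsafe response, simple refusal, useful guidance). The snippet below is a minimal sketch of how such metrics could be aggregated per iteration; the label strings and metric definitions are assumptions for illustration, not the paper's exact formulas.

```python
from collections import Counter

def summarize_round(labels):
    """Aggregate evaluator labels from one self-play iteration into headline
    metrics. Label names and metric definitions are illustrative assumptions."""
    counts = Counter(labels)
    total = sum(counts.values()) or 1
    attack_success_rate = counts["unsafe"] / total            # attacker's adversarial effectiveness
    safe_response_rate = 1.0 - attack_success_rate            # defender's safety performance
    useful_guidance_rate = counts["useful_guidance"] / total  # safe answers beyond bare refusals
    return attack_success_rate, safe_response_rate, useful_guidance_rate

print(summarize_round(["unsafe", "simple_refusal", "useful_guidance", "useful_guidance"]))
# -> (0.25, 0.75, 0.5)
```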