The Alignment Waltz: Jointly Training Agents to Collaborate for Safety
Jingyu Zhang, Haozhu Wang, Eric Michael Smith, Sid Wang, Amr Sharaf, Mahesh Pasupuleti, Benjamin Van Durme, Daniel Khashabi, Jason Weston, Hongyuan Zhan
2025-10-10
Summary
This paper introduces a new method, called WaltzRL, for making large language models (LLMs) both safer and more helpful. LLMs often struggle to balance avoiding harmful responses against being so cautious that they refuse to answer reasonable questions.
What's the problem?
Large language models are tricky to control. They can be tricked into generating unsafe content through clever prompts, but attempts to prevent this often lead to the model refusing to answer many legitimate questions. Existing methods usually just block anything potentially unsafe, which is a bit like shutting down the whole system to avoid a small problem: it may prevent some harm, but it also makes the model far less useful.
What's the solution?
WaltzRL uses a system of two AI agents working together. One agent tries to answer questions, and the other agent provides feedback on those answers, suggesting improvements to make them both safer and more helpful. This feedback isn't just a simple 'yes' or 'no'; it's constructive guidance. The two agents are trained together with reinforcement learning: the feedback agent is rewarded when the answering agent successfully incorporates its suggestions, and that reward evolves as the collaboration improves. Importantly, the feedback agent only steps in when needed, so simple, safe questions get quick answers without extra processing.
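The collaboration described above can be sketched as a small inference loop. This is a minimal illustration, not WaltzRL's actual implementation: the agent interfaces (`conversation_agent`, `feedback_agent`), the `max_rounds` cap, and the tuple returned by the feedback agent are all assumptions made for the sketch.

```python
def respond(query, conversation_agent, feedback_agent, max_rounds=2):
    """Sketch of the two-agent inference loop: the conversation agent
    drafts an answer, and the feedback agent engages only when it
    judges the answer unsafe or an overrefusal.

    Both agents are assumed to be callables wrapping LLMs; their exact
    interfaces here are hypothetical, not WaltzRL's API.
    """
    response = conversation_agent(query)
    for _ in range(max_rounds):
        # The feedback agent decides whether to engage and, if so,
        # returns constructive guidance rather than a bare verdict.
        needs_feedback, suggestion = feedback_agent(query, response)
        if not needs_feedback:
            # Safe, helpful answers ship as-is -- no extra latency.
            break
        # The conversation agent revises its answer using the feedback,
        # improving the response instead of discarding it.
        response = conversation_agent(query, prior=response, feedback=suggestion)
    return response
```

Because the feedback agent returns early on benign queries, the common case costs a single conversation-agent call plus one lightweight check.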
Why it matters?
This research is important because it offers a way to improve the safety of LLMs without sacrificing their ability to be helpful. By teaching the models to collaborate and learn from each other, WaltzRL moves us closer to AI systems that are both powerful and reliable, offering a better balance between avoiding harm and providing useful information.
Abstract
Harnessing the power of LLMs requires a delicate dance between being helpful and harmless. This creates a fundamental tension between two competing challenges: vulnerability to adversarial attacks that elicit unsafe content, and a tendency for overrefusal on benign but sensitive prompts. Current approaches often navigate this dance with safeguard models that completely reject any content that contains unsafe portions. This approach cuts the music entirely: it may exacerbate overrefusals and fails to provide nuanced guidance for queries it refuses. To teach models a more coordinated choreography, we propose WaltzRL, a novel multi-agent reinforcement learning framework that formulates safety alignment as a collaborative, positive-sum game. WaltzRL jointly trains a conversation agent and a feedback agent, where the latter is incentivized to provide useful suggestions that improve the safety and helpfulness of the conversation agent's responses. At the core of WaltzRL is a Dynamic Improvement Reward (DIR) that evolves over time based on how well the conversation agent incorporates the feedback. At inference time, unsafe or overrefusing responses from the conversation agent are improved rather than discarded. The feedback agent is deployed together with the conversation agent and only engages adaptively when needed, preserving helpfulness and low latency on safe queries. Our experiments, conducted across five diverse datasets, demonstrate that WaltzRL significantly reduces both unsafe responses (e.g., from 39.0% to 4.6% on WildJailbreak) and overrefusals (from 45.3% to 9.9% on OR-Bench) compared to various baselines. By enabling the conversation and feedback agents to co-evolve and adaptively apply feedback, WaltzRL enhances LLM safety without degrading general capabilities, thereby advancing the Pareto front between helpfulness and harmlessness.
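The Dynamic Improvement Reward (DIR) described in the abstract rewards the feedback agent according to whether its feedback actually improves the conversation agent's response. The following is a hedged sketch of that intuition only; the binary labels (0 = unsafe or overrefusing, 1 = safe and helpful) and the +1/0/-1 scale are illustrative assumptions, not the paper's exact formulation.

```python
def dynamic_improvement_reward(label_before: int, label_after: int) -> float:
    """Illustrative sketch of the DIR idea: score the feedback agent by
    comparing the conversation agent's response before and after feedback.

    Labels are a simplifying assumption for this sketch:
      0 = problematic response (unsafe or an overrefusal)
      1 = safe and helpful response
    """
    if label_after > label_before:
        return 1.0   # feedback turned a problematic response into a good one
    if label_after < label_before:
        return -1.0  # feedback degraded an already-good response
    return 0.0       # no change -- feedback had no measurable effect
```

Tying the feedback agent's reward to the conversation agent's improvement is what makes the game positive-sum: the feedback agent only profits by making its partner's answers better.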