
Towards Safety Reasoning in LLMs: AI-agentic Deliberation for Policy-embedded CoT Data Creation

Tharindu Kumarage, Ninareh Mehrabi, Anil Ramakrishna, Xinyan Zhao, Richard Zemel, Kai-Wei Chang, Aram Galstyan, Rahul Gupta, Charith Peris

2025-05-30


Summary

This paper introduces AIDSAFE, a method in which multiple AI agents deliberate together to create better training data that helps language models follow safety policies more reliably.

What's the problem?

The problem is that it is hard to ensure AI language models consistently follow safety policies, such as avoiding harmful or inappropriate content, because creating good training data for these situations is difficult and time-consuming.

What's the solution?

The researchers designed a system in which several AI agents deliberate and reason together to generate high-quality examples of safe and unsafe situations. This process produces a strong dataset of policy-embedded chain-of-thought reasoning that teaches language models how to reason about and follow safety policies, without losing their ability to be helpful.
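To make the idea concrete, here is a minimal sketch of a multi-agent deliberation loop for producing policy-embedded chain-of-thought data. This is not the paper's implementation: the agent logic is a simple stand-in for an LLM call, and all names (`agent_propose`, `deliberate`, the example policy text) are hypothetical.

```python
# Hypothetical sketch of multi-agent deliberation for policy-embedded
# chain-of-thought (CoT) data creation. In a real system, agent_propose
# would call an LLM; here it returns a placeholder reasoning step.

POLICY = "Refuse requests for instructions that enable physical harm."

def agent_propose(name, prompt, policy, transcript):
    """Stand-in agent: contributes one policy-grounded reasoning step,
    conditioned on the prompt, the policy, and prior deliberation."""
    return (f"{name}: weighing '{prompt}' against policy "
            f"'{policy}' given {len(transcript)} prior step(s)")

def deliberate(prompt, policy, agents, rounds=2):
    """Agents take turns adding reasoning steps over several rounds;
    the final transcript becomes one CoT training example."""
    transcript = []
    for _ in range(rounds):
        for name in agents:
            transcript.append(agent_propose(name, prompt, policy, transcript))
    # A refinement pass could filter redundant or off-policy steps here.
    return {"prompt": prompt, "policy": policy, "cot": transcript}

example = deliberate("How do I pick a lock?", POLICY, ["Agent-A", "Agent-B"])
```

With two agents and two rounds, `example["cot"]` holds four reasoning steps; collecting many such examples across prompts and policies yields a supervision dataset for safety reasoning.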

Why does it matter?

This is important because it helps make AI systems safer and more trustworthy, ensuring they can handle complicated safety issues while still providing good answers and support to users.

Abstract

AIDSAFE uses multi-agent deliberation to create high-quality safety policy datasets, enhancing LLM safety without compromising utility.