BARRED: Synthetic Training of Custom Policy Guardrails via Asymmetric Debate
Arnon Mazza, Elad Levi
2026-04-29
Summary
This paper introduces BARRED, a new method for creating task-specific safety systems, called 'guardrails', for large language models (LLMs).
What's the problem?
It's really hard to build good guardrails for LLMs. General safety models aren't specific enough for each task, and trying to tell the LLM directly what *not* to do through prompts is unreliable on borderline cases and expensive to run. Building a custom classifier that decides whether an LLM's response is safe is accurate and fast, but it usually requires a huge number of examples that someone has to label by hand, which takes a lot of time and money.
What's the solution?
BARRED solves this by automatically creating a large set of training examples for these custom safety classifiers. It starts with just a description of the task and a few unlabeled examples. It then breaks the policy down into different dimensions so the generated examples cover the task comprehensively, and runs a 'debate' between AI agents to double-check the label assigned to each example. The result is a high-quality dataset without humans having to label everything.
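To make the idea concrete, here is a rough sketch of what such a pipeline could look like. The function names, prompts, and the generic `llm` text-in/text-out callable are illustrative assumptions, not the authors' implementation; the paper describes the stages (dimension decomposition, candidate generation, debate-based label verification) only at a conceptual level.

```python
# Minimal sketch of the data-generation loop described above.
# Everything here (prompts, roles, function names) is an assumption.
from typing import Callable, Dict, List

LLM = Callable[[str], str]  # any text-in / text-out model client


def propose_dimensions(llm: LLM, task_description: str, n: int = 5) -> List[str]:
    """Ask the model to decompose the policy into coverage dimensions."""
    prompt = (
        f"Policy: {task_description}\n"
        f"List {n} distinct dimensions (topics, user intents, edge cases) that a "
        "guardrail for this policy must cover, one per line."
    )
    return [line.strip("- ").strip() for line in llm(prompt).splitlines() if line.strip()]


def generate_candidates(llm: LLM, task_description: str, dimension: str,
                        seeds: List[str], k: int = 4) -> List[str]:
    """Generate candidate responses for one dimension, including boundary cases."""
    prompt = (
        f"Policy: {task_description}\nDimension: {dimension}\n"
        f"Style examples: {seeds}\n"
        f"Write {k} new model responses for this dimension, including borderline "
        "cases that are hard to classify as compliant or violating. One per line."
    )
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]


def debate_label(llm: LLM, task_description: str, text: str, rounds: int = 2) -> str:
    """Label a candidate via a prosecutor/defender exchange, then a judge verdict."""
    transcript = ""
    for _ in range(rounds):
        transcript += "PROSECUTOR: " + llm(
            f"Policy: {task_description}\nResponse: {text}\n{transcript}\n"
            "Argue briefly that this response VIOLATES the policy.") + "\n"
        transcript += "DEFENDER: " + llm(
            f"Policy: {task_description}\nResponse: {text}\n{transcript}\n"
            "Argue briefly that this response COMPLIES with the policy.") + "\n"
    verdict = llm(
        f"Policy: {task_description}\nResponse: {text}\nDebate:\n{transcript}\n"
        "Final label, one word: SAFE or UNSAFE.")
    return "unsafe" if "UNSAFE" in verdict.upper() else "safe"


def build_dataset(llm: LLM, task_description: str, seeds: List[str]) -> List[Dict[str, str]]:
    """End-to-end: dimensions -> candidates -> debate-verified labels."""
    dataset = []
    for dim in propose_dimensions(llm, task_description):
        for text in generate_candidates(llm, task_description, dim, seeds):
            dataset.append({"text": text, "dimension": dim,
                            "label": debate_label(llm, task_description, text)})
    return dataset
```

Separating candidate generation from debate-based verification is one plausible way to realize the split the abstract describes: dimensions drive diversity, while the debate guards label fidelity.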
Why it matters?
This is important because it makes it much easier and cheaper to create effective safety systems for LLMs. Small models trained on BARRED's data actually perform better than state-of-the-art proprietary LLMs and dedicated guardrail models, without the massive amount of human labeling that is usually needed, making custom guardrails much more accessible.
Abstract
Deploying guardrails for custom policies remains challenging, as generic safety models fail to capture task-specific requirements, while prompting LLMs suffers from inconsistent boundary-case performance and high inference costs. Training custom classifiers achieves both accuracy and efficiency, yet demands substantial labeled data that is costly to obtain. We present BARRED (Boundary Alignment Refinement through REflection and Debate), a framework for generating faithful and diverse synthetic training data using only a task description and a small set of unlabeled examples. Our approach decomposes the domain space into dimensions to ensure comprehensive coverage, and employs multi-agent debate to verify label correctness, yielding a high-fidelity training corpus. Experiments across diverse custom policies demonstrate that small language models fine-tuned on our synthetic data consistently outperform state-of-the-art proprietary LLMs (including reasoning models) and dedicated guardrail models. Ablation studies confirm that both dimension decomposition and debate-based verification are critical for ensuring the diversity and label fidelity required for effective fine-tuning. The BARRED framework eliminates the reliance on extensive human annotation, offering a scalable solution for accurate custom guardrails.
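For illustration only, the small-model fine-tuning the abstract refers to could be done with a standard sequence-classification setup; the backbone model, hyperparameters, and the `synthetic.jsonl` file below are assumptions rather than details from the paper.

```python
# Hypothetical sketch: fine-tune a small guardrail classifier on the synthetic corpus.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

MODEL = "distilroberta-base"          # assumption: any small encoder backbone
LABELS = {"safe": 0, "unsafe": 1}

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

# Expected format: one JSON object per line, {"text": "...", "label": "safe" | "unsafe"}
data = load_dataset("json", data_files="synthetic.jsonl", split="train")
data = data.map(lambda ex: {
    **tokenizer(ex["text"], truncation=True, max_length=512),
    "labels": LABELS[ex["label"]],
})

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="guardrail", num_train_epochs=3,
                           per_device_train_batch_size=16, learning_rate=2e-5),
    train_dataset=data,
    data_collator=DataCollatorWithPadding(tokenizer),  # pad per batch at train time
)
trainer.train()
```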