
TRACEALIGN -- Tracing the Drift: Attributing Alignment Failures to Training-Time Belief Sources in LLMs

Amitava Das, Vinija Jain, Aman Chadha

2025-08-06

Summary

This paper introduces TraceAlign, a system designed to find and fix problems in large language models (LLMs) when they produce harmful or unsafe outputs. It works by tracing these bad outputs back to the parts of the training data that caused them, then applying targeted interventions to reduce such failures.

What's the problem?

Even though LLMs are trained to be safe and follow human values, they sometimes 'drift' and give answers that are unsafe or violate policy when probed with adversarial or carefully crafted prompts. It is unclear which parts of the training data cause these unsafe behaviors, which makes the failures hard to fix effectively.

What's the solution?

The solution is TraceAlign, a framework that traces unsafe outputs back to the training data that produced them and scores them with a Belief Conflict Index (BCI), a measure of how strongly a completion conflicts with the model's safety policy. TraceAlign then applies three interventions: an inference-time filter that rejects unsafe completions, a fine-tuning objective that steers the model away from them, and a modified decoding procedure that prevents the model from drifting toward unsafe responses.
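To make the idea concrete, here is a minimal, hypothetical sketch of a BCI-style inference-time filter. All names (`belief_conflict_index`, `trace_filter`), the scoring formula (fraction of a completion covered by its longest span copied from a policy-conflicting training document), and the threshold are illustrative assumptions, not TraceAlign's actual implementation.

```python
# Toy sketch of a Belief-Conflict-Index-style filter. All function names,
# the scoring formula, and the threshold are illustrative assumptions.

def longest_match(completion: str, source: str) -> int:
    """Length (in words) of the longest contiguous word span shared
    by the completion and one training-data source."""
    c, s = completion.split(), source.split()
    best = 0
    for i in range(len(c)):
        for j in range(len(s)):
            k = 0
            while i + k < len(c) and j + k < len(s) and c[i + k] == s[j + k]:
                k += 1
            best = max(best, k)
    return best

def belief_conflict_index(completion: str, conflicting_sources: list[str]) -> float:
    """Toy BCI: fraction of the completion covered by its longest span
    copied from any policy-conflicting training document."""
    n = len(completion.split())
    if n == 0:
        return 0.0
    best = max((longest_match(completion, s) for s in conflicting_sources), default=0)
    return best / n

def trace_filter(completion: str, conflicting_sources: list[str],
                 threshold: float = 0.5) -> bool:
    """Inference-time intervention: accept a completion only if its
    BCI stays at or below the threshold."""
    return belief_conflict_index(completion, conflicting_sources) <= threshold
```

For example, a completion that copies a long span from a conflicting source scores a high BCI and is rejected, while an unrelated completion passes. The real framework attributes spans with an efficient index over the training corpus rather than this quadratic string matching, which is used here only for clarity.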

Why it matters?

This matters because it makes LLMs safer and more reliable by pinpointing exactly where unsafe behaviors come from in the training data and measurably reducing them. The result is AI systems that stay better aligned with human values, are less likely to produce harmful outputs, and can be trusted more in real-world applications.

Abstract

TraceAlign is a framework that identifies and mitigates alignment drift in LLMs by tracing unsafe completions to their training sources and applying interventions to reduce drift while maintaining utility.