Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning
Helena Casademunt, Caden Juang, Adam Karvonen, Samuel Marks, Senthooran Rajamanoharan, Neel Nanda
2025-07-23
Summary
This paper introduces Concept Ablation Fine-Tuning (CAFT), a fine-tuning technique for large language models (LLMs) that controls what the model learns by ablating (removing) unwanted concepts during training, without changing the original training data.
What's the problem?
When LLMs are fine-tuned, they can pick up unintended concepts from the training data and then give incorrect or misaligned answers outside the training distribution. The usual fix is to modify the training data, which is not always possible.
What's the solution?
The researchers use interpretability tools to identify directions in the model's latent space that represent undesired concepts, then ablate those directions during fine-tuning by projecting them out of the model's activations. Because the model can no longer rely on the unwanted concepts, its learning is steered toward the intended task, reducing misaligned behavior out of distribution (see the sketch below).
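As a rough illustration of the ablation step (a minimal sketch, not the authors' exact implementation), the projection can be applied with a PyTorch forward hook that subtracts the component of each activation along an unwanted concept direction during every fine-tuning forward pass. The layer index and concept_vector below are hypothetical placeholders.

```python
import torch

def ablate_concept(activations: torch.Tensor, concept_dir: torch.Tensor) -> torch.Tensor:
    """Remove the component of `activations` along `concept_dir`."""
    direction = concept_dir / concept_dir.norm()          # unit vector for the concept
    coeffs = activations @ direction                      # projection coefficients, shape (..., )
    return activations - coeffs.unsqueeze(-1) * direction # subtract the projected component

def make_ablation_hook(concept_dir: torch.Tensor):
    # Forward hook that replaces a layer's hidden states with their ablated version.
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        ablated = ablate_concept(hidden, concept_dir.to(hidden.device, hidden.dtype))
        return (ablated, *output[1:]) if isinstance(output, tuple) else ablated
    return hook

# Hypothetical usage during fine-tuning (layer index and concept_vector are illustrative):
# layer = model.model.layers[12]
# handle = layer.register_forward_hook(make_ablation_hook(concept_vector))
# ...run the ordinary fine-tuning loop; the unwanted direction is zeroed out...
# handle.remove()
```

The key design choice is that the data and loss stay untouched; only the model's internal representation of the unwanted concept is removed while gradients flow through the projected activations.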
Why does it matter?
CAFT offers a way to steer AI models toward safer, more reliable behavior without collecting new data or modifying existing data, which helps in deploying trustworthy AI systems.
Abstract
Concept Ablation Fine-Tuning (CAFT) uses interpretability tools to control LLM generalization by ablating undesired concepts in latent space, reducing misaligned responses without altering training data.