Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning
Helena Casademunt, Caden Juang, Adam Karvonen, Samuel Marks, Senthooran Rajamanoharan, Neel Nanda
2025-07-23
Summary
This paper introduces Concept Ablation Fine-Tuning (CAFT), a fine-tuning technique for large language models (LLMs) that controls what the model learns by ablating (removing) unwanted concepts during training, without changing the original training data.
What's the problem?
When LLMs are fine-tuned, they can pick up unintended concepts from the training data and then give incorrect or misaligned answers outside the training distribution. The usual fix is to modify the training data, which is not always possible.
What's the solution?
The researchers use interpretability tools to identify directions in the model's latent space that represent undesired concepts, then ablate those directions during fine-tuning by projecting them out of the model's activations. Because the model can no longer rely on the unwanted concepts, its learning is steered toward the intended task, reducing misaligned behavior out of distribution (see the sketch below).
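As a rough illustration of the ablation step (a minimal sketch, not the authors' exact implementation), the projection can be applied with a PyTorch forward hook that subtracts the component of each activation along an unwanted concept direction during every fine-tuning forward pass. The layer index and concept_vector below are hypothetical placeholders.

```python
import torch

def ablate_concept(activations: torch.Tensor, concept_dir: torch.Tensor) -> torch.Tensor:
    """Remove the component of `activations` along `concept_dir`."""
    direction = concept_dir / concept_dir.norm()          # unit vector for the concept
    coeffs = activations @ direction                      # projection coefficients, shape (..., )
    return activations - coeffs.unsqueeze(-1) * direction # subtract the projected component

def make_ablation_hook(concept_dir: torch.Tensor):
    # Forward hook that replaces a layer's hidden states with their ablated version.
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        ablated = ablate_concept(hidden, concept_dir.to(hidden.device, hidden.dtype))
        return (ablated, *output[1:]) if isinstance(output, tuple) else ablated
    return hook

# Hypothetical usage during fine-tuning (layer index and concept_vector are illustrative):
# layer = model.model.layers[12]
# handle = layer.register_forward_hook(make_ablation_hook(concept_vector))
# ...run the ordinary fine-tuning loop; the unwanted direction is zeroed out...
# handle.remove()
```

The key design choice is that the data and loss stay untouched; only the model's internal representation of the unwanted concept is removed while gradients flow through the projected activations.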
Why does it matter?
CAFT offers a way to steer AI models toward safer, more reliable behavior without collecting new data or modifying existing data, which helps in deploying trustworthy AI systems.
Abstract
Concept Ablation Fine-Tuning (CAFT) uses interpretability tools to control LLM generalization by ablating undesired concepts in latent space, reducing misaligned responses without altering training data.