AlignGuard-LoRA: Alignment-Preserving Fine-Tuning via Fisher-Guided Decomposition and Riemannian-Geodesic Collision Regularization
Amitava Das, Abhilekh Borah, Vinija Jain, Aman Chadha
2025-08-06
Summary
This paper introduces AlignGuard-LoRA, a method that lets large language models be fine-tuned on new tasks while preserving their safety training and alignment.
What's the problem?
When large language models are fine-tuned, even small parameter updates can erode the safety and behavioral constraints instilled during alignment training, a phenomenon known as alignment drift, which makes the models less safe and reliable.
What's the solution?
AlignGuard-LoRA addresses this by decomposing fine-tuning updates into two components: one that is critical for preserving the model's safe behavior and one that captures the new task. Using Fisher information to identify the alignment-critical directions, it limits changes along those directions and adds a collision regularizer that keeps the two components from interfering with each other, so the model stays aligned while still learning the task.
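The decomposition idea can be sketched in a few lines. The snippet below is a hypothetical simplification, not the paper's actual implementation: it assumes a diagonal Fisher approximation, picks the highest-Fisher coordinates as "alignment-critical," and penalizes updates to those coordinates much more heavily than updates elsewhere. The function name, thresholding rule, and penalty weights are all illustrative assumptions; the geodesic collision term is omitted.

```python
import numpy as np

def fisher_guided_penalty(delta_w, fisher_diag, lam_aligned=1.0,
                          lam_task=0.01, top_frac=0.1):
    """Hypothetical sketch: split a fine-tuning update into
    alignment-critical and task-specific parts using a diagonal Fisher
    approximation, then regularize each part differently.

    delta_w     : proposed weight update (same shape as fisher_diag)
    fisher_diag : per-parameter Fisher information (importance for alignment)
    """
    # Treat the top fraction of parameters by Fisher information
    # as alignment-critical (an illustrative thresholding rule).
    k = max(1, int(top_frac * fisher_diag.size))
    thresh = np.sort(fisher_diag.ravel())[-k]
    critical_mask = fisher_diag >= thresh

    delta_critical = delta_w * critical_mask     # drift on safety-relevant weights
    delta_task = delta_w * (~critical_mask)      # learning on task-relevant weights

    # Strongly damp alignment-critical drift (Fisher-weighted);
    # lightly regularize the task-specific remainder.
    penalty = (lam_aligned * np.sum(fisher_diag * delta_critical ** 2)
               + lam_task * np.sum(delta_task ** 2))
    return penalty, critical_mask
```

In a training loop, this penalty would be added to the task loss, so gradient descent is free to move task-specific weights but pays a steep cost for moving alignment-critical ones.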
Why it matters?
This matters because it lets AI models gain new capabilities without sacrificing their safety features, making fine-tuned models safer and more trustworthy to deploy.
Abstract
AlignGuard-LoRA (AGL) is a framework that preserves alignment during fine-tuning of large language models by introducing regularization techniques and a diagnostic benchmark to mitigate alignment drift.