AlignGuard-LoRA: Alignment-Preserving Fine-Tuning via Fisher-Guided Decomposition and Riemannian-Geodesic Collision Regularization
Amitava Das, Abhilekh Borah, Vinija Jain, Aman Chadha
2025-08-06
Summary
This paper introduces AlignGuard-LoRA, a method that lets large language models be fine-tuned on new tasks while preserving their safety training and alignment.
What's the problem?
When large language models are fine-tuned, even small parameter updates can erode the safety and behavioral constraints instilled during alignment training, a phenomenon known as alignment drift, which makes the models less safe and reliable.
What's the solution?
AlignGuard-LoRA addresses this by decomposing fine-tuning updates into two components: one that is critical for preserving the model's safe behavior and one that captures the new task. Using Fisher information to identify the alignment-critical directions, it limits changes along those directions and adds a collision regularizer that keeps the two components from interfering with each other, so the model stays aligned while still learning the task.
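The decomposition idea can be sketched in a few lines. The snippet below is a hypothetical simplification, not the paper's actual implementation: it assumes a diagonal Fisher approximation, picks the highest-Fisher coordinates as "alignment-critical," and penalizes updates to those coordinates much more heavily than updates elsewhere. The function name, thresholding rule, and penalty weights are all illustrative assumptions; the geodesic collision term is omitted.

```python
import numpy as np

def fisher_guided_penalty(delta_w, fisher_diag, lam_aligned=1.0,
                          lam_task=0.01, top_frac=0.1):
    """Hypothetical sketch: split a fine-tuning update into
    alignment-critical and task-specific parts using a diagonal Fisher
    approximation, then regularize each part differently.

    delta_w     : proposed weight update (same shape as fisher_diag)
    fisher_diag : per-parameter Fisher information (importance for alignment)
    """
    # Treat the top fraction of parameters by Fisher information
    # as alignment-critical (an illustrative thresholding rule).
    k = max(1, int(top_frac * fisher_diag.size))
    thresh = np.sort(fisher_diag.ravel())[-k]
    critical_mask = fisher_diag >= thresh

    delta_critical = delta_w * critical_mask     # drift on safety-relevant weights
    delta_task = delta_w * (~critical_mask)      # learning on task-relevant weights

    # Strongly damp alignment-critical drift (Fisher-weighted);
    # lightly regularize the task-specific remainder.
    penalty = (lam_aligned * np.sum(fisher_diag * delta_critical ** 2)
               + lam_task * np.sum(delta_task ** 2))
    return penalty, critical_mask
```

In a training loop, this penalty would be added to the task loss, so gradient descent is free to move task-specific weights but pays a steep cost for moving alignment-critical ones.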
Why it matters?
This matters because it lets AI models gain new capabilities without sacrificing their safety features, making fine-tuned models safer and more trustworthy to deploy.
Abstract
AlignGuard-LoRA (AGL) is a framework that preserves alignment during fine-tuning of large language models by introducing regularization techniques and a diagnostic benchmark to mitigate alignment drift.