BiasEdit: Debiasing Stereotyped Language Models via Model Editing
Xin Xu, Wei Xu, Ningyu Zhang, Julian McAuley
2025-03-12
Summary
This paper introduces BiasEdit, a method that stops AI language models from reproducing unfair stereotypes by tweaking specific parts of the model, without breaking its ability to write normally.
What's the problem?
AI models often pick up unfair stereotypes from their training data, and current fixes either don’t work well or mess up the model’s ability to write properly.
What's the solution?
BiasEdit trains small helper networks that edit only the parts of the model responsible for stereotypes, pairing the debiasing fix with a retention mechanism that keeps the model's language skills intact.
Why does it matter?
This helps create fairer AI tools for writing, customer service, or hiring software, reducing harmful stereotypes without breaking useful features.
Abstract
Previous studies have established that language models manifest stereotyped biases. Existing debiasing strategies, such as retraining a model with counterfactual data, representation projection, and prompting, often fail to efficiently eliminate bias or to directly alter the models' biased internal representations. To address these issues, we propose BiasEdit, an efficient model editing method that removes stereotypical bias from language models through lightweight networks that act as editors and generate parameter updates. BiasEdit employs a debiasing loss that guides the editor networks to perform local edits on a subset of a language model's parameters, while a retention loss preserves the model's language modeling abilities during editing. Experiments on StereoSet and CrowS-Pairs demonstrate the effectiveness, efficiency, and robustness of BiasEdit in eliminating bias compared to tangential debiasing baselines, with little to no impact on the language models' general capabilities. In addition, we conduct bias tracing to probe bias in various modules and explore the effects of bias editing on different components of language models.
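To make the objective concrete, below is a minimal sketch of the two losses the abstract describes: a debiasing term that equalizes the model's preference between stereotyped and anti-stereotyped continuations, and a retention term that keeps the edited model close to the frozen original on neutral text. Everything here (`EditorNet`, `debias_loss`, `retention_loss`, `lambda_retain`, the toy data) is an illustrative assumption rather than BiasEdit's actual code, and for brevity the sketch learns a low-rank parameter update directly instead of generating it from editor hypernetworks as the paper does.

```python
# Illustrative sketch only; not BiasEdit's actual code. All names below
# (EditorNet, debias_loss, retention_loss, lambda_retain) are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EditorNet(nn.Module):
    """Produces a low-rank update (delta W) for one edited weight matrix.

    BiasEdit generates updates with editor networks; this sketch simply
    learns the update directly to keep the example short.
    """

    def __init__(self, out_dim: int, in_dim: int, rank: int = 4):
        super().__init__()
        self.a = nn.Parameter(torch.zeros(out_dim, rank))
        self.b = nn.Parameter(torch.randn(rank, in_dim) * 0.01)

    def forward(self) -> torch.Tensor:
        return self.a @ self.b  # shape: (out_dim, in_dim)


def debias_loss(logp_stereo, logp_anti):
    # Push the edited model toward equal log-likelihood for stereotyped
    # and anti-stereotyped continuations of the same context.
    return (logp_stereo - logp_anti).pow(2).mean()


def retention_loss(logits_edited, logits_orig):
    # KL divergence to the frozen original model on neutral text, so the
    # edit stays local and language modeling ability is preserved.
    return F.kl_div(F.log_softmax(logits_edited, dim=-1),
                    F.softmax(logits_orig, dim=-1),
                    reduction="batchmean")


# Toy setup: one edited projection matrix, random stand-in activations.
torch.manual_seed(0)
vocab, hidden = 100, 32
w_frozen = torch.randn(vocab, hidden) * 0.02   # original (frozen) weights
editor = EditorNet(vocab, hidden)
opt = torch.optim.Adam(editor.parameters(), lr=1e-3)

h_bias = torch.randn(8, hidden)                # contexts probing a stereotype
h_neutral = torch.randn(8, hidden)             # neutral text for retention
stereo_ids = torch.randint(0, vocab, (8, 1))   # stereotyped next tokens
anti_ids = torch.randint(0, vocab, (8, 1))     # anti-stereotyped next tokens

lambda_retain = 1.0                            # assumed balancing weight
for _ in range(100):
    w_edited = w_frozen + editor()             # apply the local edit
    logp = F.log_softmax(h_bias @ w_edited.t(), dim=-1)
    loss = (debias_loss(logp.gather(1, stereo_ids), logp.gather(1, anti_ids))
            + lambda_retain * retention_loss(h_neutral @ w_edited.t(),
                                             h_neutral @ w_frozen.t()))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In BiasEdit itself the two losses play the same balancing role, but the parameter updates come from trained editor networks applied to selected modules of the language model rather than a directly learned delta as in this toy example.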