Be Careful When Fine-tuning On Open-Source LLMs: Your Fine-tuning Data Could Be Secretly Stolen!
Zhexin Zhang, Yuhao Sun, Junxiao Yang, Shiyao Cui, Hongning Wang, Minlie Huang
2025-05-22
Summary
This paper talks about a hidden danger where people who make open-source language models can secretly steal the special data you use to fine-tune their models, even if you only use the model from the outside and never share your data directly.
What's the problem?
When you fine-tune an open-source language model with your own private or valuable data, you might think it's safe, but there's a new way for the original model creators to set up a trap, called a backdoor, that lets them recover your data without your knowledge.
What's the solution?
The researchers discovered and explained how this backdoor attack works, showing that even with just black-box access—meaning the model creators can't see your data directly—they can still steal it if their model was set up in a certain way.
Why it matters?
This matters because it warns everyone to be careful when using open-source AI models, especially with sensitive or important data, and encourages the community to find safer ways to use and share these powerful tools.
Abstract
There is a newly identified risk that creators of open-source LLMs can extract fine-tuning data from downstream models through backdoor training, even with black-box access.