Hail to the Thief: Exploring Attacks and Defenses in Decentralised GRPO
Nikolay Blagoev, Oğuzhan Ersoy, Lydia Yiyu Chen
2025-11-14
Summary
This research focuses on a way to improve Large Language Models (LLMs) after they've been initially trained, called Group Relative Policy Optimization (GRPO). GRPO uses a type of learning called reinforcement learning where the model gets 'rewarded' for giving better answers to prompts. The paper then investigates a security weakness in GRPO systems, specifically when multiple computers work together to train the model.
What's the problem?
When many different computers are collaboratively improving an LLM using GRPO, a malicious user can secretly sabotage the process. They can do this by subtly changing the model on their computer with harmful 'tokens' – essentially, adding hidden instructions that make the model give incorrect or unwanted responses. This 'poisoning' can happen even if the malicious user only controls a small part of the overall training process, and can affect the model's performance on tasks like math and coding. The core issue is that GRPO's decentralized nature makes it vulnerable to these hidden attacks.
What's the solution?
The researchers demonstrated how easily these attacks work, achieving a 100% success rate in some cases with just 50 attempts. More importantly, they developed two different defense strategies. One defense works when all the computers are training the *same* model, and the other works when each computer is training a *different* model. Both defenses are designed to detect and stop the malicious tokens from being incorporated into the model, effectively preventing the poisoning attack. They showed these defenses could be 100% effective at stopping the attacks.
Why it matters?
This work is important because as LLMs become more powerful and are used in more critical applications, ensuring their security is crucial. GRPO is a promising technique for improving these models, but this research highlights a significant vulnerability. By identifying the attack and providing effective defenses, the researchers help make decentralized LLM training more secure and reliable, preventing bad actors from manipulating these powerful tools.
Abstract
Group Relative Policy Optimization (GRPO) has demonstrated great utilization in post-training of Large Language Models (LLMs). In GRPO, prompts are answered by the model and, through reinforcement learning, preferred completions are learnt. Owing to the small communication volume, GRPO is inherently suitable for decentralised training as the prompts can be concurrently answered by multiple nodes and then exchanged in the forms of strings. In this work, we present the first adversarial attack in decentralised GRPO. We demonstrate that malicious parties can poison such systems by injecting arbitrary malicious tokens in benign models in both out-of-context and in-context attacks. Using empirical examples of math and coding tasks, we show that adversarial attacks can easily poison the benign nodes, polluting their local LLM post-training, achieving attack success rates up to 100% in as few as 50 iterations. We propose two ways to defend against these attacks, depending on whether all users train the same model or different models. We show that these defenses can achieve stop rates of up to 100%, making the attack impossible.