Training-Free Group Relative Policy Optimization
Yuzheng Cai, Siqi Cai, Yuchen Shi, Zihan Xu, Lichao Chen, Yulei Qin, Xiaoyu Tan, Gang Li, Zongyi Li, Haojia Lin, Yong Mao, Ke Li, Xing Sun
2025-10-10
Summary
This paper explores a new way to improve how Large Language Models (LLMs) perform specific tasks, like solving math problems or searching the web, without actually changing the LLM itself.
What's the problem?
LLMs are really good at a lot of things in general, but they often struggle when you ask them to do something very specific, especially if it requires using outside tools. Current fixes involve retraining the LLM, which is expensive, needs a lot of data, and can leave the model *too* specialized, losing some of its general abilities.
What's the solution?
The researchers propose a method called Training-Free Group Relative Policy Optimization (Training-Free GRPO). Instead of updating the LLM's weights, they have the model attempt each training problem several times, compare the attempts within each group against the known answer, and write down, in plain language, what separated the good attempts from the bad ones. These distilled lessons build up into a small library of 'experiences' that acts as a learned token prior: it is simply added to the prompt, nudging the model toward the words and strategies that are likely to lead to good results in that domain. It's a much cheaper and faster way to improve performance than retraining; a rough sketch of the loop is shown below.
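To make the idea concrete, here is a minimal sketch of that learning loop, assuming a generic chat-completion API; helper names like `call_llm` and `solve` are illustrative placeholders, not the authors' code.

```python
# Minimal sketch of the Training-Free GRPO learning loop (hypothetical helper
# names, not the authors' code). The LLM's weights are never updated; the only
# thing that changes is a small library of natural-language "experiences" that
# gets prepended to every prompt as a learned token prior.

def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion API call."""
    raise NotImplementedError

def solve(question: str, experiences: list[str]) -> str:
    """Ask the model to solve a task, conditioned on the current experience library."""
    prior = "\n".join(f"- {e}" for e in experiences)
    return call_llm(f"Useful lessons from past attempts:\n{prior}\n\nTask: {question}")

def training_free_grpo(train_set, group_size: int = 4, epochs: int = 3) -> list[str]:
    experiences: list[str] = []          # the "learned token prior"
    for _ in range(epochs):              # multi-epoch learning on a small dataset
        for question, answer in train_set:
            # 1. Sample a group of rollouts, as in standard GRPO.
            rollouts = [solve(question, experiences) for _ in range(group_size)]
            # 2. Instead of a numerical advantage, ask the model to compare the
            #    rollouts against the ground truth and state, in plain language,
            #    what distinguished the better attempts (a "semantic advantage").
            lesson = call_llm(
                "Ground-truth answer: " + answer + "\n"
                "Candidate attempts:\n" + "\n---\n".join(rollouts) + "\n"
                "In one sentence, state a reusable lesson that explains why the "
                "better attempts succeeded."
            )
            # 3. Distil the lesson into the experience library.
            experiences.append(lesson.strip())
    return experiences
```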
Why it matters?
This research is important because it offers a practical way to make LLMs better at specialized tasks without the huge cost and complexity of retraining. It means you can get more out of existing LLMs with limited data and resources, making them more useful in real-world applications. It shows that with just a few dozen examples you can significantly improve performance, even beating smaller models that *have* been fine-tuned.
Abstract
Recent advances in Large Language Model (LLM) agents have demonstrated their promising general capabilities. However, their performance in specialized real-world domains often degrades due to challenges in effectively integrating external tools and specific prompting strategies. While methods such as agentic reinforcement learning have been proposed to address this, they typically rely on costly parameter updates, for example through Supervised Fine-Tuning (SFT) followed by a Reinforcement Learning (RL) phase with Group Relative Policy Optimization (GRPO) to alter the output distribution. We argue, however, that LLMs can achieve a similar effect on the output distribution by learning experiential knowledge as a token prior, a far more lightweight approach that not only addresses practical data scarcity but also avoids the common issue of overfitting. To this end, we propose Training-Free Group Relative Policy Optimization (Training-Free GRPO), a cost-effective solution that enhances LLM agent performance without any parameter updates. Our method leverages a group-relative semantic advantage instead of a numerical one within each group of rollouts, iteratively distilling high-quality experiential knowledge during multi-epoch learning on minimal ground-truth data. This knowledge serves as the learned token prior and is seamlessly integrated during LLM API calls to guide model behavior. Experiments on mathematical reasoning and web searching tasks demonstrate that Training-Free GRPO, when applied to DeepSeek-V3.1-Terminus, significantly improves out-of-domain performance. With just a few dozen training samples and marginal training cost, Training-Free GRPO outperforms fine-tuned small LLMs.
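Since the learned knowledge is "integrated during LLM API calls," inference reduces to an ordinary chat-completion request with the experiences injected into the prompt. The sketch below illustrates this using an OpenAI-compatible client; the base URL, API key, and model name are placeholder assumptions, not details from the paper.

```python
# Sketch of inference with the learned token prior: the distilled experiences
# are injected into a standard chat-completion call, so no parameter update is
# ever needed. Client setup and model name below are illustrative placeholders.
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")

def answer_with_token_prior(question: str, experiences: list[str]) -> str:
    system = (
        "You are a careful problem solver. Apply these lessons learned from "
        "past attempts:\n" + "\n".join(f"- {e}" for e in experiences)
    )
    resp = client.chat.completions.create(
        model="your-frontier-model",  # e.g. a DeepSeek-V3.1-class model in the paper
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```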