SofT-GRPO: Surpassing Discrete-Token LLM Reinforcement Learning via Gumbel-Reparameterized Soft-Thinking Policy Optimization
Zhi Zheng, Wee Sun Lee
2025-11-11
Summary
This paper explores a way to improve how large language models (LLMs) 'think' through a method called 'soft-thinking', in which the model reasons with continuous mixtures of token embeddings instead of committing to a single discrete token at each step. Soft-thinking is an alternative to the more common 'chain-of-thought' reasoning and can sometimes be more effective, but it has been difficult to train LLMs to use it well.
What's the problem?
While standard discrete-token 'chain-of-thought' reasoning can be improved with policy optimization techniques such as GRPO, the same has proven hard for 'soft-thinking'. The difficulty is that reinforcement learning needs stochasticity in the model's outputs to explore, and it is not obvious how to inject that randomness into continuous soft-thinking tokens, nor how to update the policy through the resulting sampling step. Previous attempts to combine soft-thinking with these optimization techniques have underperformed the standard discrete-token methods.
What's the solution?
The researchers developed a new policy optimization algorithm called SofT-GRPO. It injects Gumbel noise into the model's logits and uses the Gumbel-Softmax technique, which keeps each soft-thinking token a weighted mixture over the vocabulary, so the token stays within the embedding space the model learned during pre-training. Because the noise is separated from the model's parameters, the 'reparameterization trick' can then be used in the policy gradient, making the learning process more stable and effective.
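A minimal numpy sketch of the Gumbel-Softmax step described above (illustrative only: the vocabulary size, embedding dimension, and variable names are invented, and this is not the paper's implementation). Gumbel noise is added to the logits, a temperature-scaled softmax turns them into mixing weights, and the soft-thinking token is the resulting convex combination of pre-trained token embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau=1.0):
    """Sample a relaxed one-hot vector via the Gumbel-Softmax trick."""
    # Gumbel(0, 1) noise: g = -log(-log(u)), u ~ Uniform(0, 1)
    u = np.clip(rng.uniform(size=logits.shape), 1e-12, 1 - 1e-12)
    g = -np.log(-np.log(u))
    y = (logits + g) / tau
    y = y - y.max()          # subtract max for numerical stability
    e = np.exp(y)
    return e / e.sum()       # non-negative weights summing to 1

# Hypothetical vocabulary of 5 tokens with 3-dim embeddings.
logits = np.array([2.0, 1.0, 0.5, 0.1, -1.0])
embeddings = rng.normal(size=(5, 3))

weights = gumbel_softmax(logits, tau=0.5)
# Convex combination of embeddings: the soft token never leaves
# the span of the pre-trained embedding vectors.
soft_token = weights @ embeddings
```

Lower temperatures `tau` push the weights toward a one-hot vector (recovering discrete-token behavior), while higher temperatures yield smoother mixtures.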
Why it matters?
This work is important because it unlocks the potential of 'soft-thinking' as a reasoning method for LLMs. Their experiments show that SofT-GRPO lets soft-thinking models slightly outperform discrete-token GRPO on single-attempt accuracy (Pass@1) and substantially outperform it when multiple attempts are allowed (Pass@32), suggesting it is a valuable step toward more powerful and flexible AI reasoning.
Abstract
The soft-thinking paradigm for Large Language Model (LLM) reasoning can outperform the conventional discrete-token Chain-of-Thought (CoT) reasoning in some scenarios, underscoring its research and application value. However, while the discrete-token CoT reasoning pattern can be reinforced through policy optimization algorithms such as group relative policy optimization (GRPO), extending the soft-thinking pattern with Reinforcement Learning (RL) remains challenging. This difficulty stems from the complexities of injecting stochasticity into soft-thinking tokens and updating soft-thinking policies accordingly. As a result, previous attempts to combine soft-thinking with GRPO typically underperform their discrete-token GRPO counterparts. To fully unlock the potential of soft-thinking, this paper presents a novel policy optimization algorithm, SofT-GRPO, to reinforce LLMs under the soft-thinking reasoning pattern. SofT-GRPO injects Gumbel noise into the logits, employs the Gumbel-Softmax technique to avoid soft-thinking tokens outside the pre-trained embedding space, and leverages the reparameterization trick in the policy gradient. We conduct experiments across base LLMs ranging from 1.5B to 7B parameters, and results demonstrate that SofT-GRPO enables soft-thinking LLMs to slightly outperform discrete-token GRPO on Pass@1 (+0.13% on average accuracy), while exhibiting a substantial uplift on Pass@32 (+2.19% on average accuracy). Code and weights are available at https://github.com/zz1358m/SofT-GRPO-master
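To see why the reparameterization trick mentioned in the abstract helps, here is a small numpy check (illustrative only, not the paper's implementation): once the Gumbel noise `g` is held fixed, the sampled soft distribution is a deterministic, differentiable function of the logits, so gradients can flow through the sampling step. The analytic pathwise Jacobian of the temperature-scaled softmax matches a finite-difference estimate computed with the same fixed noise:

```python
import numpy as np

rng = np.random.default_rng(1)
tau = 0.7
logits = np.array([1.5, 0.2, -0.3])
# Fixed Gumbel(0, 1) noise: the "reparameterized" source of randomness.
g = -np.log(-np.log(rng.uniform(size=3)))

def sample(logits):
    """Deterministic function of logits given the fixed noise g."""
    y = (logits + g) / tau
    e = np.exp(y - y.max())
    return e / e.sum()

y = sample(logits)

# Analytic pathwise Jacobian: dy_i/dlogits_j = y_i (delta_ij - y_j) / tau
J = (np.diag(y) - np.outer(y, y)) / tau

# Central finite differences, reusing the same noise g each time.
eps = 1e-6
J_fd = np.zeros((3, 3))
for j in range(3):
    dp = logits.copy(); dp[j] += eps
    dm = logits.copy(); dm[j] -= eps
    J_fd[:, j] = (sample(dp) - sample(dm)) / (2 * eps)
```

If the noise were resampled inside `sample`, the finite-difference estimate would be dominated by sampling variance; fixing the noise is exactly what makes low-variance policy gradients possible here.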