
Rectifying LLM Thought from Lens of Optimization

Junnan Liu, Hongwei Liu, Songyang Zhang, Kai Chen

2025-12-02


Summary

This paper investigates why large language models, even when they seem to 'think' step-by-step, sometimes struggle with complex problems, and proposes a way to improve their reasoning abilities.

What's the problem?

Large language models are getting better at solving problems by showing their work: thinking through the problem in a series of steps, a technique called 'chain-of-thought' (CoT) prompting. However, these models often get stuck in overly long or convoluted thought processes, which actually *hurts* their ability to find the correct answer. It's like overthinking a test question and getting more confused. The core issue is that the way these models 'reason' isn't always efficient or focused.

What's the solution?

The researchers frame the reasoning process as an optimization procedure, specifically 'gradient descent', where each reasoning step is an update that should move the model closer to a solution. Building on this view, they developed a technique called RePro, which stands for Rectifying Process-level Reward. RePro gives the model feedback not just on the final answer, but also on *how* it's thinking: it scores the intensity of the reasoning (how much each step actually advances toward the solution) and its stability (how steady, rather than erratic, that progress is), then combines the two scores into a single process-level reward. This reward is used during reinforcement learning, so the model is rewarded for effective reasoning steps, not only for correct final answers.
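The dual-scoring idea can be sketched in code. This is a minimal illustration only, not the paper's actual formulas: the function name, the per-step progress scores, and the specific intensity/stability definitions below are all assumptions made for the sake of the example.

```python
# Hypothetical sketch of a RePro-style process-level reward.
# Assumes each reasoning step has been assigned a scalar "progress" score
# (e.g., by a scorer model); the exact scoring is not specified here.
from typing import List

def process_level_reward(step_scores: List[float],
                         alpha: float = 0.5) -> float:
    """Combine an intensity score (how much each step advances the solution)
    and a stability score (how erratic the trajectory is) into one reward."""
    if len(step_scores) < 2:
        return 0.0
    # Per-step "updates": differences in progress, analogous to gradient steps.
    deltas = [b - a for a, b in zip(step_scores, step_scores[1:])]
    # Intensity: average progress per step (higher = more decisive updates).
    intensity = sum(deltas) / len(deltas)
    # Stability: penalize oscillation, measured as the mean absolute change
    # between consecutive updates (zero for a perfectly smooth trajectory).
    if len(deltas) > 1:
        wobble = sum(abs(d2 - d1)
                     for d1, d2 in zip(deltas, deltas[1:])) / (len(deltas) - 1)
    else:
        wobble = 0.0
    stability = 1.0 / (1.0 + wobble)
    # Composite reward: a weighted sum of the two scores.
    return alpha * intensity + (1 - alpha) * stability
```

Under these toy definitions, a trajectory that makes steady progress (e.g., scores 0.0, 0.25, 0.5, 0.75, 1.0) earns a higher reward than one that oscillates back and forth, which mirrors the intuition of penalizing convoluted, unstable reasoning.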

Why does it matter?

This research is important because it helps us understand *why* large language models sometimes fail, even when they have the potential to succeed. By improving the reasoning process itself, rather than just focusing on the final answer, RePro makes these models more reliable and accurate across a variety of challenging tasks like math, science, and coding. This could lead to more trustworthy and helpful AI systems in the future.

Abstract

Recent advancements in large language models (LLMs) have been driven by their emergent reasoning capabilities, particularly through long chain-of-thought (CoT) prompting, which enables thorough exploration and deliberation. Despite these advances, long-CoT LLMs often exhibit suboptimal reasoning behaviors, such as overthinking and excessively protracted reasoning chains, which can impair performance. In this paper, we analyze reasoning processes through an optimization lens, framing CoT as a gradient descent procedure where each reasoning step constitutes an update toward problem resolution. Building on this perspective, we introduce RePro (Rectifying Process-level Reward), a novel approach to refine LLM reasoning during post-training. RePro defines a surrogate objective function to assess the optimization process underlying CoT, utilizing a dual scoring mechanism to quantify its intensity and stability. These scores are aggregated into a composite process-level reward, seamlessly integrated into reinforcement learning with verifiable rewards (RLVR) pipelines to optimize LLMs. Extensive experiments across multiple reinforcement learning algorithms and diverse LLMs, evaluated on benchmarks spanning mathematics, science, and coding, demonstrate that RePro consistently enhances reasoning performance and mitigates suboptimal reasoning behaviors.
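To make the RLVR integration concrete, here is one way a process-level reward could be folded into a verifiable-reward pipeline. This is a sketch under assumptions: the additive shaping, the weight `beta`, and the function names are illustrative, not the paper's stated design.

```python
# Hypothetical sketch: shaping an RLVR-style reward with a process-level term.
# The verifiable outcome reward is 1.0 if the final answer checks out, else 0.0;
# the process reward rates the reasoning trajectory itself.

def total_reward(outcome_correct: bool,
                 process_reward: float,
                 beta: float = 0.2) -> float:
    """Verifiable outcome reward plus a weighted process-level shaping term."""
    outcome = 1.0 if outcome_correct else 0.0
    return outcome + beta * process_reward
```

With a small `beta`, the verifiable outcome still dominates, while the process term breaks ties between trajectories that reach the same answer, favoring the one with more effective reasoning.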