FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization

Chiyu Ma, Shuo Yang, Kexin Huang, Jinda Lu, Haoming Meng, Shangshang Wang, Bolin Ding, Soroush Vosoughi, Guoyin Wang, Jingren Zhou

2026-04-01

Summary

This paper introduces a new method called FIPO, which improves how large language models perform complex reasoning tasks by helping them better understand which parts of their thought process are most important.

What's the problem?

Current methods for training these models often give equal credit to every word they generate, even though some words are much more crucial for reaching the correct answer than others. This is like giving everyone on a basketball team the same praise, even if some players didn't contribute much to the win. This limits how well the models can reason through long, complicated problems because they can't focus on the key steps.

What's the solution?

FIPO solves this by looking ahead and estimating how much each word will influence the rest of the model's reasoning. It then gives more weight to words that are predicted to be more important, essentially highlighting the critical thinking steps. This is done by incorporating a measure called 'future-KL divergence' into the training process, which helps the model learn to prioritize impactful tokens.
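The idea above can be illustrated with a toy sketch. Note this is a hypothetical illustration of "discounted future-KL re-weighting", not the paper's actual formulation: the function name, the discount factor, and the normalization are all assumptions made for clarity. The sketch weights each token's advantage by the discounted KL divergence accumulated over the tokens that come after it, so tokens that precede high-influence reasoning steps receive more credit.

```python
import numpy as np

def future_kl_weights(per_token_kl, gamma=0.9):
    """Hypothetical sketch: weight token t by the discounted KL of all
    tokens after t (its estimated influence on the rest of the trajectory).
    The real FIPO formulation in the paper may differ."""
    T = len(per_token_kl)
    w = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        w[t] = running                      # influence = discounted future KL
        running = per_token_kl[t] + gamma * running
    # normalize so weights average to 1, preserving the advantage scale
    return w * T / w.sum() if w.sum() > 0 else np.ones(T)

# toy trajectory: token 2 is a high-KL "logical pivot"
kl = np.array([0.1, 0.1, 2.0, 0.1, 0.1])
global_advantage = 1.0  # uniform outcome-based (ORM-style) advantage
dense_advantage = global_advantage * future_kl_weights(kl)
```

In this toy example, the tokens before the high-KL pivot end up with larger dense advantages than the tokens after it, which is the qualitative behavior the summary describes: critical thinking steps get highlighted while trivial trailing tokens are down-weighted.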

Why it matters?

This research is important because it shows a way to unlock the full potential of large language models for complex reasoning. By improving how these models assign credit to different parts of their thought process, FIPO allows them to handle longer and more challenging problems, achieving significantly better results on reasoning benchmarks compared to existing methods. It suggests that focusing on 'dense advantage formulations' is key to making these models even smarter.

Abstract

We present Future-KL Influenced Policy Optimization (FIPO), a reinforcement learning algorithm designed to overcome reasoning bottlenecks in large language models. While GRPO-style training scales effectively, it typically relies on outcome-based rewards (ORM) that distribute a global advantage uniformly across every token in a trajectory. We argue that this coarse-grained credit assignment imposes a performance ceiling by failing to distinguish critical logical pivots from trivial tokens. FIPO addresses this by incorporating discounted future-KL divergence into the policy update, creating a dense advantage formulation that re-weights tokens based on their influence on subsequent trajectory behavior. Empirically, FIPO enables models to break through the length stagnation seen in standard baselines. Evaluated on Qwen2.5-32B, FIPO extends the average chain-of-thought length from roughly 4,000 to over 10,000 tokens and increases AIME 2024 Pass@1 accuracy from 50.0% to a peak of 58.0% (converging at approximately 56.0%). This outperforms both DeepSeek-R1-Zero-Math-32B (around 47.0%) and o1-mini (approximately 56.0%). Our results suggest that establishing dense advantage formulations is a vital path for evolving ORM-based algorithms to unlock the full reasoning potential of base models. We open-source our training system, built on the verl framework.