Alleviating Sparse Rewards by Modeling Step-Wise and Long-Term Sampling Effects in Flow-Based GRPO
Yunze Tong, Mushui Liu, Canyu Zhao, Wanggui He, Shiyi Zhang, Hongwei Zhang, Peng Zhang, Jinlong Liu, Ju Huang, Jiamang Wang, Hao Jiang, Pipei Huang
2026-02-10
Summary
This paper introduces a new method, TurningPoint-GRPO (TP-GRPO), to improve how AI models generate images from text. It builds on existing techniques but addresses some key weaknesses in how those techniques provide feedback to the model during the image creation process.
What's the problem?
Current methods for guiding image generation with rewards give the same feedback to *all* steps of the denoising process based only on the final image quality, so it's hard to tell which specific steps were actually helpful or harmful. In addition, the group-wise comparisons these methods rely on only rank different generation attempts at matched timesteps, ignoring how earlier steps within a single trajectory can influence later ones through delayed, implicit interactions. Essentially, the feedback is too sparse and doesn't account for the full chain of events during image creation.
What's the solution?
TP-GRPO tackles these issues in two main ways. First, instead of rewarding only the final outcome, it gives a smaller reward at *each* step, based on how much that step specifically improved the image, which provides denser and more precise guidance. Second, it identifies 'turning points', the steps where the reward trend flips and the rest of the trajectory then follows the overall trend, and gives those steps extra credit for their long-term impact. Turning points are found simply by looking for sign changes in the step-wise reward increments, which keeps the method efficient and free of extra hyperparameters.
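To make the idea concrete, here is a minimal Python sketch of those two ingredients, assuming each intermediate denoising state can be scored by a reward model and taking the "overall trend" to be the sign of the trajectory's total reward change. The function names (`incremental_rewards`, `turning_points`) and the toy scores are illustrative, not the paper's actual implementation.

```python
import numpy as np

def incremental_rewards(step_scores):
    """Turn per-step reward-model scores along one denoising trajectory
    into step-level increments delta_t = r_t - r_{t-1}, i.e. how much
    each step by itself changed the reward."""
    scores = np.asarray(step_scores, dtype=float)
    return scores[1:] - scores[:-1]

def turning_points(deltas):
    """Flag steps whose increment flips the local reward trend and after
    which the increments stay consistent with the overall trend.

    The overall trend is taken here as the sign of the total reward change
    over the trajectory (an assumption made for this sketch)."""
    deltas = np.asarray(deltas, dtype=float)
    signs = np.sign(deltas)
    overall = np.sign(deltas.sum())
    flags = np.zeros(len(deltas), dtype=bool)
    for t in range(1, len(deltas)):
        flips_trend = signs[t] != signs[t - 1] and signs[t] == overall
        stays_consistent = bool(np.all(signs[t:] == overall))
        flags[t] = flips_trend and stays_consistent
    return flags

# Toy example: reward dips for two steps, then rises consistently.
scores = [0.32, 0.30, 0.28, 0.35, 0.41, 0.47]
deltas = incremental_rewards(scores)   # two negative increments, then positive
print(turning_points(deltas))          # only the third step is a turning point
```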
Why it matters?
This research is important because it leads to better and more consistent image generation. By providing more informative feedback and accounting for the long-term effects of each step, TP-GRPO helps AI models create images that more closely match the desired text description. This improves the overall quality and reliability of text-to-image AI systems.
Abstract
Deploying GRPO on Flow Matching models has proven effective for text-to-image generation. However, existing paradigms typically propagate an outcome-based reward to all preceding denoising steps without distinguishing the local effect of each step. Moreover, current group-wise ranking mainly compares trajectories at matched timesteps and ignores within-trajectory dependencies, where certain early denoising actions can affect later states via delayed, implicit interactions. We propose TurningPoint-GRPO (TP-GRPO), a GRPO framework that alleviates step-wise reward sparsity and explicitly models long-term effects within the denoising trajectory. TP-GRPO makes two key innovations: (i) it replaces outcome-based rewards with step-level incremental rewards, providing a dense, step-aware learning signal that better isolates each denoising action's "pure" effect, and (ii) it identifies turning points, i.e., steps that flip the local reward trend and make subsequent reward evolution consistent with the overall trajectory trend, and assigns these actions an aggregated long-term reward to capture their delayed impact. Turning points are detected solely via sign changes in incremental rewards, making TP-GRPO efficient and hyperparameter-free. Extensive experiments demonstrate that TP-GRPO exploits reward signals more effectively and consistently improves generation. Demo code is available at https://github.com/YunzeTong/TurningPoint-GRPO.
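The abstract does not state the exact aggregation rule, so the sketch below assumes the long-term credit for a turning point is the sum of the incremental rewards that follow it, added on top of the dense per-step signal. `step_rewards_with_turning_points` and its arguments are illustrative names rather than the paper's API.

```python
import numpy as np

def step_rewards_with_turning_points(deltas, is_turning):
    """Combine dense step-level increments with long-term credit.

    deltas:     per-step incremental rewards for one trajectory.
    is_turning: boolean mask marking turning-point steps.

    Every step keeps its own increment, which isolates its local effect;
    a turning-point step is additionally credited with the increments that
    accumulate after it (summing them is an illustrative assumption)."""
    deltas = np.asarray(deltas, dtype=float)
    rewards = deltas.copy()
    for t in np.flatnonzero(np.asarray(is_turning)):
        rewards[t] += deltas[t + 1:].sum()  # delayed, long-term effect
    return rewards
```

In a GRPO-style update, per-step rewards like these would then be normalized across the group of sampled trajectories to form advantages, replacing the single outcome reward that existing paradigms broadcast to every denoising step.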