TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics
Shirui Chen, Cole Harrison, Ying-Chun Lee, Angela Jin Yang, Zhongzheng Ren, Lillian J. Ratliff, Jiafei Duan, Dieter Fox, Ranjay Krishna
2026-02-24
Summary
This paper focuses on improving how robots learn from instructions and visual information, specifically when the reward for doing a good job is delayed or hard to define.
What's the problem?
Robots are getting better at understanding language and vision together, but they still struggle to learn complex tasks through trial and error, especially in the real world. This is because it's hard for them to tell whether they're making progress when they don't get clear, immediate feedback. Existing methods for estimating progress often break down when applied to tasks outside their training data.
What's the solution?
The researchers developed a new method called TOPReward. Instead of asking a vision-language model to directly *tell* it how much progress the robot has made (which can be numerically inaccurate), TOPReward reads the model's *internal* calculations, specifically the 'token logits', to estimate how well the robot is doing. This taps the knowledge the model already has about the world and how tasks typically unfold, providing a more reliable signal of progress.
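The core idea of reading logits rather than decoded text can be illustrated with a small sketch. This is an illustrative assumption, not the paper's exact recipe: here the model is imagined to score candidate progress tokens ("0" through "100"), and the estimate is the probability-weighted expectation over those tokens instead of whatever single number the model happens to print.

```python
import math

def expected_progress(logits_by_token):
    """Estimate task progress from a VLM's next-token logits.

    Hypothetical sketch: rather than parsing the model's decoded text
    (prone to numerical misrepresentation), read the logits assigned to
    candidate progress tokens and take the probability-weighted
    expectation. The token set and 0-100 scale are illustrative
    assumptions, not TOPReward's exact formulation.
    """
    # Softmax restricted to the candidate progress tokens.
    max_logit = max(logits_by_token.values())
    exps = {tok: math.exp(l - max_logit) for tok, l in logits_by_token.items()}
    z = sum(exps.values())
    # Expectation of the numeric value each token denotes.
    return sum(int(tok) * e / z for tok, e in exps.items())

# Toy logits: the model leans toward "50", with some mass on neighbors.
logits = {"0": -2.0, "25": 0.5, "50": 3.0, "75": 0.4, "100": -1.5}
progress = expected_progress(logits)
```

Because the estimate blends probability mass across candidates, it degrades gracefully when the model is uncertain, whereas parsing a single decoded number discards that uncertainty entirely.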
Why it matters?
TOPReward significantly improves a robot's ability to learn new tasks without needing a lot of examples or perfectly designed rewards. It works well across many different tasks and robot types, and can also be used to determine if a robot has successfully completed a task or to help it learn by imitating good behavior. This is a big step towards creating robots that can more easily adapt to and perform tasks in the real world.
Abstract
While Vision-Language-Action (VLA) models have seen rapid progress in pretraining, their advancement in Reinforcement Learning (RL) remains hampered by low sample efficiency and sparse rewards in real-world settings. Developing generalizable process reward models is essential for providing the fine-grained feedback necessary to bridge this gap, yet existing temporal value functions often fail to generalize beyond their training domains. We introduce TOPReward, a novel, probabilistically grounded temporal value function that leverages the latent world knowledge of pretrained video Vision-Language Models (VLMs) to estimate robotic task progress. Unlike prior methods that prompt VLMs to directly output progress values, which are prone to numerical misrepresentation, TOPReward extracts task progress directly from the VLM's internal token logits. In zero-shot evaluations across 130+ distinct real-world tasks and multiple robot platforms (e.g., Franka, YAM, SO-100/101), TOPReward achieves 0.947 mean Value-Order Correlation (VOC) on Qwen3-VL, dramatically outperforming the state-of-the-art GVL baseline which achieves near-zero correlation on the same open-source model. We further demonstrate that TOPReward serves as a versatile tool for downstream applications, including success detection and reward-aligned behavior cloning.
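The abstract reports results in Value-Order Correlation (VOC) without defining it here. A common way to score whether predicted per-frame values respect temporal order is rank correlation; the sketch below assumes VOC behaves like a Spearman rank correlation between a trajectory's predicted progress values and the ground-truth frame order (1.0 = perfectly ordered), which may differ in detail from the paper's definition.

```python
def value_order_correlation(values):
    """Rank-correlate predicted per-frame values with true frame order.

    Assumption: VOC is taken as the Spearman rank correlation between
    the predicted progress values of a trajectory and the temporal
    indices of its frames (valid for distinct values, i.e. no ties).
    """
    n = len(values)
    # Rank of each predicted value within the trajectory.
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0] * n
    for r, i in enumerate(order):
        ranks[i] = r
    # Pearson correlation between the ranks and the identity order 0..n-1.
    mean = (n - 1) / 2
    cov = sum((ranks[i] - mean) * (i - mean) for i in range(n))
    var = sum((i - mean) ** 2 for i in range(n))
    return cov / var

# Monotonically increasing predictions -> perfect ordering (VOC = 1.0).
voc = value_order_correlation([0.1, 0.4, 0.7, 0.95])
```

Under this reading, the reported 0.947 mean VOC means TOPReward's per-frame estimates almost always increase in step with actual task progress, while a near-zero score (as for the GVL baseline on the same model) means the estimates are essentially uncorrelated with frame order.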