TTRV: Test-Time Reinforcement Learning for Vision Language Models
Akshit Singh, Shyam Marjit, Wei Lin, Paul Gavrikov, Serena Yeung-Levy, Hilde Kuehne, Rogerio Feris, Sivan Doveh, James Glass, M. Jehanzeb Mirza
2025-10-09
Summary
This paper introduces a new method called TTRV that improves how well AI models understand both images and language, specifically in tasks like identifying objects and answering questions about images.
What's the problem?
Current AI models that learn through trial and error, a process called Reinforcement Learning, usually need a lot of specifically labeled data to train. This is different from how humans learn: we learn directly from interacting with the world, without someone constantly telling us what's right or wrong. Existing methods struggle to adapt to new situations without this labeled data.
What's the solution?
TTRV tackles this by letting the AI model improve *while* it's being used, not just during a separate training phase. For each test input, the model answers multiple times and receives 'rewards' based on how often it produces each answer, encouraging it to be more consistent. It also controls the diversity of the model's answers by additionally rewarding a low-entropy (more concentrated) answer distribution, so the model settles on its most reliable answer rather than spreading across many possibilities. This adaptation happens 'on the fly' with no extra labeled data needed.
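The paper's exact reward formulation isn't reproduced here, but the idea of frequency-based rewards plus a low-entropy bonus can be sketched as follows. This is a minimal illustration, assuming a per-sample reward equal to that answer's empirical frequency plus a weighted bonus for a low-entropy answer distribution; the function name `frequency_entropy_rewards` and the `entropy_weight` parameter are illustrative, not from the paper.

```python
import math
from collections import Counter

def frequency_entropy_rewards(sampled_answers, entropy_weight=0.5):
    """Sketch of TTRV-style test-time rewards (illustrative, not the
    paper's exact formula): each sampled answer is rewarded by its
    empirical frequency among the samples, plus a shared bonus when the
    empirical answer distribution has low entropy (i.e., is consistent)."""
    n = len(sampled_answers)
    counts = Counter(sampled_answers)
    # Empirical distribution over distinct answers.
    probs = {a: c / n for a, c in counts.items()}
    # Shannon entropy of the empirical distribution (lower = more consistent).
    entropy = -sum(p * math.log(p) for p in probs.values())
    # Normalize by the maximum possible entropy; guard the single-answer case.
    max_entropy = math.log(len(counts)) if len(counts) > 1 else 1.0
    entropy_bonus = entropy_weight * (1.0 - entropy / max_entropy)
    # Per-sample reward: frequency of that sample's answer plus the bonus.
    return [probs[a] + entropy_bonus for a in sampled_answers]

# Four samples for one test image: the majority answer gets a higher reward.
rewards = frequency_entropy_rewards(["cat", "cat", "dog", "cat"])
```

In a full pipeline these rewards would feed into GRPO-style policy updates; here they only show how consistency can be turned into a label-free training signal.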
Why it matters?
This research is important because TTRV allows AI models to perform as well as, or even better than, some of the most advanced, privately owned AI systems like GPT-4o, especially in image recognition. It shows that AI can learn and improve in real-time, even with very limited information, making it more adaptable and potentially more useful in real-world scenarios where labeled data is scarce.
Abstract
Existing methods for extracting reward signals in Reinforcement Learning typically rely on labeled data and dedicated training splits, a setup that contrasts with how humans learn directly from their environment. In this work, we propose TTRV to enhance vision language understanding by adapting the model on the fly at inference time, without the need for any labeled data. Concretely, we enhance the Group Relative Policy Optimization (GRPO) framework by designing rewards based on the frequency of the base model's output, while inferring on each test sample multiple times. Further, we also propose to control the diversity of the model's output by simultaneously rewarding the model for obtaining low entropy of the output empirical distribution. Our approach delivers consistent gains across both object recognition and visual question answering (VQA), with improvements of up to 52.4% and 29.8%, respectively, and average boosts of 24.6% and 10.0% across 16 datasets. Remarkably, on image recognition, TTRV applied to InternVL 8B surpasses GPT-4o by an average of 2.3% over 8 benchmarks, while remaining highly competitive on VQA, demonstrating that test-time reinforcement learning can match or exceed the strongest proprietary models. Finally, we find many interesting properties of test-time RL for VLMs: for example, even in extremely data-constrained scenarios, where adaptation is performed on a single randomly chosen unlabeled test example, TTRV still yields non-trivial improvements of up to 5.5% in recognition tasks.