Video Reasoning without Training

Deepak Sridhar, Kartikeya Bhardwaj, Jeya Pradha Jeyaraj, Nuno Vasconcelos, Ankita Nayak, Harris Teague

2025-10-22

Summary

This paper focuses on improving how well large AI models understand and reason about videos, specifically addressing the high cost and limited control in current methods.

What's the problem?

Currently, getting these AI models to 'think' through video reasoning problems requires a lot of computing power, either through complex training processes like reinforcement learning or by having the model explain its thinking step-by-step, which produces lengthy outputs. We also don't really understand *how* these models think, and controlling that process is difficult: they can get stuck exploring random ideas instead of converging on a solution.

What's the solution?

The researchers discovered that good reasoning models naturally balance exploring different possibilities with focusing on promising ones, and they measure this balance using 'entropy' – basically, how uncertain the model is about its next output. They then developed a method called V-Reason that subtly adjusts the model's internal workings *while it's being used* (during inference), guiding it to explore and focus more effectively, without any additional training or labeled data. It's like giving the model a little nudge in the right direction.
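To make the idea concrete, here is a minimal toy sketch of inference-time tuning with an entropy objective. Note the simplifications: the paper adapts the LMM's *value cache* through a small trainable controller, whereas this sketch just learns an additive bias over a single logit vector; the function names, step count, and learning rate are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    """Shannon entropy of a probability vector (natural log)."""
    return -np.sum(p * np.log(p + 1e-12))

def tune_logits_for_entropy(logits, steps=50, lr=0.5):
    """Toy analogue of entropy-based inference-time tuning:
    learn a small additive bias (the 'controller') by gradient
    descent on the output entropy -- no labels, no RL.
    (V-Reason itself adapts the value cache, not raw logits.)"""
    bias = np.zeros_like(logits)  # tiny trainable controller
    for _ in range(steps):
        p = softmax(logits + bias)
        H = entropy(p)
        # analytic gradient of entropy w.r.t. logits: dH/dz = -p * (log p + H)
        grad = -p * (np.log(p + 1e-12) + H)
        bias -= lr * grad  # descend on entropy: sharpen toward a solution
    return bias
```

Running this on a near-uniform logit vector lowers the output entropy, i.e. it pushes the model from an exploratory (uncertain) distribution toward a committed (exploitation-like) one; flipping the sign of the update would instead encourage exploration.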

Why it matters?

This work is important because it makes video reasoning with AI much more efficient and accurate. It gets the performance close to models that require extensive training, but without the huge computational cost. This means we can potentially use these powerful AI models for video understanding in more practical situations, and it gives us a better understanding of how these models actually 'think' when solving problems.

Abstract

Video reasoning using Large Multimodal Models (LMMs) relies on costly reinforcement learning (RL) and verbose chain-of-thought, resulting in substantial computational overhead during both training and inference. Moreover, the mechanisms that control the thinking process in these reasoning models are very limited. In this paper, using entropy of the model's output as a signal, we discover that the high-quality models go through a series of micro-explorations and micro-exploitations which keep the reasoning process grounded (i.e., avoid excessive randomness while the model is exploring or thinking through an answer). We further observe that once this "thinking" process is over, more accurate models demonstrate a better convergence by reducing the entropy significantly via a final exploitation phase (i.e., a more certain convergence towards a solution trajectory). We then use these novel, theoretically-grounded insights to tune the model's behavior directly at inference, without using any RL or supervised fine-tuning. Specifically, during inference, our proposed approach called V-Reason (Video-Reason) adapts the value cache of the LMM via a few optimization steps on a small, trainable controller using an entropy-based objective, i.e., no supervision from any dataset or RL is necessary. This tuning improves the model's micro-exploration and exploitation behavior during inference. Our experiments show that our proposed method achieves significant improvements over the base instruction-tuned models across several video reasoning datasets, narrowing the gap with RL-trained models to within 0.6% average accuracy without any training, while offering massive efficiency benefits: output tokens are reduced by 58.6% compared to the RL model.
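The abstract's diagnostic signal can be sketched very simply: track the entropy of the model's output distribution at each decoding step, and read high-entropy stretches as micro-explorations and low-entropy stretches as micro-exploitations. The threshold below is a hypothetical illustration, not a value from the paper.

```python
import numpy as np

def stepwise_entropy(prob_seq):
    """Per-step Shannon entropy of a decoding trajectory,
    where prob_seq[t] is the output distribution at step t."""
    probs = np.asarray(prob_seq)
    return -np.sum(probs * np.log(probs + 1e-12), axis=-1)

def label_phases(entropies, threshold=1.0):
    """Crude phase labeling (threshold is an assumption):
    high entropy -> exploring, low entropy -> exploiting."""
    return ["explore" if h > threshold else "exploit" for h in entropies]
```

Under this reading, an accurate model's trajectory would end in a sustained run of "exploit" labels, matching the final low-entropy convergence phase the authors observe.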