EVA: Efficient Reinforcement Learning for End-to-End Video Agent

Yaolun Zhang, Ruohui Wang, Jiahao Wang, Yepeng Tang, Xuanyu Zheng, Haonan Duan, Hao Lu, Hanming Deng, Lewei Lu

2026-03-26

Summary

This paper introduces EVA, a new system designed to help computers understand videos more efficiently. It focuses on making these systems, called multimodal large language models, better at processing the long and complex information found in videos.

What's the problem?

Computers struggle to understand videos because videos contain a huge amount of information – many frames over time – and much of it is redundant. Existing methods either process the entire video at once or sample frames at fixed intervals, neither of which adapts to the question being asked. Some newer approaches use 'agents' to help, but these agents still follow manually designed workflows and try to perceive *everything* before deciding what is important, which makes them slow on long videos.

What's the solution?

EVA solves this with a strategy called 'planning-before-perception'. Think of it like a person deciding *what* to look at in a video and *when* to look at it, instead of watching everything: EVA repeatedly summarizes what it has seen so far, plans its next step, acts on the video, and reflects on the result. It is trained with a three-stage learning pipeline – first imitating examples (supervised fine-tuning), then refining its strategy from successes and failures (KTO), and finally optimizing its overall policy with reinforcement learning (GRPO). This allows EVA to autonomously decide which parts of a video are relevant for answering a specific question.
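The iterative plan-act-reflect idea above can be sketched as a toy loop. This is a minimal illustration under stated assumptions, not EVA's actual implementation: the functions `plan_segment` and `agent`, the segment size, and the toy "video" of labeled frames are all hypothetical stand-ins for the learned policy and the real perception model.

```python
# Hypothetical sketch of a planning-before-perception loop.
# All names and the toy video are illustrative, not EVA's API.

def plan_segment(summary, video_len, watched, seg_size=10):
    """Plan: pick the next unwatched segment (a stand-in for a learned
    policy that would choose based on the question and running summary)."""
    for start in range(0, video_len, seg_size):
        seg = (start, min(start + seg_size, video_len))
        if seg not in watched:
            return seg
    return None  # everything has been watched

def agent(question, video, max_rounds=4):
    """Iteratively decide which segment to watch before perceiving it."""
    summary, watched = [], []
    for _ in range(max_rounds):
        seg = plan_segment(summary, len(video), watched)
        if seg is None:
            break
        watched.append(seg)
        # Action: perceive only the chosen segment, not the whole video.
        observation = video[seg[0]:seg[1]]
        # Reflection: update the summary; stop early once evidence appears.
        summary.extend(f for f in observation if question in f)
        if summary:
            break
    return summary, watched

# Toy video of 40 "frames"; the relevant event is at frame 12.
video = [f"frame-{i}" + ("-goal" if i == 12 else "") for i in range(40)]
evidence, watched = agent("goal", video)
```

In this toy run the agent stops after inspecting only the first two segments, which is the point of query-driven perception: irrelevant frames are never processed.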

Why it matters?

This research is important because it significantly improves how well computers can understand videos. EVA is more efficient and accurate than previous methods, improving on general MLLM baselines by 6-12% and on prior adaptive agent methods by a further 1-3% across six video understanding benchmarks. This could lead to better video analysis for things like self-driving cars, video surveillance, or simply helping you quickly find the information you need in a long video.

Abstract

Video understanding with multimodal large language models (MLLMs) remains challenging due to the long token sequences of videos, which contain extensive temporal dependencies and redundant frames. Existing approaches typically treat MLLMs as passive recognizers, processing entire videos or uniformly sampled frames without adaptive reasoning. Recent agent-based methods introduce external tools, yet still depend on manually designed workflows and perception-first strategies, resulting in inefficiency on long videos. We present EVA, an Efficient Reinforcement Learning framework for End-to-End Video Agent, which enables planning-before-perception through iterative summary-plan-action-reflection reasoning. EVA autonomously decides what to watch, when to watch, and how to watch, achieving query-driven and efficient video understanding. To train such agents, we design a simple yet effective three-stage learning pipeline - comprising supervised fine-tuning (SFT), Kahneman-Tversky Optimization (KTO), and Generalized Reward Policy Optimization (GRPO) - that bridges supervised imitation and reinforcement learning. We further construct high-quality datasets for each stage, supporting stable and reproducible training. We evaluate EVA on six video understanding benchmarks, demonstrating its comprehensive capabilities. Compared with existing baselines, EVA achieves a substantial improvement of 6-12% over general MLLM baselines and a further 1-3% gain over prior adaptive agent methods. Our code and model are available at https://github.com/wangruohui/EfficientVideoAgent.
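The three-stage pipeline named in the abstract (SFT, then KTO, then GRPO) amounts to chaining trainers so each stage starts from the previous stage's checkpoint. The sketch below is a hypothetical illustration of that ordering only; the stage descriptions are paraphrased from the abstract, and the trainer functions are placeholders, not the paper's training code.

```python
# Hypothetical sketch of the three-stage training order (SFT -> KTO -> GRPO).
# Trainer functions are placeholders supplied by the caller.

STAGES = [
    ("SFT",  "supervised fine-tuning on curated agent trajectories"),
    ("KTO",  "preference optimization from success/failure signals"),
    ("GRPO", "reinforcement learning with task rewards"),
]

def run_pipeline(model, datasets, trainers):
    """Run each stage in order; every stage initializes from the
    checkpoint produced by the previous one."""
    for name, _desc in STAGES:
        model = trainers[name](model, datasets[name])
    return model

# Demo with stub trainers that just record which stage ran.
trainers = {name: (lambda n: lambda m, d: m + [n])(name) for name, _ in STAGES}
model = run_pipeline([], {name: None for name, _ in STAGES}, trainers)
```

Bridging imitation (SFT) into reinforcement learning this way is what the abstract describes as making training stable and reproducible.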