Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination

Yunlong Tang, Daiki Shimada, Hang Hua, Chao Huang, Jing Bi, Rogerio Feris, Chenliang Xu

2025-11-24

Summary

This paper introduces a new approach to help AI understand videos that contain a lot of text, like lectures or presentations. It focuses on how humans naturally process this kind of video – by pausing, re-watching, and focusing on specific parts.

What's the problem?

Current AI models struggle with videos that contain a lot of text because they typically look at each frame only once. A single pass isn't enough to properly read and understand the text, leading to mistakes and 'hallucinations' where the AI makes things up. On-screen text is often small and visible only briefly, so grasping it requires repeated inspection, something existing models can't do.

What's the solution?

The researchers developed a system called Video-R4, which mimics how humans visually re-inspect a video. It works by repeatedly selecting important frames, zooming in on key areas that contain text, and re-analyzing those zoomed-in regions. This 'visual rumination' process lets the AI build up its understanding of the video's content over several passes. The researchers also built two datasets to train and test the system, and used reinforcement learning to further improve its performance.
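The select-zoom-re-analyze loop can be sketched in a toy, self-contained form. Here frames are modeled as strings of on-screen text rather than pixels, and the function names (select_frame, zoom, ruminate) are illustrative stand-ins, not the paper's actual API; the real model uses learned policies over images.

```python
# Toy sketch of a visual-rumination loop. Frames are strings of
# on-screen text; the real Video-R4 model operates on pixels.

def select_frame(frames, query_words, seen):
    """Pick the unseen frame sharing the most words with the question,
    or None if no remaining frame overlaps the question at all."""
    candidates = [i for i in range(len(frames)) if i not in seen]
    if not candidates:
        return None
    overlap = lambda i: len(query_words & set(frames[i].lower().split()))
    best = max(candidates, key=overlap)
    return best if overlap(best) > 0 else None

def zoom(frame_text, query_words):
    """'Zoom in': keep only words matching the query or looking salient."""
    words = frame_text.split()
    keep = [w for w in words if w.lower() in query_words or w.istitle()]
    return " ".join(keep)

def ruminate(frames, question, max_steps=3):
    """Iteratively select a frame, zoom into it, and accumulate evidence."""
    query_words = set(question.lower().split())
    seen, evidence = set(), []
    for _ in range(max_steps):
        i = select_frame(frames, query_words, seen)
        if i is None:  # no more informative frames to re-inspect
            break
        seen.add(i)
        evidence.append(zoom(frames[i], query_words))
    return evidence

frames = [
    "Welcome slide",
    "Deadline for project submission is March 14",
    "Questions and answers",
]
print(ruminate(frames, "what is the submission deadline"))
```

The key idea the sketch preserves is that evidence is gathered over multiple targeted passes instead of one uniform look at every frame.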

Why it matters?

This research is important because it significantly improves the ability of AI to understand complex, text-rich videos. This has implications for many applications, such as automatically answering questions about educational videos, understanding presentations, or even analyzing documents shown in videos. It shows that allowing AI to 're-read' and focus on details is a powerful way to improve its reasoning abilities.

Abstract

Understanding text-rich videos requires reading small, transient textual cues that often demand repeated inspection. Yet most video QA models rely on single-pass perception over fixed frames, leading to hallucinations and failures on fine-grained evidence. Inspired by how humans pause, zoom, and re-read critical regions, we introduce Video-R4 (Reinforcing Text-Rich Video Reasoning with Visual Rumination), a video reasoning LMM that performs visual rumination: iteratively selecting frames, zooming into informative regions, re-encoding retrieved pixels, and updating its reasoning state. We construct two datasets with executable rumination trajectories: Video-R4-CoT-17k for supervised practice and Video-R4-RL-30k for reinforcement learning. We propose a multi-stage rumination learning framework that progressively finetunes a 7B LMM to learn atomic and mixing visual operations via SFT and GRPO-based RL. Video-R4-7B achieves state-of-the-art results on M4-ViteVQA and further generalizes to multi-page document QA, slides QA, and generic video QA, demonstrating that iterative rumination is an effective paradigm for pixel-grounded multimodal reasoning.
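The RL stage is GRPO-based. The core of GRPO is a group-relative advantage: for each question the policy samples a group of responses, and each response's reward is normalized against the group's mean and standard deviation, removing the need for a separate value network. A minimal sketch of that computation, with the 0/1 correctness rewards assumed here for illustration:

```python
# Minimal sketch of GRPO's group-relative advantage computation.
# Rewards for a group of sampled responses are standardized within
# the group; responses above the group mean get positive advantage.

def group_relative_advantages(rewards):
    """Return (r_i - mean) / std for each reward in the sampled group."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# e.g. four sampled rumination trajectories scored 0/1 for correctness
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

These advantages then weight the policy-gradient update, so trajectories that ruminate their way to a correct answer are reinforced relative to their group.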