TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning
Xingjian Zhang, Siwei Wen, Wenjun Wu, Lei Huang
2025-04-15
Summary
This paper introduces TinyLLaVA-Video-R1, a small AI model designed to understand videos and answer questions about them. Even though it is much smaller than other models, it can still reason through tricky parts of videos and show sudden bursts of insight, much like when a person has an 'aha moment.'
What's the problem?
The problem is that most video reasoning AI models are huge and require a lot of computing power, which makes them hard for most people to use and impractical on regular devices. Smaller models usually can't keep up when it comes to understanding or reasoning about what's happening in a video.
What's the solution?
The researchers trained TinyLLaVA-Video-R1 with reinforcement learning on a wide variety of general video question-answering datasets. This training helped the smaller model learn to reason step by step and even show moments where it suddenly grasps something important in a video, just as bigger models can.
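To make the reinforcement-learning idea concrete: R1-style training typically scores each model response with simple rule-based rewards, one for following the expected output format and one for answering correctly. The sketch below is a hypothetical illustration of that scoring step, not the paper's actual code; the <answer>...</answer> tag format and the exact reward values are assumptions.

```python
# Hypothetical sketch of an R1-style rule-based reward for video QA.
# Assumption: the model is prompted to put its final answer inside
# <answer>...</answer> tags (illustrative, not the paper's exact format).
import re

def format_reward(response: str) -> float:
    """1.0 if the response contains exactly one <answer>...</answer> block."""
    blocks = re.findall(r"<answer>.*?</answer>", response, re.S)
    return 1.0 if len(blocks) == 1 else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    """1.0 if the extracted answer matches the ground-truth choice."""
    m = re.search(r"<answer>(.*?)</answer>", response, re.S)
    if m is None:
        return 0.0
    return 1.0 if m.group(1).strip().lower() == ground_truth.strip().lower() else 0.0

def total_reward(response: str, ground_truth: str) -> float:
    """Combined score used to rank sampled responses during RL training."""
    return format_reward(response) + accuracy_reward(response, ground_truth)
```

During training, the model samples several answers per question, each gets a reward like this, and the policy is nudged toward the higher-scoring ones.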
Why it matters?
This work matters because it shows that you don't always need a massive, expensive AI to get good results with video understanding. TinyLLaVA-Video-R1 makes it possible for more people and devices to use smart video reasoning, which could help with education, accessibility, and making technology more available to everyone.
Abstract
TinyLLaVA-Video-R1, a small-scale video reasoning model, demonstrates improved reasoning capabilities and exhibits "aha moments" after reinforcement learning on general Video-QA datasets.