Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs

Minji Kim, Taekyung Kim, Bohyung Han

2025-10-27

Summary

This research paper investigates how Video Large Language Models, which are AI systems that can understand and answer questions about videos, actually work internally. It's about figuring out *where* and *how* these models process information from both the video and the text of a question.

What's the problem?

While VideoLLMs are getting better at tasks like answering questions about videos, we don't really understand *how* they're doing it. It's like knowing a student got the right answer on a test, but not knowing what steps they took to get there. This lack of understanding makes it hard to improve these models or trust their decisions.

What's the solution?

The researchers used a technique called 'mechanistic interpretability' to peek inside the VideoLLMs. They tracked how information flows through the different layers of the model while it processes a video and a question. They found a consistent pattern: first, in the early-to-middle layers, the model looks for important changes happening *across* different frames of the video. Then, in the middle layers, it combines this video information with the meaning of the words in the question, especially words related to time. Finally, after integrating everything, the model is ready to generate the correct answer in its middle-to-late layers. They also discovered that the models need only a small fraction of their attention connections to perform well: in one model (LLaVA-NeXT-7B-Video-FT), 58% of the attention edges could be suppressed without hurting performance, meaning a lot of the complexity isn't actually needed.
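The idea of 'suppressing attention edges' can be pictured with a toy example. The sketch below is a minimal, hypothetical illustration (not the paper's actual code or models): it implements a tiny scaled-dot-product attention in NumPy, knocks out a single attention edge (one query position attending to one key position), and checks that only the affected position's output changes.

```python
import numpy as np

def attention(q, k, v, edge_mask=None):
    """Scaled dot-product attention.

    edge_mask[i, j] = False suppresses the attention edge from
    query position i to key position j (a 'knockout').
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if edge_mask is not None:
        scores = np.where(edge_mask, scores, -1e9)  # effectively -inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
T, d = 6, 8  # toy sequence length and head dimension
q, k, v = rng.normal(size=(3, T, d))

full = attention(q, k, v)

# Knock out one edge: stop query position 5 from attending to key position 2.
mask = np.ones((T, T), dtype=bool)
mask[5, 2] = False
knocked = attention(q, k, v, edge_mask=mask)

# Only the output at the masked query position is affected.
unchanged = np.allclose(full[:5], knocked[:5])
changed = not np.allclose(full[5], knocked[5])
```

In the paper's setting this kind of knockout is applied at scale to find which attention edges a VideoLLM actually relies on; here the point is only the mechanics of masking one edge and measuring its effect.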

Why it matters?

This work is important because it gives us a 'blueprint' for how VideoLLMs think about time and video content. By understanding these internal processes, we can build better, more reliable, and easier-to-understand AI systems for video analysis. It also suggests ways to simplify these models without losing performance, making them more efficient.

Abstract

Video Large Language Models (VideoLLMs) extend the capabilities of vision-language models to spatiotemporal inputs, enabling tasks such as video question answering (VideoQA). Despite recent advances in VideoLLMs, their internal mechanisms on where and how they extract and propagate video and textual information remain less explored. In this study, we investigate the internal information flow of VideoLLMs using mechanistic interpretability techniques. Our analysis reveals consistent patterns across diverse VideoQA tasks: (1) temporal reasoning in VideoLLMs initiates with active cross-frame interactions in early-to-middle layers, (2) followed by progressive video-language integration in middle layers. This is facilitated by alignment between video representations and linguistic embeddings containing temporal concepts. (3) Upon completion of this integration, the model is ready to generate correct answers in middle-to-late layers. (4) Based on our analysis, we show that VideoLLMs can retain their VideoQA performance by selecting these effective information pathways while suppressing a substantial amount of attention edges, e.g., 58% in LLaVA-NeXT-7B-Video-FT. These findings provide a blueprint on how VideoLLMs perform temporal reasoning and offer practical insights for improving model interpretability and downstream generalization. Our project page with the source code is available at https://map-the-flow.github.io