
WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning

Woongyeong Yeo, Kangsan Kim, Jaehong Yoon, Sung Ju Hwang

2025-12-03

Summary

This paper introduces a new system called WorldMM that helps computers understand really long videos, like those lasting hours or even days, by remembering important details and combining text descriptions of events with the actual visual content.

What's the problem?

Current AI models are good at understanding short video clips, but they struggle with longer videos because they can't hold everything that happens in memory and often lose important visual details when summarizing. Existing methods try to fix this with text summaries of video segments, but they rely too heavily on text instead of the actual images, and they search at fixed time scales, so events that span variable durations fall through the cracks.

What's the solution?

The researchers created WorldMM, which gives the AI three complementary types of memory. One memory stores specific events at different time scales (episodic), another keeps track of general concepts that build up over the video (semantic), and the third preserves detailed visual information about scenes (visual). When asked a question about a video, WorldMM smartly decides which memory to consult and at what level of temporal detail, repeating its search until it judges that it has gathered enough information to answer correctly.
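To make the retrieval idea concrete, here is a minimal sketch of an iterative loop over three memory banks. This is purely illustrative: every class, function, and scoring rule below is a hypothetical stand-in (the paper's agent uses a language model to pick sources and decide when to stop, not word overlap).

```python
from dataclasses import dataclass

# Hypothetical sketch of WorldMM-style adaptive retrieval.
# All names and the toy relevance score are illustrative, not from the paper.

@dataclass
class MemoryBank:
    episodic: dict   # events indexed by temporal scale, e.g. {"minute": [...]}
    semantic: list   # high-level conceptual knowledge, updated over the video
    visual: dict     # detailed scene descriptions keyed by timestamp

def score_relevance(query: str, entry: str) -> int:
    # Toy relevance: count words shared between the query and an entry.
    return len(set(query.lower().split()) & set(entry.lower().split()))

def retrieve(bank: MemoryBank, query: str, max_steps: int = 3) -> list:
    """Iteratively pull the most relevant entry from any memory source,
    stopping when no source offers useful new evidence (a stand-in for
    the agent's learned 'sufficient information' decision)."""
    sources = {
        "episodic": [e for events in bank.episodic.values() for e in events],
        "semantic": list(bank.semantic),
        "visual": list(bank.visual.values()),
    }
    evidence = []
    for _ in range(max_steps):
        best_entry, best_score = None, 0
        for entries in sources.values():
            for entry in entries:
                if entry in evidence:
                    continue  # don't re-retrieve gathered evidence
                s = score_relevance(query, entry)
                if s > best_score:
                    best_entry, best_score = entry, s
        if best_entry is None:
            break  # nothing relevant left: stop searching
        evidence.append(best_entry)
    return evidence

bank = MemoryBank(
    episodic={"minute": ["the chef chops onions", "a dog runs past"]},
    semantic=["this is a cooking tutorial"],
    visual={"00:05": "close-up of onions on a cutting board"},
)
print(retrieve(bank, "what does the chef prepare"))
```

The key design point this mirrors is that retrieval is a loop with a stopping condition, not a single fixed-scale lookup: the agent keeps choosing among sources until further searching stops paying off.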

Why it matters?

This work is important because it significantly improves the ability of AI to understand long videos, leading to better performance on tasks like answering questions about what's happening. It's a step towards AI that can truly 'watch' and comprehend complex visual stories, outperforming previous state-of-the-art methods by an average of 8.4% across five benchmarks.

Abstract

Recent advances in video large language models have demonstrated strong capabilities in understanding short clips. However, scaling them to hours- or days-long videos remains highly challenging due to limited context capacity and the loss of critical visual details during abstraction. Existing memory-augmented methods mitigate this by leveraging textual summaries of video segments, yet they heavily rely on text and fail to utilize visual evidence when reasoning over complex scenes. Moreover, retrieving from fixed temporal scales further limits their flexibility in capturing events that span variable durations. To address this, we introduce WorldMM, a novel multimodal memory agent that constructs and retrieves from multiple complementary memories, encompassing both textual and visual representations. WorldMM comprises three types of memory: episodic memory indexes factual events across multiple temporal scales, semantic memory continuously updates high-level conceptual knowledge, and visual memory preserves detailed information about scenes. During inference, an adaptive retrieval agent iteratively selects the most relevant memory source and leverages multiple temporal granularities based on the query, continuing until it determines that sufficient information has been gathered. WorldMM significantly outperforms existing baselines across five long video question-answering benchmarks, achieving an average 8.4% performance gain over previous state-of-the-art methods, showing its effectiveness on long video reasoning.