
OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding

JingLi Lin, Chenming Zhu, Runsen Xu, Xiaohan Mao, Xihui Liu, Tai Wang, Jiangmiao Pang

2025-07-11


Summary

This paper introduces OST-Bench, a new benchmark for testing how well multimodal large language models (MLLMs) understand and reason about scenes as an agent actively explores them over time.

What's the problem?

Most existing benchmarks test models on images or videos that are already fully recorded. They don't measure how well models handle new information arriving step by step, or how well they remember and use past observations to make sense of a changing scene — abilities that are essential in real-world settings.

What's the solution?

The authors created OST-Bench, a benchmark built from thousands of real-world scenes with questions that require models to process observations incrementally, as if exploring a place, while remembering and reasoning about what they saw earlier. Testing leading MLLMs on it, they found that these models struggle with complex spatial and temporal reasoning, and tend to forget or ignore important past details as more information accumulates.

Why it matters?

This matters because OST-Bench pinpoints where AI still falls short of human-like understanding of changing visual environments — a capability essential for applications like robots or assistants that must interact with the real world continuously.

Abstract

OST-Bench evaluates multimodal large language models on online spatio-temporal reasoning tasks, highlighting challenges in dynamic spatial understanding and long-term memory retrieval.