A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos

Mohammed Irfan Kurpath, Jaseel Muhammad Kaithakkodan, Jinxing Zhou, Sahal Shaji Mullappilly, Mohammad Almansoori, Noor Ahsan, Beknur Kalmakhanbet, Sambal Shikhar, Rishabh Lalla, Jean Lahoud, Mariette Awad, Fahad Shahbaz Khan, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal

2025-12-22

Summary

This paper introduces a new way to test how well artificial intelligence understands long videos, considering what's happening visually, what's being said, and the background sounds all at once.

What's the problem?

Current tests for video understanding either focus on very long videos but don't include many different kinds of information, like speech and sound, or they include lots of different kinds of information but aren't long enough to really test understanding over time. Also, most tests just give a single score, making it hard to see *why* an AI fails at a task. Basically, we need a better way to evaluate an AI's ability to truly 'watch' and 'listen' to videos like humans do.

What's the solution?

The researchers created a new benchmark called LongShOTBench. This benchmark includes complex, open-ended questions about videos, single- and multi-turn conversations with the AI about the video, and tasks that require the AI to use tools to figure things out from both the video and its audio. They also built an AI system, LongShOTAgent, that tries to tackle these challenges by breaking long videos into smaller parts, searching for the relevant ones, and refining its understanding step by step. All the videos and questions are checked by people to make sure they're accurate and consistent.
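The chunk-search-refine idea can be pictured in a few lines of code. This is a purely illustrative sketch, not the paper's actual implementation: every function name and threshold here is a hypothetical stand-in for the real model-driven steps.

```python
# Hypothetical sketch of an agentic long-video loop in the spirit of
# LongShOTAgent: chunk the video, search for relevant segments, then
# iteratively refine an answer. All names are illustrative assumptions.

def chunk_video(duration_s, window_s=60):
    """Split a long video into fixed-length (start, end) time windows."""
    return [(t, min(t + window_s, duration_s))
            for t in range(0, duration_s, window_s)]

def search_segments(chunks, relevance, threshold=0.5):
    """Keep only chunks whose (precomputed) relevance passes a threshold.

    In the real system, relevance would come from a model query,
    not a hard-coded list."""
    return [c for c, r in zip(chunks, relevance) if r >= threshold]

def refine_answer(segments, max_rounds=3):
    """Iteratively fold evidence from the selected segments into the answer,
    stopping once a round adds nothing new (a stand-in for convergence)."""
    answer = ""
    for _ in range(max_rounds):
        updated = " ".join(f"[{s}-{e}s]" for s, e in segments)
        if updated == answer:  # converged: no new evidence this round
            break
        answer = updated
    return answer

chunks = chunk_video(duration_s=180, window_s=60)            # three 1-min windows
relevant = search_segments(chunks, relevance=[0.2, 0.9, 0.7])
print(refine_answer(relevant))                               # cites the 2 kept windows
```

The point of the structure, rather than the toy details, is that the agent never feeds the whole video to the model at once; it narrows down to promising segments first and only then reasons over them repeatedly.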

Why it matters?

This work is important because it shows that even the most advanced AI models still struggle with understanding long, complex videos. LongShOTBench provides a realistic and repeatable way to measure progress in this area and helps researchers build better AI systems that can truly understand the world around them through video.

Abstract

Long-form multimodal video understanding requires integrating vision, speech, and ambient audio with coherent long-range reasoning. Existing benchmarks emphasize either temporal length or multimodal richness, but rarely both; and while some incorporate open-ended questions and advanced metrics, they mostly rely on single-score accuracy, obscuring failure modes. We introduce LongShOTBench, a diagnostic benchmark with open-ended, intent-driven questions; single- and multi-turn dialogues; and tasks requiring multimodal reasoning and agentic tool use across video, audio, and speech. Each item includes a reference answer and graded rubric for interpretable and traceable evaluation. LongShOTBench is produced via a scalable, human-validated pipeline to ensure coverage and reproducibility. All samples in our LongShOTBench are human-verified and corrected. Furthermore, we present LongShOTAgent, an agentic system that analyzes long videos via preprocessing, search, and iterative refinement. On LongShOTBench, state-of-the-art MLLMs show large gaps: Gemini-2.5-Flash achieves 52.95%, open-source models remain below 30%, and LongShOTAgent attains 44.66%. These results underscore the difficulty of real-world long-form video understanding. LongShOTBench provides a practical, reproducible foundation for evaluating and improving MLLMs. All resources are available on GitHub: https://github.com/mbzuai-oryx/longshot.
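To see why a graded rubric is more diagnostic than single-score accuracy, here is a minimal sketch of rubric-based grading. The criterion names and checks are invented for illustration; the benchmark's real rubrics are far richer and judged by a model or a human, not by string matching.

```python
# Hypothetical sketch of rubric-graded evaluation: each answer is checked
# against named criteria, so a failure is traceable to a specific skill
# (e.g. missing an audio cue) instead of vanishing into one accuracy number.

def grade(answer, rubric):
    """Return per-criterion pass/fail and an aggregate score in [0, 1]."""
    results = {name: bool(check(answer)) for name, check in rubric.items()}
    score = sum(results.values()) / len(results)
    return results, score

# Toy rubric for one question; these checks are illustrative assumptions.
rubric = {
    "mentions_speaker": lambda a: "speaker" in a,
    "cites_timestamp":  lambda a: ":" in a,
    "uses_audio_cue":   lambda a: "applause" in a,
}

results, score = grade("The speaker pauses at 01:32.", rubric)
print(results)            # shows exactly which criterion failed
print(round(score, 2))    # partial credit: 2 of 3 criteria met
```

A breakdown like this is what lets the benchmark report *why* a model fell short on an item, which a single aggregate percentage cannot.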