TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs
Jun Zhang, Teng Wang, Yuying Ge, Yixiao Ge, Xinhao Li, Ying Shan, Limin Wang
2025-12-17
Summary
This paper focuses on improving how well computers understand *when* things happen in videos, a skill called video temporal grounding. It doesn't invent a completely new technique, but instead carefully examines and improves existing methods using large language models that can process both text and video.
What's the problem?
Currently, evaluating how well these models understand timing in videos is unreliable because the datasets used for testing contain annotation errors and inconsistencies. The data used to *train* these models is also often noisy, meaning its timestamps are not very accurate. Together, this makes it hard to know whether a model is truly improving or just exploiting flaws in the test data.
What's the solution?
The researchers created a new, carefully checked evaluation dataset called TimeLens-Bench with more accurate timing annotations. They also developed an automated pipeline to clean up existing training data, resulting in a large, high-quality training set called TimeLens-100K. Beyond the data, they studied how best to design and train the models, settling on a reinforcement learning method that rewards verifiably correct timestamps and a textual way of representing time inside the model. These efforts culminate in the TimeLens family of models.
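The "training method that rewards correct timing" refers to reinforcement learning with verifiable rewards (RLVR): unlike subjective quality judgments, a predicted time span can be checked mechanically against the ground-truth span. A minimal sketch of such a verifiable reward is the temporal intersection-over-union (IoU) between the predicted and annotated intervals. The function names and the choice of raw IoU as the reward signal are illustrative assumptions, not the paper's exact design:

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) time intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def grounding_reward(pred, gt):
    """Verifiable reward for RLVR (hypothetical: dense IoU signal).

    The model predicts a span for a text query; the reward is fully
    determined by the annotation, so no learned judge is needed.
    """
    return temporal_iou(pred, gt)
```

For example, a prediction of (2.0, 8.0) against a ground truth of (3.0, 9.0) overlaps for 5 seconds over a 7-second union, giving a reward of about 0.71. A reward like this is "verifiable" because any two parties computing it from the same annotation get the same number.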
Why it matters?
These improvements allow computers to locate events in videos much more accurately in time, even surpassing leading proprietary models such as GPT-5 and Gemini-2.5-Flash. By releasing their data, code, and models, the researchers are helping to advance the field of video understanding and making it easier for others to build on their work.
Abstract
This paper does not introduce a novel method but instead establishes a straightforward, incremental, yet essential baseline for video temporal grounding (VTG), a core capability in video understanding. While multimodal large language models (MLLMs) excel at various video understanding tasks, the recipes for optimizing them for VTG remain under-explored. In this paper, we present TimeLens, a systematic investigation into building MLLMs with strong VTG ability, along two primary dimensions: data quality and algorithmic design. We first expose critical quality issues in existing VTG benchmarks and introduce TimeLens-Bench, comprising meticulously re-annotated versions of three popular benchmarks with strict quality criteria. Our analysis reveals dramatic model re-rankings compared to legacy benchmarks, confirming the unreliability of prior evaluation standards. We also address noisy training data through an automated re-annotation pipeline, yielding TimeLens-100K, a large-scale, high-quality training dataset. Building on our data foundation, we conduct in-depth explorations of algorithmic design principles, yielding a series of meaningful insights and effective yet efficient practices. These include interleaved textual encoding for time representation, a thinking-free reinforcement learning with verifiable rewards (RLVR) approach as the training paradigm, and carefully designed recipes for RLVR training. These efforts culminate in TimeLens models, a family of MLLMs that achieve state-of-the-art VTG performance among open-source models and even surpass proprietary models such as GPT-5 and Gemini-2.5-Flash. All code, data, and models will be released to facilitate future research.
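The abstract's "interleaved textual encoding for time representation" can be pictured as weaving plain-text timestamps between the visual tokens of sampled frames, so the model reads time the same way it reads any other text. The sketch below is an illustrative guess at such a format; the placeholder `<frame_i>` tokens and the exact timestamp formatting are assumptions, not the paper's specification:

```python
def interleave_timestamps(num_frames, fps):
    """Build a model input that interleaves a textual timestamp
    before each frame placeholder (illustrative format only)."""
    parts = []
    for i in range(num_frames):
        t = i / fps  # wall-clock time of the sampled frame
        parts.append(f"{t:.1f}s")      # time rendered as ordinary text tokens
        parts.append(f"<frame_{i}>")   # stands in for the frame's visual tokens
    return " ".join(parts)
```

Encoding time as interleaved text, rather than as special learned embeddings, lets the model answer grounding queries by emitting ordinary timestamp strings, which is also what makes the RLVR rewards straightforward to verify.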