
From Segments to Scenes: Temporal Understanding in Autonomous Driving via Vision-Language Model

Kevin Cannons, Saeed Ranjbar Alvar, Mohammad Asiful Hossain, Ahmad Rezaei, Mohsen Gholami, Alireza Heidarikhazaei, Zhou Weimin, Yong Zhang, Mohammad Akbari

2025-12-08

Summary

This paper addresses the difficulty AI models have in understanding how driving scenes change over time, a problem that persists even in the most advanced models currently available.

What's the problem?

Existing benchmarks for AI video understanding don't target the specific challenges of driving footage, such as anticipating what other vehicles or pedestrians will do. Current AI models struggle to accurately interpret the subtle movements and relationships between objects in a driving scene, making it hard for them to truly 'understand' what is happening over time.

What's the solution?

The researchers created a new benchmark, called TAD, designed specifically to evaluate AI models on driving-related temporal understanding. They tested several existing AI models on this benchmark and found that they performed poorly. To address this, they developed two new techniques, Scene-CoT and TCogMap, that help the AI better understand motion and the overall scene without retraining the models, improving accuracy by up to 17.72%.
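The summary does not spell out how Scene-CoT structures its prompts, but Chain-of-Thought prompting in general means prepending explicit reasoning steps to a question before the model answers. A minimal, hypothetical sketch of that idea for a driving-scene question (the function name and step wording are illustrative, not taken from the paper):

```python
def build_scene_cot_prompt(question: str) -> str:
    """Wrap a temporal-understanding question with step-by-step
    reasoning instructions before sending it to a vision-language model.
    The specific steps below are an illustrative guess, not the paper's
    actual Scene-CoT design."""
    steps = [
        "1. Describe the static scene (road layout, signs, weather).",
        "2. List the dynamic agents (vehicles, pedestrians, cyclists).",
        "3. Track how each agent moves across the video frames.",
        "4. Reason about the temporal order of their actions.",
        "5. Answer the question based on that reasoning.",
    ]
    return (
        "Think step by step:\n"
        + "\n".join(steps)
        + f"\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_scene_cot_prompt(
    "Did the pedestrian cross before the car turned left?"
)
```

Because the technique is training-free, a prompt wrapper like this can be applied to any off-the-shelf VLM without modifying its weights.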

Why it matters?

Improving an AI’s ability to understand time-based events is crucial for self-driving cars to make safe and accurate decisions. This work provides a new way to test and improve these systems, ultimately pushing the field closer to fully autonomous vehicles.

Abstract

Temporal understanding in autonomous driving (AD) remains a significant challenge, even for recent state-of-the-art (SoTA) Vision-Language Models (VLMs). Prior work has introduced datasets and benchmarks aimed at improving temporal reasoning, but these have emphasized other video content, including sports, cooking, and movies. No existing benchmark focuses exclusively on the unique challenges of temporal understanding in ego-centric AD footage. To fill this gap, the Temporal Understanding in Autonomous Driving (TAD) benchmark is presented, which evaluates VLMs' ability to capture the dynamic relationships between actions in AD. TAD comprises nearly 6,000 question-answer (QA) pairs, spanning 7 human-designed tasks. In addition, an evaluation is performed that consists of 9 closed- and open-source generalist models as well as SoTA AD specialist models. When applied to TAD, current SoTA models demonstrated substandard accuracies, largely due to imperfect fine-grained motion understanding. To improve motion understanding and overall accuracy on TAD, two novel training-free solutions are proposed: Scene-CoT, which leverages Chain-of-Thought (CoT), and TCogMap, which incorporates an ego-centric temporal cognitive map. The proposed approaches are integrated with existing VLMs and improve average accuracy on TAD by up to 17.72%. By introducing TAD, benchmarking multiple SoTA models, and proposing effective enhancements, this work aims to catalyze future research on temporal understanding in AD. The benchmark and evaluation code are available at https://huggingface.co/datasets/vbdai/TAD (Hugging Face) and https://github.com/vbdi/tad_bench (GitHub), respectively.