StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos

Daeun Lee, Subhojyoti Mukherjee, Branislav Kveton, Ryan A. Rossi, Viet Dac Lai, Seunghyun Yoon, Trung Bui, Franck Dernoncourt, Mohit Bansal

2025-12-02

Summary

This paper introduces a new way to test how well artificial intelligence models understand videos in real-time, specifically focusing on how they can use eye-tracking data to predict what a user is looking at and what they might do next.

What's the problem?

Current AI models are good at understanding pre-recorded videos as complete clips, but they struggle with videos as they *stream* in, like when you're watching a live feed or using augmented reality glasses. Also, no one has really tested whether these models can use information about where a person is looking (their gaze) to better understand a video and even anticipate what the person will be interested in next. Existing benchmarks don't measure how well AI can interpret and use gaze signals in a streaming video context.

What's the solution?

The researchers created a new benchmark called StreamGaze. This benchmark includes tasks that challenge AI models to use real-time gaze data to follow a user’s attention, understand their intentions, and even predict what they’ll look at in the future, all while processing a video as it streams. They built a system to create questions and answers about videos that are linked to where a person was actually looking, making the test very realistic. They then tested several state-of-the-art AI models on this benchmark.
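To make the pipeline concrete: the paper's abstract mentions "fixation extraction" as the first step in aligning egocentric video with raw gaze trajectories. A standard way to do this (the paper does not specify its exact method, so this is an illustrative sketch, not the authors' implementation) is dispersion-threshold identification: group consecutive gaze samples into a fixation whenever they stay within a small spatial window for long enough. The function name and thresholds below are hypothetical.

```python
# Illustrative sketch of dispersion-based fixation extraction (I-DT style),
# one common way to turn raw gaze samples into fixations. The actual
# StreamGaze pipeline details are not specified in this summary.

def extract_fixations(gaze, min_duration=5, max_dispersion=0.05):
    """Group consecutive gaze samples into fixations.

    gaze: list of (x, y) samples in normalized [0, 1] screen coordinates,
          recorded at a fixed sampling rate.
    min_duration: minimum number of samples a fixation must span.
    max_dispersion: maximum (x-range + y-range) allowed within a fixation.
    Returns a list of (start_idx, end_idx, centroid_x, centroid_y).
    """
    fixations = []
    i, n = 0, len(gaze)
    while i < n:
        j = i + min_duration
        if j > n:
            break  # not enough samples left for a fixation
        xs = [p[0] for p in gaze[i:j]]
        ys = [p[1] for p in gaze[i:j]]
        if (max(xs) - min(xs)) + (max(ys) - min(ys)) <= max_dispersion:
            # Grow the window while the dispersion stays under threshold.
            while j < n:
                xs2, ys2 = xs + [gaze[j][0]], ys + [gaze[j][1]]
                if (max(xs2) - min(xs2)) + (max(ys2) - min(ys2)) > max_dispersion:
                    break
                xs, ys = xs2, ys2
                j += 1
            fixations.append((i, j, sum(xs) / len(xs), sum(ys) / len(ys)))
            i = j  # continue after this fixation
        else:
            i += 1  # saccade sample; slide the window forward
    return fixations
```

Ordering the resulting fixation centroids in time gives a scanpath, which is roughly the structure the paper's "scanpath construction" step would produce before grounding QA pairs in it.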

Why it matters?

This work is important because it highlights that current AI models aren't very good at understanding videos the way humans do, especially when real-time streaming and eye movements are involved. Improving this ability is crucial for applications like AR glasses, where the device needs to understand what you're looking at to provide relevant information or assistance. The benchmark and data, which will be publicly released, will help researchers develop AI models that can better understand and interact with the world around them.

Abstract

Streaming video understanding requires models not only to process temporally incoming frames, but also to anticipate user intention for realistic applications like AR glasses. While prior streaming benchmarks evaluate temporal reasoning, none measure whether MLLMs can interpret or leverage human gaze signals within a streaming setting. To fill this gap, we introduce StreamGaze, the first benchmark designed to evaluate how effectively MLLMs use gaze for temporal and proactive reasoning in streaming videos. StreamGaze introduces gaze-guided past, present, and proactive tasks that comprehensively evaluate streaming video understanding. These tasks assess whether models can use real-time gaze to follow shifting attention and infer user intentions from only past and currently observed frames. To build StreamGaze, we develop a gaze-video QA generation pipeline that aligns egocentric videos with raw gaze trajectories via fixation extraction, region-specific visual prompting, and scanpath construction. This pipeline produces spatio-temporally grounded QA pairs that closely reflect human perceptual dynamics. Across all StreamGaze tasks, we observe substantial performance gaps between state-of-the-art MLLMs and human performance, revealing fundamental limitations in gaze-based temporal reasoning, intention modeling, and proactive prediction. We further provide detailed analyses of gaze-prompting strategies, reasoning behaviors, and task-specific failure modes, offering deeper insight into why current MLLMs struggle and what capabilities future models must develop. All data and code will be publicly released to support continued research in gaze-guided streaming video understanding.