PEARL: Personalized Streaming Video Understanding Model
Yuanhong Zheng, Ruichuan An, Xiaopeng Lin, Yuxing Liu, Sihan Yang, Huanyu Zhang, Haodong Li, Qintong Zhang, Renrui Zhang, Guopeng Li, Yifan Zhang, Yuheng Li, Wentao Zhang
2026-03-25
Summary
This paper introduces a new challenge in AI: understanding videos in a way that’s personalized to you, and doing it *as* the video is playing, not just after it’s finished. It’s about making AI assistants more responsive and tailored to individual users in real time.
What's the problem?
Current AI systems that try to understand what you like or what’s happening in a video usually work with still images or pre-recorded videos. They can’t keep up with a continuous stream of video and don’t react to things as they happen. This makes them less useful for things like a personal AI assistant that needs to understand your preferences *right now* while you’re watching something.
What's the solution?
The researchers created a new task called Personalized Streaming Video Understanding (PSVU) and a dataset called PEARL-Bench to help test AI models on this challenge. PEARL-Bench contains 132 videos paired with 2,173 detailed notes about what’s happening at specific moments. They also developed a simple but effective method called PEARL that can be added to existing AI models to improve their performance on this new task. PEARL doesn't require extra training, making it easy to use.
Why it matters?
This work is important because it pushes AI closer to being truly interactive and personalized. If AI can understand videos in real-time and adapt to your preferences, it can create much more helpful and engaging experiences, like a smart assistant that knows what you’re interested in while you’re watching a movie with friends.
Abstract
Human cognition of new concepts is inherently a streaming process: we continuously recognize new objects or identities and update our memories over time. However, current multimodal personalization methods are largely limited to static images or offline videos. This disconnects continuous visual input from instant real-world feedback, limiting their ability to provide the real-time, interactive personalized responses essential for future AI assistants. To bridge this gap, we first propose and formally define the novel task of Personalized Streaming Video Understanding (PSVU). To facilitate research in this new direction, we introduce PEARL-Bench, the first comprehensive benchmark designed specifically to evaluate this challenging setting. It evaluates a model's ability to respond to personalized concepts at exact timestamps under two modes: (1) a Frame-level mode, focusing on a specific person or object in discrete frames, and (2) a novel Video-level mode, focusing on personalized actions unfolding across continuous frames. PEARL-Bench comprises 132 unique videos and 2,173 fine-grained annotations with precise timestamps. Concept diversity and annotation quality are strictly ensured through a combined pipeline of automated generation and human verification. To tackle this challenging new setting, we further propose PEARL, a plug-and-play, training-free strategy that serves as a strong baseline. Extensive evaluations across 8 offline and online models demonstrate that PEARL achieves state-of-the-art performance. Notably, it brings consistent PSVU improvements when applied to 3 distinct architectures, proving to be a highly effective and robust strategy. We hope this work advances vision-language model (VLM) personalization and inspires further research into streaming personalized AI assistants. Code is available at https://github.com/Yuanhong-Zheng/PEARL.
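The two evaluation modes above can be pictured with a toy sketch. This is a minimal illustration only: the record fields, concept token, and timing logic are assumptions for exposition, not PEARL-Bench's actual schema or the PEARL method itself. The key idea it captures is that frame-level queries attach to a single timestamp, while video-level queries span a continuous window of the stream.

```python
from dataclasses import dataclass

# Hypothetical annotation record; field names are illustrative, not the benchmark's schema.
@dataclass
class PSVUAnnotation:
    concept: str      # personalized concept, e.g. a user-specific token like "<my_dog>"
    mode: str         # "frame" (discrete frame) or "video" (continuous action span)
    start_s: float    # timestamp where the concept becomes relevant
    end_s: float      # equal to start_s for frame-level queries
    question: str
    answer: str

def due_queries(annotations, t):
    """Return annotations whose timestamp window covers the current stream time t."""
    return [a for a in annotations if a.start_s <= t <= a.end_s]

anns = [
    PSVUAnnotation("<my_dog>", "frame", 12.0, 12.0, "Is <my_dog> visible now?", "yes"),
    PSVUAnnotation("<my_dog>", "video", 30.0, 42.0, "What is <my_dog> doing?", "fetching a ball"),
]

# At t = 35.0 s, only the video-level query's window is active.
print([a.question for a in due_queries(anns, 35.0)])  # → ['What is <my_dog> doing?']
```

In a real streaming setup, the model would be asked each due question at its exact timestamp, having seen only the frames up to that moment.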