AURA: Always-On Understanding and Real-Time Assistance via Video Streams

Xudong Lu, Yang Bo, Jinpeng Chen, Shuhan Li, Xintong Guo, Huankang Guan, Fang Liu, Dunyuan Xu, Peiwen Sun, Heyang Sun, Rui Liu, Hongsheng Li

2026-04-07

Summary

This paper introduces AURA, a system that lets an AI model continuously understand live video streams and respond in real time, both answering user questions and proactively offering assistance.

What's the problem?

Current video understanding AI models, called VideoLLMs, are generally built to analyze pre-recorded videos. They struggle with live video because they cannot continuously watch and respond in real time. Existing attempts at 'streaming' VideoLLMs either rely on separate trigger-and-response steps that react slowly to events, or are limited to captioning-style narration of what is happening, making it hard to hold a real conversation or get detailed answers about the video.

What's the solution?

The researchers created AURA, an end-to-end framework for processing live video. It continuously analyzes the stream, maintains a memory of what has happened so far, and can both answer questions on demand and proactively offer information. To keep the system stable over long sessions, they combined context management, purpose-built training data, tailored training objectives, and deployment optimizations, allowing AURA to process video at 2 frames per second on two 80 GB accelerators.
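To make the always-on idea concrete, here is a minimal toy sketch of such a loop: frames arrive at a fixed rate, a bounded rolling context serves as the system's memory, and a placeholder stands in for the model's answer step. All names and the context policy here are illustrative assumptions, not AURA's actual implementation, which is not detailed in this summary.

```python
from collections import deque

class StreamingAssistant:
    """Toy sketch of an always-on video assistant (hypothetical design)."""

    def __init__(self, fps=2, context_frames=64):
        # 2 FPS -> one frame every 0.5 seconds
        self.frame_interval = 1.0 / fps
        # Rolling memory: only the most recent frames are kept,
        # so memory use stays bounded during long-horizon streaming.
        self.context = deque(maxlen=context_frames)

    def ingest(self, frame):
        """Add an incoming frame; old frames are evicted automatically."""
        self.context.append(frame)

    def answer(self, question):
        """Placeholder for the VideoLLM call: a real system would condition
        a language model on the cached frames plus the question."""
        return f"Answering {question!r} using {len(self.context)} cached frames"

# Simulate a short stream: 100 frames arrive, but only 64 are retained.
assistant = StreamingAssistant(fps=2, context_frames=64)
for i in range(100):
    assistant.ingest(f"frame-{i}")
print(assistant.answer("What just happened?"))
```

The bounded deque illustrates one simple way to trade off memory growth against how far back the assistant can "remember"; real streaming systems typically use more sophisticated context compression.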

Why it matters?

This work matters because it moves us closer to AI assistants that can understand the world around them in real time. Imagine an AI that watches a live sports game and answers your questions about what is happening, or helps someone navigate a new environment by describing what it sees. AURA provides a foundation for building these kinds of interactive assistants, and the researchers are releasing their model and inference tools to encourage further development in this area.

Abstract

Video Large Language Models (VideoLLMs) have achieved strong performance on many video understanding tasks, but most existing systems remain offline and are not well-suited for live video streams that require continuous observation and timely response. Recent streaming VideoLLMs have made progress, yet current approaches often rely on decoupled trigger-response pipelines or are limited to captioning-style narration, reducing their effectiveness for open-ended question answering and long-horizon interaction. We propose AURA (Always-On Understanding and Real-Time Assistance), an end-to-end streaming visual interaction framework that enables a unified VideoLLM to continuously process video streams and support both real-time question answering and proactive responses. AURA integrates context management, data construction, training objectives, and deployment optimization for stable long-horizon streaming interaction. It achieves state-of-the-art performance on streaming benchmarks and supports a real-time demo system with ASR and TTS running at 2 FPS on two 80G accelerators. We release the AURA model together with a real-time inference framework to facilitate future research.