Wan Streamer v0.1

Key Features

End-to-end Transformer for synchronized speech and video responses.
Models language, audio, and video as both input and output tokens.
Uses block-causal attention for incremental streaming generation.
Targets full-duplex audio-visual interaction: it generates output while continuing to perceive input.
Reports around 200 ms model-side response latency.
Supports generation at around 25 fps using short streaming units.
Avoids a multi-module ASR, LLM, TTS, and renderer pipeline.
Accompanied by an arXiv paper and a video recorded directly in real time.

The model represents language, audio, and video as interleaved input and output tokens coordinated by block-causal attention. Its stack combines causal encoders, causal decoders, and low-latency multimodal scheduling, and it generates in streaming units as short as 160 ms to support interaction at around 25 fps.
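As a rough illustration of block-causal attention in general (not Wan Streamer's actual implementation), the sketch below builds an attention mask under the common assumption that a token may attend to every token in its own block and in all earlier blocks, but to nothing in later blocks. The function name `block_causal_mask` and the block size of 2 are hypothetical choices made for this example.

```python
def block_causal_mask(num_tokens: int, block_size: int) -> list[list[bool]]:
    """Return mask[q][k], True when query token q may attend to key token k.

    Assumption for this sketch: tokens are grouped into consecutive blocks of
    `block_size`; attention is unrestricted inside a block and causal across
    blocks.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [
        [k // block_size <= q // block_size for k in range(num_tokens)]
        for q in range(num_tokens)
    ]


if __name__ == "__main__":
    # Six tokens in blocks of two: each row is a query, each column a key.
    for row in block_causal_mask(6, 2):
        print("".join("x" if allowed else "." for allowed in row))
```

With a mask of this shape, each new block can be generated as soon as the previous blocks are available, which is the property that makes incremental streaming possible.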


Wan Streamer is useful for real-time agents, interactive avatars, multimodal assistants, and research on low-latency full-duplex communication. The project reports roughly 200 ms model-side latency and about 550 ms total interaction latency when network latency is included.
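The reported figures can be related with simple arithmetic. The sketch below uses only the numbers quoted in this listing (160 ms units, around 25 fps, roughly 200 ms model-side latency, about 550 ms total). The helper names are hypothetical, and treating the gap between total and model-side latency as network and other non-model overhead is an assumption of this example, not a breakdown given by the project.

```python
# Approximate figures quoted in the listing above.
UNIT_MS = 160            # shortest streaming unit
FPS = 25                 # approximate generation frame rate
MODEL_LATENCY_MS = 200   # reported model-side latency
TOTAL_LATENCY_MS = 550   # reported total latency, network included


def frames_per_unit(unit_ms: int, fps: int) -> int:
    """Number of whole video frames covered by one streaming unit."""
    return unit_ms * fps // 1000


def non_model_latency_ms(total_ms: int, model_ms: int) -> int:
    """Latency left over after the model-side share (assumed: network and other overhead)."""
    return total_ms - model_ms


if __name__ == "__main__":
    print(frames_per_unit(UNIT_MS, FPS))                             # 4
    print(non_model_latency_ms(TOTAL_LATENCY_MS, MODEL_LATENCY_MS))  # 350
```

Under these figures, a 160 ms unit corresponds to about four frames at 25 fps, and roughly 350 ms of the total interaction latency falls outside the model.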
