Wan Streamer v0.1

Free, Multimodal, Realtime

Key Features

End-to-end Transformer for synchronized speech and video responses.

Models language, audio, and video as both input and output tokens.

Uses block-causal attention for incremental streaming generation.

Targets full-duplex audio-visual interaction: the model keeps perceiving input while it generates output.

Reports around 200 ms model-side response latency.

Supports 25 fps generation with short streaming units.

Avoids a multi-module ASR, LLM, TTS, and renderer pipeline.

Includes an arXiv paper and a directly recorded real-time video.

The model represents language, audio, and video as interleaved input and output tokens coordinated by block-causal attention. Its stack uses causal encoders, causal decoders, low-latency multimodal scheduling, and streaming units as short as 160 ms to support around 25 fps interaction.
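As a rough illustration of the block-causal idea, the sketch below builds an attention mask in which each token can see every token in its own block and in all earlier blocks, but nothing in later blocks. It is not the project's implementation. The function names, the fixed block size, and the toy sequence length are assumptions made for the example; the real model interleaves language, audio, and video tokens, and its block layout is not described here. The second helper only works through the arithmetic of how many video frames fit in one 160 ms streaming unit at 25 fps.

```python
def block_causal_mask(num_tokens, block_size):
    """Return mask[i][j] == True when token i may attend to token j.

    Assumption: tokens are grouped into fixed-size blocks. A token sees its
    whole block plus all earlier blocks, and never a later block.
    """
    return [
        [(j // block_size) <= (i // block_size) for j in range(num_tokens)]
        for i in range(num_tokens)
    ]


def frames_per_unit(unit_ms, fps):
    """Number of video frames covered by one streaming unit."""
    return round(unit_ms * fps / 1000)


# Toy example: 6 tokens in blocks of 2 (hypothetical sizes).
mask = block_causal_mask(num_tokens=6, block_size=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

# 160 ms units at 25 fps, the figures quoted in the text.
print(frames_per_unit(160, 25))  # 4
```

Within a block the mask is fully visible in both directions, which is what distinguishes it from a plain token-level causal mask; generation can then proceed one block at a time as new input arrives.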

Wan Streamer is useful for real-time agents, interactive avatars, multimodal assistants, and research on low-latency full-duplex communication. The project reports roughly 200 ms model-side latency and about 550 ms total interaction latency when network latency is included.
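The two latency figures can be put side by side with a little arithmetic. The sketch below simply subtracts the reported model-side latency from the reported total to estimate the portion spent outside the model. Attributing that remainder to the network and other overhead is an assumption, since the source gives only approximate numbers and no further breakdown; the constant and function names are made up for the example.

```python
# Approximate figures quoted in the text.
MODEL_LATENCY_MS = 200
TOTAL_LATENCY_MS = 550


def non_model_latency_ms(total_ms, model_ms):
    """Latency left over once the model-side portion is removed."""
    return total_ms - model_ms


overhead_ms = non_model_latency_ms(TOTAL_LATENCY_MS, MODEL_LATENCY_MS)
model_share = MODEL_LATENCY_MS / TOTAL_LATENCY_MS

print(overhead_ms)  # 350
print(f"model share of total: {model_share:.0%}")
```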
