The model represents language, audio, and video as interleaved input and output tokens coordinated by block-causal attention. Its stack combines causal encoders and causal decoders with low-latency multimodal scheduling, and it processes streaming units as short as 160 ms to support interaction at around 25 fps.
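To make the attention pattern concrete, here is a minimal sketch of a block-causal mask in plain Python. It assumes the common definition: a token may attend to every token in its own block and in all earlier blocks, but to none in later blocks. The function name `block_causal_mask`, the mapping of one block to one streaming unit, and the two-tokens-per-block layout are illustrative assumptions, not details taken from the project.

```python
def block_causal_mask(block_ids):
    """Return mask[i][j] == True when token i may attend to token j.

    Assumed rule: attention is allowed within a token's own block and
    toward all earlier blocks, never toward later blocks.
    """
    return [[bj <= bi for bj in block_ids] for bi in block_ids]


# Hypothetical layout: three streaming units (blocks) of two tokens each.
block_ids = [0, 0, 1, 1, 2, 2]
mask = block_causal_mask(block_ids)

# Print the mask: "x" means attention is allowed, "." means it is blocked.
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Unlike a strictly token-level causal mask, the first token of a block can see the second token of the same block, while everything in later blocks stays hidden. This is what allows a streaming unit to be processed as a whole as soon as it arrives.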
Wan Streamer is useful for real-time agents, interactive avatars, multimodal assistants, and research on low-latency full-duplex communication. The project reports roughly 200 ms model-side latency and about 550 ms total interaction latency when network latency is included.
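The quoted figures can be related with some simple arithmetic. The sketch below uses only the approximate numbers stated above. The constant names are illustrative, and reading the 160 ms unit as a span of video frames at 25 fps is an assumption. The project does not break down the non-model portion of the latency, so the remainder covers network transfer plus any other overhead.

```python
# Approximate figures quoted in the text above.
UNIT_MS = 160            # shortest streaming unit
FPS = 25                 # target interaction frame rate
MODEL_LATENCY_MS = 200   # reported model-side latency
TOTAL_LATENCY_MS = 550   # reported total latency, network included

# Video frames covered by one streaming unit at the target frame rate.
frames_per_unit = UNIT_MS * FPS // 1000

# Latency left over for the network and any other non-model overhead.
non_model_ms = TOTAL_LATENCY_MS - MODEL_LATENCY_MS

print(frames_per_unit)  # 4
print(non_model_ms)     # 350
```

Under these assumptions, one 160 ms unit corresponds to 4 frames at 25 fps, and roughly 350 ms of the end-to-end budget lies outside the model.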


