Qwen3.5-Omni Technical Report
Qwen Team
2026-04-20
Summary
This paper introduces Qwen3.5-Omni, a new and improved artificial intelligence model that can understand and process information from text, images, and audio all at once. It's a big step forward in creating AI that interacts with the world more like humans do.
What's the problem?
Existing AI models often struggle to seamlessly combine different types of information, such as understanding what's being said in a video or producing realistic-sounding speech. Generating continuous, natural-sounding speech is especially hard because text and speech tokenizers encode at different rates, so the two streams drift out of alignment and the output becomes choppy or unnatural. Many models also handle multiple languages and complex audio-visual tasks poorly.
What's the solution?
The researchers built Qwen3.5-Omni, a very large model trained on a huge amount of diverse data, including text, images, and over 100 million hours of audio-visual content. A technique called a 'Hybrid Attention Mixture-of-Experts' helps the model process long sequences of information efficiently. To improve speech generation, they developed ARIA, a system that dynamically aligns text and speech units so that streamed conversations sound smoother and more natural. The model also supports 10 languages and can even generate code from audio and visual instructions.
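The alignment problem ARIA addresses can be pictured with a small scheduling loop. This is a minimal sketch under assumed mechanics, not the paper's ARIA implementation: the fixed expansion-ratio estimate and the `max_lag` burst cap are hypothetical stand-ins for whatever the model actually predicts.

```python
# Illustrative sketch (NOT the paper's ARIA): interleaving text-token
# consumption with speech-token emission in a streaming TTS loop, so the
# speech stream stays paced with the text stream even though the two
# tokenizers encode at different rates.

def stream_speech(text_tokens, est_speech_per_text, max_lag=4):
    """Return a schedule of ("read", text_token) and ("emit", n) steps.

    est_speech_per_text: assumed running estimate of how many speech
    tokens one text token expands to (a real system would predict this
    dynamically from its alignment state).
    max_lag: cap on how many speech tokens are emitted in one burst,
    keeping per-step latency bounded.
    """
    schedule = []
    speech_budget = 0.0  # speech tokens "owed" for text read so far
    for tok in text_tokens:
        schedule.append(("read", tok))
        speech_budget += est_speech_per_text
        # Emit whole speech tokens, capped per step to limit latency.
        n = min(int(speech_budget), max_lag)
        if n > 0:
            schedule.append(("emit", n))
            speech_budget -= n
    # Flush any remaining fractional budget at end of stream.
    if speech_budget > 0:
        schedule.append(("emit", int(round(speech_budget))))
    return schedule
```

With an estimated 2.5 speech tokens per text token, the scheduler alternates reads and emits while keeping the owed-speech backlog small, which is the kind of pacing the summary describes as "smoother, more natural-sounding conversations."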
Why it matters?
Qwen3.5-Omni represents a significant advancement in AI: it surpasses other models, such as Gemini-3.1 Pro, on key audio tasks and matches them in comprehensive audio-visual understanding. It opens up possibilities for more natural and intuitive human-computer interaction, like AI assistants that can truly understand and respond to the world around them. The ability to generate code from audio-visual input is a newly observed capability that could change how we interact with technology.
Abstract
In this work, we present Qwen3.5-Omni, the latest advancement in the Qwen-Omni model family. Representing a significant evolution over its predecessor, Qwen3.5-Omni scales to hundreds of billions of parameters and supports a 256k context length. By leveraging a massive dataset comprising heterogeneous text-vision pairs and over 100 million hours of audio-visual content, the model demonstrates robust omni-modality capabilities. Qwen3.5-Omni-plus achieves state-of-the-art (SOTA) results across 215 audio and audio-visual understanding, reasoning, and interaction subtasks and benchmarks, surpassing Gemini-3.1 Pro in key audio tasks and matching it in comprehensive audio-visual understanding. Architecturally, Qwen3.5-Omni employs a Hybrid Attention Mixture-of-Experts (MoE) framework for both Thinker and Talker, enabling efficient long-sequence inference. The model facilitates sophisticated interaction, supporting over 10 hours of audio understanding and 400 seconds of 720p video (at 1 FPS). To address the inherent instability and unnaturalness in streaming speech synthesis, often caused by encoding-efficiency discrepancies between text and speech tokenizers, we introduce ARIA, which dynamically aligns text and speech units, significantly enhancing the stability and prosody of conversational speech with minimal latency impact. Furthermore, Qwen3.5-Omni expands linguistic boundaries, supporting multilingual understanding and speech generation across 10 languages with human-like emotional nuance. Finally, Qwen3.5-Omni exhibits superior audio-visual grounding capabilities, generating script-level structured captions with precise temporal synchronization and automated scene segmentation. Remarkably, we observed the emergence of a new capability in omnimodal models: directly performing coding based on audio-visual instructions, which we call Audio-Visual Vibe Coding.
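The abstract does not detail the Hybrid Attention MoE configuration, so as background only, here is a generic top-k gating step of the kind MoE layers use to activate a few experts per token. The expert count, k, and softmax gating here are assumptions for illustration, not the report's design.

```python
# Generic top-k MoE routing sketch (illustrative; expert count, k, and
# the softmax gate are assumptions, not the report's configuration).
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def route(gate_logits, k=2):
    """Pick the k experts with the highest gate probability for one
    token, renormalizing so the selected weights sum to 1."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return [(i, probs[i] / total) for i in top]
```

Because only k experts run per token, a model can hold hundreds of billions of parameters while keeping per-token compute close to that of a much smaller dense model, which is what makes the long-sequence inference described above tractable.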