LTX-2: Efficient Joint Audio-Visual Foundation Model
Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, Eitan Richardson, Guy Shiran, Itay Chachy, Jonathan Chetboun, Michael Finkelson, Michael Kupchick, Nir Zabari, Nitzan Guetta, Noa Kotler, Ofir Bibi, Ori Gordon, Poriya Panet
2026-01-07
Summary
This paper introduces LTX-2, a new artificial intelligence model that can create videos *with* matching sound, unlike many existing models that produce only silent video. It generates the visuals and the audio together, at the same time, which makes the results more realistic and engaging.
What's the problem?
Current AI models are really good at making videos from text descriptions, but they only produce the visual part. This is a problem because sound is a huge part of how we experience videos – it adds emotion, tells us about the environment, and helps us understand what's happening. Without sound, these AI-generated videos feel incomplete and less believable.
What's the solution?
The researchers built LTX-2, which uses two 'streams' of processing, one for video and one for audio. The streams are connected through cross-attention layers so they can 'talk' to each other in both directions, keeping the sound synchronized with the picture. The video stream is larger (about 14 billion parameters versus 5 billion for audio) because generating realistic video is the harder task. They also use a multilingual text encoder so the model understands prompts in multiple languages, and they introduce a technique called modality-aware classifier-free guidance to better control how well the audio and video align with the prompt and with each other. Essentially, they made a system where the AI thinks about sound and video *together* from the start.
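To make the dual-stream idea concrete, here is a minimal PyTorch-style sketch of one asymmetric block in which each modality runs its own self-attention and the two streams then exchange information through bidirectional cross-attention. This is an illustration under stated assumptions, not the released LTX-2 code: the module names, dimensions, and wiring are all hypothetical.

```python
# Minimal sketch (assumptions, not the released LTX-2 code) of one asymmetric
# dual-stream block: each modality has its own self-attention, and the two
# streams are coupled by bidirectional audio-video cross-attention.
import torch
import torch.nn as nn


class DualStreamBlock(nn.Module):
    def __init__(self, video_dim=2048, audio_dim=1024, heads=16):
        super().__init__()
        # Per-modality self-attention; the video stream gets the wider dimension.
        self.video_self = nn.MultiheadAttention(video_dim, heads, batch_first=True)
        self.audio_self = nn.MultiheadAttention(audio_dim, heads, batch_first=True)
        # Bidirectional cross-attention: video attends to audio and vice versa.
        # kdim/vdim allow queries and keys/values to have different widths.
        self.video_from_audio = nn.MultiheadAttention(
            video_dim, heads, kdim=audio_dim, vdim=audio_dim, batch_first=True)
        self.audio_from_video = nn.MultiheadAttention(
            audio_dim, heads, kdim=video_dim, vdim=video_dim, batch_first=True)
        self.video_norm = nn.LayerNorm(video_dim)
        self.audio_norm = nn.LayerNorm(audio_dim)

    def forward(self, video_tokens, audio_tokens):
        # Self-attention within each modality (with residual connections).
        v = video_tokens + self.video_self(video_tokens, video_tokens, video_tokens)[0]
        a = audio_tokens + self.audio_self(audio_tokens, audio_tokens, audio_tokens)[0]
        # Cross-attention couples the two streams in both directions.
        v = v + self.video_from_audio(v, a, a)[0]
        a = a + self.audio_from_video(a, v, v)[0]
        return self.video_norm(v), self.audio_norm(a)


# Toy usage with random tokens: 32 video tokens, 16 audio tokens.
block = DualStreamBlock()
video = torch.randn(1, 32, 2048)
audio = torch.randn(1, 16, 1024)
video_out, audio_out = block(video, audio)
print(video_out.shape, audio_out.shape)  # (1, 32, 2048) and (1, 16, 1024)
```

The real model adds temporal positional embeddings and cross-modality AdaLN for shared timestep conditioning on top of this coupling, which the sketch omits.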
Why it matters?
This work is important because it brings us closer to AI that can create full, immersive audiovisual experiences. LTX-2 performs as well as, or even better than, some closed-source (non-publicly available) models, but it's open-source, meaning anyone can use and build upon it. This could lead to new tools for filmmakers, artists, and anyone who wants to create videos with AI, while running more efficiently than many existing methods.
Abstract
Recent text-to-video diffusion models can generate compelling video sequences, yet they remain silent -- missing the semantic, emotional, and atmospheric cues that audio provides. We introduce LTX-2, an open-source foundational model capable of generating high-quality, temporally synchronized audiovisual content in a unified manner. LTX-2 consists of an asymmetric dual-stream transformer with a 14B-parameter video stream and a 5B-parameter audio stream, coupled through bidirectional audio-video cross-attention layers with temporal positional embeddings and cross-modality AdaLN for shared timestep conditioning. This architecture enables efficient training and inference of a unified audiovisual model while allocating more capacity for video generation than audio generation. We employ a multilingual text encoder for broader prompt understanding and introduce a modality-aware classifier-free guidance (modality-CFG) mechanism for improved audiovisual alignment and controllability. Beyond generating speech, LTX-2 produces rich, coherent audio tracks that follow the characters, environment, style, and emotion of each scene -- complete with natural background and foley elements. In our evaluations, the model achieves state-of-the-art audiovisual quality and prompt adherence among open-source systems, while delivering results comparable to proprietary models at a fraction of their computational cost and inference time. All model weights and code are publicly released.
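The abstract names a modality-aware classifier-free guidance (modality-CFG) mechanism but does not spell out its formula here. A plausible reading, sketched below as an assumption rather than the paper's exact method, is standard classifier-free guidance with an independent guidance scale per modality, so prompt adherence for video and audio can be tuned separately. All function and parameter names are hypothetical.

```python
# Hedged sketch of a per-modality classifier-free guidance step. This is an
# assumed interpretation of "modality-CFG", not the paper's exact formulation.
import torch


def modality_cfg(model, video_latents, audio_latents, text_emb, null_emb,
                 t, video_scale=7.0, audio_scale=5.0):
    """One guidance step with independent scales per modality (names assumed)."""
    # Conditional and unconditional predictions for both streams.
    v_cond, a_cond = model(video_latents, audio_latents, text_emb, t)
    v_uncond, a_uncond = model(video_latents, audio_latents, null_emb, t)
    # Classifier-free guidance applied separately to each modality, so the
    # strength of text conditioning can differ between video and audio.
    v_guided = v_uncond + video_scale * (v_cond - v_uncond)
    a_guided = a_uncond + audio_scale * (a_cond - a_uncond)
    return v_guided, a_guided


# Toy stand-in denoiser, just to show the call shape.
def dummy_denoiser(v, a, text, t):
    return v * 0.9 + text.mean(), a * 0.9 + text.mean()


v = torch.randn(1, 32, 2048)
a = torch.randn(1, 16, 1024)
text, null = torch.randn(1, 77, 768), torch.zeros(1, 77, 768)
v_out, a_out = modality_cfg(dummy_denoiser, v, a, text, null, t=0.5)
print(v_out.shape, a_out.shape)
```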