Klear: Unified Multi-Task Audio-Video Joint Generation
Jun Wang, Chunyu Qiang, Yuxin Guo, Yiran Wang, Xijuan Zeng, Chen Zhang, Pengfei Wan
2026-01-08
Summary
This paper introduces Klear, a system that generates realistic audio and video together while following instructions. It aims to make joint audio-video generation substantially better than existing methods.
What's the problem?
Currently, creating audio and video together is difficult. Existing non-commercial systems often produce audio and video that don't quite match up; for example, lip movements may not align with the speech. They also struggle when one modality (audio or video) is weak or missing, and they don't perform well across different kinds of data. This happens because it's hard to get the audio and video to truly inform each other, because systems don't generalize well to new situations, and because there isn't enough good data with detailed descriptions of both audio and video.
What's the solution?
The researchers tackled these problems in three main ways. First, they designed a 'single-tower' architecture that tightly couples the audio and video processing, using unified building blocks and an attention mechanism that looks at all parts of the audio and video simultaneously (sketched below). Second, they developed a training method that gradually teaches the system to handle both audio and video, even when one is missing, and to build a strong understanding of how the two relate to the real world. Finally, they created a large new dataset of audio and video with detailed captions, using an automated pipeline to find and filter high-quality, well-matched examples.
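To make the 'single-tower' idea concrete, here is a minimal, hypothetical sketch in PyTorch: audio and video tokens are concatenated into one sequence so every token can attend to every other token, instead of keeping two separate towers bridged by cross-attention. All class names, dimensions, and design details below are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class JointAVBlock(nn.Module):
    """Sketch of one block that attends jointly over audio + video tokens.

    Hypothetical illustration of a 'single-tower' layer; the paper's
    unified DiT blocks and Omni-Full Attention may differ in detail.
    """

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, audio: torch.Tensor, video: torch.Tensor):
        # Concatenate along the sequence axis: (B, T_audio + T_video, D),
        # so attention spans both modalities at once.
        x = torch.cat([audio, video], dim=1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        x = x + self.mlp(self.norm2(x))
        # Split back into per-modality streams for the next stage.
        return x[:, : audio.shape[1]], x[:, audio.shape[1]:]


if __name__ == "__main__":
    block = JointAVBlock()
    a = torch.randn(2, 50, 256)   # audio tokens (batch, time, dim)
    v = torch.randn(2, 200, 256)  # video tokens (batch, space-time, dim)
    a_out, v_out = block(a, v)
    print(a_out.shape, v_out.shape)
```

The point of this design, as the summary describes it, is that audio and video are processed in one shared stack rather than in two towers, so the alignment between them is learned directly in the attention weights.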
Why it matters?
This work is important because it significantly improves the quality and realism of generated audio and video. Klear performs better than previous methods, even rivaling Veo 3, a state-of-the-art commercial system, and it handles a wider range of situations. This is a big step toward more advanced and versatile audio-video synthesis tools, opening up possibilities such as more realistic virtual assistants, better video editing, and more immersive experiences.
Abstract
Audio-video joint generation has progressed rapidly, yet substantial challenges remain. Non-commercial approaches still suffer from audio-visual asynchrony, poor lip-speech alignment, and unimodal degradation, which stem from weak audio-visual correspondence modeling, limited generalization, and scarce high-quality dense-caption data. To address these issues, we introduce Klear and delve into three axes: model architecture, training strategy, and data curation. Architecturally, we adopt a single-tower design with unified DiT blocks and an Omni-Full Attention mechanism, achieving tight audio-visual alignment and strong scalability. Training-wise, we adopt a progressive multitask regime, from random modality masking to joint optimization across tasks, together with a multistage curriculum, yielding robust representations, strengthening A-V aligned world knowledge, and preventing unimodal collapse. For datasets, we present the first large-scale audio-video dataset with dense captions, and introduce a novel automated data-construction pipeline that annotates and filters millions of diverse, high-quality, strictly aligned audio-video-caption triplets. Building on this, Klear scales to large datasets, delivering high-fidelity, semantically and temporally aligned, instruction-following generation in both joint and unimodal settings while generalizing robustly to out-of-distribution scenarios. Across tasks, it outperforms prior methods by a large margin and achieves performance comparable to Veo 3, offering a unified, scalable path toward next-generation audio-video synthesis.
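The 'random modality masking' the abstract mentions can be illustrated with a short sketch. Assuming masking is implemented by dropping one modality's tokens during a training step (our assumption; the paper may use a different placeholder or schedule), the idea is that the model is sometimes forced to generate video without audio, or audio without video, which is what keeps each unimodal path from degrading:

```python
import random
import torch


def mask_modalities(audio_tokens, video_tokens,
                    p_drop_audio=0.15, p_drop_video=0.15):
    """Hypothetical random modality masking for multitask training.

    With some probability the audio or video tokens are replaced by a
    zero placeholder, turning a joint-generation step into a unimodal
    one. The probabilities and the zeroing strategy are illustrative
    assumptions, not values from the paper.
    """
    r = random.random()
    if r < p_drop_audio:
        audio_tokens = torch.zeros_like(audio_tokens)   # video-only step
    elif r < p_drop_audio + p_drop_video:
        video_tokens = torch.zeros_like(video_tokens)   # audio-only step
    return audio_tokens, video_tokens


# Example: over many steps this yields roughly 70% joint training,
# 15% video-only training, and 15% audio-only training.
a, v = torch.randn(2, 50, 256), torch.randn(2, 200, 256)
a_masked, v_masked = mask_modalities(a, v)
```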