OmniForcing: Unleashing Real-time Joint Audio-Visual Generation
Yaofeng Su, Yuming Li, Zeyue Xue, Jie Huang, Siming Fu, Haoran Li, Ying Li, Zezhong Qian, Haoyang Huang, Nan Duan
2026-03-16
Summary
This paper introduces a new technique called OmniForcing to make AI models that generate both audio and video much faster, specifically for real-time applications like creating content on the fly.
What's the problem?
Current AI models that generate audio and video together produce really high-quality results, but they're slow because they need to look back and forth between the audio and video to make sure everything lines up correctly. This 'looking back and forth' takes time, making it hard to use these models in situations where you need an immediate result, like live streaming or video editing. Trying to simplify these models to make them faster often leads to instability during training and a loss of quality, especially because audio and video information naturally arrive at very different rates.
What's the solution?
OmniForcing solves this by essentially 'teaching' a faster, one-way model to mimic a slower, more accurate model. It does this in a few key ways: first, it uses a special alignment technique to keep the audio and video synchronized even though they're processed differently. Second, it handles the fact that audio has fewer 'pieces of information' than video by creating a way for the model to focus on the important audio parts. Finally, it allows the model to learn from its own mistakes during the generation process, constantly correcting itself to maintain quality. The system also uses a clever way to store and reuse past information, making it even faster.
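The 'clever way to store and reuse past information' refers to the paper's rolling KV-cache, where each modality keeps its own bounded cache so audio and video can advance at different rates. The sketch below is a heavily simplified illustration of that idea, not the paper's actual implementation; all class and variable names here are invented for the example.

```python
from collections import deque

class RollingKVCache:
    """Fixed-capacity key/value cache that automatically drops the
    oldest timesteps once full, so memory stays bounded while the
    model streams indefinitely. Illustrative only."""

    def __init__(self, capacity: int):
        self.keys: deque = deque(maxlen=capacity)
        self.values: deque = deque(maxlen=capacity)

    def append(self, k, v):
        # deque with maxlen evicts the oldest entry automatically
        self.keys.append(k)
        self.values.append(v)

    def context(self):
        # Snapshot of everything currently attendable
        return list(self.keys), list(self.values)

# Each modality keeps its own independent cache, so sparse audio
# tokens never block the denser video stream (and vice versa).
caches = {
    "video": RollingKVCache(capacity=8),
    "audio": RollingKVCache(capacity=2),
}

for step in range(10):
    caches["video"].append(f"vk{step}", f"vv{step}")
    if step % 4 == 0:  # audio tokens arrive far less often than video
        caches["audio"].append(f"ak{step}", f"av{step}")

ks, vs = caches["video"].context()
print(len(ks))  # 8: only the most recent video timesteps are kept
```

Because each cache rolls independently, eviction in one modality never forces the other to drop context it still needs, which is what lets streaming stay fast without resynchronizing the two streams.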
Why it matters?
This research is important because it allows for the creation of high-quality audio and video content in real-time. This opens up possibilities for things like instant video creation from text prompts, live-generated soundtracks for videos, and more interactive and dynamic multimedia experiences. The speed improvements mean these powerful AI tools can be used in a wider range of applications without needing expensive hardware or long processing times.
Abstract
Recent joint audio-visual diffusion models achieve remarkable generation quality but suffer from high latency due to their bidirectional attention dependencies, hindering real-time applications. We propose OmniForcing, the first framework to distill an offline, dual-stream bidirectional diffusion model into a high-fidelity streaming autoregressive generator. However, naively applying causal distillation to such dual-stream architectures triggers severe training instability, due to the extreme temporal asymmetry between modalities and the resulting token sparsity. We address the inherent information density gap by introducing an Asymmetric Block-Causal Alignment with a zero-truncation Global Prefix that prevents multi-modal synchronization drift. The gradient explosion caused by extreme audio token sparsity during the causal shift is further resolved through an Audio Sink Token mechanism equipped with an Identity RoPE constraint. Finally, a Joint Self-Forcing Distillation paradigm enables the model to dynamically self-correct cumulative cross-modal errors from exposure bias during long rollouts. Empowered by a modality-independent rolling KV-cache inference scheme, OmniForcing achieves state-of-the-art streaming generation at ~25 FPS on a single GPU, maintaining multi-modal synchronization and visual quality on par with the bidirectional teacher. Project Page: https://omniforcing.com
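To make the Audio Sink Token idea concrete: when a causal block contains few or no audio tokens, attention over an empty key set is ill-defined and gradients can blow up, so a dedicated sink key/value gives attention a stable target. The sketch below, assuming simple dot-product attention, shows only this numerical-stability aspect; the paper's Identity RoPE constraint (pinning the sink's rotary position) and all names here are not from the source and are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend_with_sink(q, audio_k, audio_v, sink_k, sink_v):
    """Prepend a dedicated sink key/value before the (possibly empty)
    set of audio tokens so attention always has at least one target."""
    k = np.concatenate([sink_k, audio_k], axis=0)
    v = np.concatenate([sink_v, audio_v], axis=0)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

d = 4
rng = np.random.default_rng(0)
sink_k = rng.normal(size=(1, d))
sink_v = rng.normal(size=(1, d))
q = rng.normal(size=(2, d))

# Even with ZERO audio tokens in the current block, the output is
# well defined: all attention weight falls on the sink token.
out = attend_with_sink(q, np.empty((0, d)), np.empty((0, d)), sink_k, sink_v)
print(out.shape)  # (2, 4)
```

Without the sink, the empty-block case would produce a softmax over zero elements (NaNs and exploding gradients in training); with it, the sparse-audio causal shift degrades gracefully.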