Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation
Yupeng Zhou, Lianghua Huang, Zhifan Wu, Jiabao Wang, Yupeng Shi, Biao Jiang, Daquan Zhou, Yu Liu, Ming-Ming Cheng, Qibin Hou
2026-04-29
Summary
This paper introduces a technique called Mutual Forcing for generating realistic audio and video together, for example a video of a person speaking along with their matching voice, and it does so much faster than previous methods.
What's the problem?
Creating audio and video at the same time is hard because the model needs to understand how sound and visuals relate to each other. Existing fast (streaming) methods usually get their speed by first training a large, slower model and then converting it into a fast one through several distillation stages, which is complicated and can hurt quality. The goal was to build a system that generates high-quality audio and video *directly* and *quickly*, without that complicated multi-stage setup.
What's the solution?
The researchers used a two-part approach. First, they trained separate models to generate audio and video individually. Then, they combined these models and trained them together on paired data so the system learns how sound and visuals line up. The key innovation, Mutual Forcing, gives a single model two generation modes that share the same weights: a fast mode that uses only a few denoising steps, and a slower, more careful mode that uses many steps. The careful mode acts as a teacher for the fast mode (self-distillation), while the fast mode generates the 'history' the model conditions on during training, so training looks like real inference. Because both modes live in the same model, each improvement feeds the other, and no separate, already-trained teacher model is needed.
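To make the idea concrete, here is a minimal PyTorch-style sketch of one training step, assuming a hypothetical weight-shared model that exposes `generate_context` and `generate_next_chunk` methods with a configurable number of denoising steps; the method names, losses, and step counts are illustrative, not the authors' actual implementation.

```python
# Illustrative sketch of dual-mode self-evolution: one weight-shared model,
# run in a fast few-step mode and a slower multi-step mode.
import torch


def mutual_forcing_step(model, optimizer, batch, few_steps=4, many_steps=50):
    # 1) Few-step mode rolls out the historical context autoregressively,
    #    so training conditions on the same kind of history the model will
    #    actually produce at inference time (training-inference consistency).
    with torch.no_grad():
        context = model.generate_context(batch.conditioning, num_steps=few_steps)

    # 2) Multi-step mode (same weights, more denoising steps) produces a
    #    higher-quality prediction of the next audio-video chunk.
    with torch.no_grad():
        teacher_chunk = model.generate_next_chunk(context, num_steps=many_steps)

    # 3) Few-step mode predicts the same chunk and is pulled toward both the
    #    multi-step output (self-distillation) and the real paired data, so
    #    the model is not limited by a fixed external teacher.
    student_chunk = model.generate_next_chunk(context, num_steps=few_steps)
    loss = torch.nn.functional.mse_loss(student_chunk, teacher_chunk) \
         + torch.nn.functional.mse_loss(student_chunk, batch.real_next_chunk)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the two modes share parameters, any improvement learned by the fast mode also sharpens the multi-step teacher as training proceeds, which is how the "mutual" reinforcement in the name arises.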
Why it matters?
This work is important because it makes generating synchronized audio and video much more efficient. It matches or beats existing methods while using far fewer sampling steps (4 to 8 instead of around 50), so it is faster and uses less computing power. This could be useful for creating realistic virtual assistants, generating content for games, or even helping people with communication difficulties.
Abstract
In this work, we propose Mutual Forcing, a framework for fast autoregressive audio-video generation with long-horizon audio-video synchronization. Our approach addresses two key challenges: joint audio-video modeling and fast autoregressive generation. To ease joint audio-video optimization, we adopt a two-stage training strategy: we first train uni-modal generators and then couple them into a unified audio-video model for joint training on paired data. For streaming generation, we ask whether a native fast causal audio-video model can be trained directly, instead of following existing streaming distillation pipelines that typically train a bidirectional model first and then convert it into a causal generator through multiple distillation stages. Our answer is Mutual Forcing, which builds directly on a native autoregressive model and integrates few-step and multi-step generation within a single weight-shared model, enabling self-distillation and improved training-inference consistency. The multi-step mode improves the few-step mode via self-distillation, while the few-step mode generates historical context during training to improve training-inference consistency; because the two modes share parameters, these two effects reinforce each other within a single model. Compared with prior approaches such as Self-Forcing, Mutual Forcing removes the need for an additional bidirectional teacher model, supports more flexible training sequence lengths, reduces training overhead, and allows the model to improve directly from real paired data rather than a fixed teacher. Experiments show that Mutual Forcing matches or surpasses strong baselines that require around 50 sampling steps while using only 4 to 8 steps, demonstrating substantial advantages in both efficiency and quality. The project page is available at https://mutualforcing.github.io.
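For readers who want a picture of the two-stage schedule mentioned in the abstract, the sketch below uses placeholder modules and a generic training loop; none of the class or function names are the paper's actual components, and the real coupling mechanism between the audio and video streams is not shown.

```python
# Sketch of the two-stage recipe: uni-modal pretraining, then joint training
# on paired data. All names here are placeholders, not the authors' code.
import torch
import torch.nn as nn


class JointAudioVideoModel(nn.Module):
    """Couples two pretrained uni-modal generators into one model."""
    def __init__(self, audio_gen: nn.Module, video_gen: nn.Module):
        super().__init__()
        self.audio_gen = audio_gen
        self.video_gen = video_gen

    def forward(self, cond):
        # A real implementation would let the two branches exchange
        # information (e.g. via cross-attention); here they only share
        # the conditioning input.
        return self.audio_gen(cond), self.video_gen(cond)


def train(model: nn.Module, batches, loss_fn, lr: float = 1e-4):
    """Generic loop: `batches` yields (conditioning, target) pairs."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for cond, target in batches:
        loss = loss_fn(model(cond), target)
        opt.zero_grad()
        loss.backward()
        opt.step()


# Stage 1: train each uni-modal generator on its own corpus.
#   train(audio_gen, audio_batches, audio_loss)
#   train(video_gen, video_batches, video_loss)
# Stage 2: couple them and continue training on paired audio-video data,
# which is where long-horizon synchronization is learned.
#   joint = JointAudioVideoModel(audio_gen, video_gen)
#   train(joint, paired_batches, joint_audio_video_loss)
```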