FlowAct-R1: Towards Interactive Humanoid Video Generation
Lizhen Wang, Yongming Zhu, Zhipeng Ge, Youwei Zheng, Longhao Zhang, Tianshu Hu, Shiyang Qin, Mingshuang Luo, Jiaxu Zhang, Xin Chen, Yulong Wang, Zerong Zheng, Jianwen Jiang, Chao Liang, Weifeng Chen, Xing Wang, Yuan Zhang, Mingyuan Gao
2026-01-16
Summary
This paper introduces FlowAct-R1, a new system for creating realistic videos of human-like characters that can interact with people in real time.
What's the problem?
Creating videos of realistic characters that respond instantly to user input is really hard. Existing methods either make the videos look amazing but are too slow for a conversation, or they're fast enough for interaction but the video quality isn't very good. In short, there's a trade-off between looking good and responding quickly, and keeping the character consistent over long interactions is also a challenge, since small errors tend to build up frame by frame.
What's the solution?
The researchers built FlowAct-R1 as a diffusion model on top of the MMDiT (multimodal diffusion transformer) architecture. It generates video in small chunks, a strategy called chunkwise diffusion forcing, and adds a 'self-forcing' variant to keep errors from accumulating, so the character's movements stay natural and consistent across a long interaction. With efficient distillation and system-level optimizations, the system runs at a stable 25 frames per second at 480p, with a delay of only about 1.5 seconds before the first frame appears.
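The paper doesn't include code, but a minimal sketch may help make the chunkwise, self-forcing rollout concrete. Everything below is illustrative: `denoise_chunk`, the chunk and context sizes, and the toy "denoiser" are assumptions standing in for the distilled MMDiT model and its control signals.

```python
import numpy as np

CHUNK_FRAMES = 8              # frames generated per chunk (assumed value)
CONTEXT_FRAMES = 4            # generated frames fed back as conditioning (assumed)
FRAME_SHAPE = (480, 854, 3)   # 480p RGB frames

rng = np.random.default_rng(0)

def denoise_chunk(noise, context):
    """Toy stand-in for the diffusion denoiser.

    A real system would run a few distilled denoising steps of an
    MMDiT-style network conditioned on the context frames and control
    signals; here we just blend toward the last context frame so the
    sketch runs end to end.
    """
    last = context[-1] if context else np.zeros(FRAME_SHAPE, dtype=np.float32)
    return 0.9 * last + 0.1 * noise

def stream_video(num_chunks):
    """Chunkwise rollout in the spirit of diffusion/self-forcing: each new
    chunk is conditioned on the model's *own* previous outputs rather than
    ground-truth frames, so errors are less likely to compound at
    inference time."""
    context = []
    for _ in range(num_chunks):
        noise = rng.standard_normal((CHUNK_FRAMES, *FRAME_SHAPE)).astype(np.float32)
        chunk = np.stack([denoise_chunk(n, context) for n in noise])
        yield chunk                               # stream this chunk to the client
        context = list(chunk[-CONTEXT_FRAMES:])   # feed own outputs forward

for i, chunk in enumerate(stream_video(num_chunks=3)):
    print(f"chunk {i}: shape {chunk.shape}")      # (8, 480, 854, 3)
```

The key design choice the sketch captures is the feedback loop at the end of `stream_video`: conditioning on generated frames instead of ground truth is what lets errors be corrected rather than compounded over arbitrarily long streams.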
Why does it matter?
This work is important because it brings us closer to truly interactive virtual characters. Imagine having a realistic conversation with a digital person in a video game, or using a virtual assistant that feels more natural and responsive. FlowAct-R1 makes these kinds of applications more feasible by generating high-quality, interactive humanoid video in real time.
Abstract
Interactive humanoid video generation aims to synthesize lifelike visual agents that can engage with humans through continuous and responsive video. Despite recent advances in video synthesis, existing methods often grapple with the trade-off between high-fidelity synthesis and real-time interaction requirements. In this paper, we propose FlowAct-R1, a framework specifically designed for real-time interactive humanoid video generation. Built upon an MMDiT architecture, FlowAct-R1 enables streaming synthesis of videos of arbitrary duration while maintaining low-latency responsiveness. We introduce a chunkwise diffusion forcing strategy, complemented by a novel self-forcing variant, to alleviate error accumulation and ensure long-term temporal consistency during continuous interaction. By leveraging efficient distillation and system-level optimizations, our framework achieves a stable 25 fps at 480p resolution with a time-to-first-frame (TTFF) of only around 1.5 seconds. The proposed method provides holistic and fine-grained full-body control, enabling the agent to transition naturally between diverse behavioral states in interactive scenarios. Experimental results demonstrate that FlowAct-R1 achieves exceptional behavioral vividness and perceptual realism, while maintaining robust generalization across diverse character styles.
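To put the reported throughput and latency figures in perspective, here is a quick back-of-the-envelope check; the chunk size is a hypothetical value for illustration, not taken from the paper.

```python
# Real-time budget implied by the reported 25 fps throughput.
FPS = 25
FRAME_BUDGET_MS = 1000 / FPS                        # 40 ms per frame
CHUNK_FRAMES = 8                                    # hypothetical chunk size
CHUNK_BUDGET_MS = CHUNK_FRAMES * FRAME_BUDGET_MS    # 320 ms per chunk

print(f"per-frame budget: {FRAME_BUDGET_MS:.0f} ms")
print(f"per-chunk budget ({CHUNK_FRAMES} frames): {CHUNK_BUDGET_MS:.0f} ms")
# To sustain streaming, denoising + decoding + overhead for one chunk must
# fit inside its budget; the ~1.5 s TTFF is a one-time startup cost before
# the first chunk is ready.
```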