The model delivers vivid, perceptually realistic behavior, capturing subtle human nuances so that transitions across complex interactive states look natural. From a single reference image, it maintains high-fidelity synthesis across diverse character styles. FlowAct-R1 comprises a training stage and an inference stage: the base full-attention DiT is first converted into a streaming autoregressive model via autoregressive adaptation, then jointly finetuned on audio and motion to improve lip-sync and body movement.
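FlowAct-R1's exact adaptation recipe is not spelled out here, but the core idea of converting full attention into a streaming autoregressive model can be illustrated with a block-causal attention mask. This is a minimal sketch under assumptions; the function name and layout below are illustrative, not from the actual model:

```python
def block_causal_mask(num_frames, tokens_per_frame):
    """Illustrative block-causal mask: tokens within one frame attend to
    each other freely, but each frame attends only to itself and to past
    frames. Returns a nested list where True means attention is allowed."""
    n = num_frames * tokens_per_frame
    frame_of = [t // tokens_per_frame for t in range(n)]  # frame id per token
    # query at frame i may attend to key at frame j iff j <= i
    return [[frame_of[q] >= frame_of[k] for k in range(n)] for q in range(n)]

# A full-attention DiT corresponds to an all-True mask; autoregressive
# adaptation replaces it with this block-causal variant, so frames can be
# generated and streamed one at a time.
mask = block_causal_mask(num_frames=3, tokens_per_frame=2)
```

With this masking, each newly generated frame only needs keys and values from earlier frames, which is what makes chunk-by-chunk streaming inference possible.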
FlowAct-R1 is highly responsive, showing strong potential for real-time, low-latency instant-communication scenarios. It is robust to varied character and motion styles and outperforms state-of-the-art methods in human preference evaluations. Because generation runs in real time and can continue for unbounded durations, interaction remains seamless, making the framework well suited to applications such as livestreaming and video conferencing.
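The text does not specify how unbounded-duration streaming is scheduled, but one common pattern is to condition each new chunk on a bounded window of recent chunks so memory stays constant regardless of rollout length. The sketch below is a hypothetical illustration of that pattern, not FlowAct-R1's actual inference loop:

```python
from collections import deque

def stream_chunks(generate_chunk, num_chunks, context_len=4):
    """Illustrative streaming rollout: each chunk is conditioned only on a
    bounded window of recent chunks, so memory use is constant and the
    rollout can, in principle, continue indefinitely."""
    context = deque(maxlen=context_len)  # bounded history
    out = []
    for step in range(num_chunks):
        chunk = generate_chunk(list(context), step)  # model call stands in here
        context.append(chunk)
        out.append(chunk)
    return out

# Toy "model": a chunk records its step and how much context it saw.
chunks = stream_chunks(lambda ctx, step: (step, len(ctx)),
                       num_chunks=6, context_len=2)
# The visible context saturates at context_len, however long the rollout runs.
```

A real system would additionally interleave audio features and enforce a latency budget per chunk, but the bounded-context structure is what makes "infinite duration" feasible.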


