Knot Forcing: Taming Autoregressive Video Diffusion Models for Real-time Infinite Interactive Portrait Animation
Steven Xiao, Xindi Zhang, Dechao Meng, Qi Wang, Peng Zhang, Bang Zhang
2025-12-30
Summary
This paper introduces a new method, called Knot Forcing, for creating realistic, real-time animations of faces. It's designed for interactive applications like virtual assistants or live avatars, where a reference image supplies the identity and the animation responds to your expressions in real time.
What's the problem?
Existing methods for animating faces in real-time have drawbacks. Techniques that produce high-quality results often can't run fast enough for interactive use because they need to see the entire video before generating anything. Faster methods that generate frame-by-frame can suffer from errors building up over time, leading to jerky movements and inconsistencies, especially in longer animations.
What's the solution?
Knot Forcing tackles these issues in three main ways. First, it generates the animation in small chunks, remembering key details from the original reference image to maintain a consistent identity. Second, it uses a 'temporal knot' to blend these chunks together smoothly, preventing abrupt transitions. Finally, it cleverly adjusts the reference image over time to ensure the animation stays coherent and realistic even over long sequences.
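The chunking-and-blending idea can be sketched in plain Python. This is a hypothetical illustration, not the paper's code: `make_chunks` splits a frame sequence into chunks that share a few "knot" frames with their neighbor, and `blend_overlap` cross-fades the shared frames so transitions stay smooth.

```python
def make_chunks(num_frames, chunk_size, overlap):
    """Split frame indices into overlapping chunks.

    Each chunk shares `overlap` trailing frames with the next chunk,
    so the shared "knot" frames can condition the next chunk's start.
    """
    chunks = []
    start = 0
    while start < num_frames:
        end = min(start + chunk_size, num_frames)
        chunks.append(list(range(start, end)))
        if end == num_frames:
            break
        start = end - overlap  # step back so adjacent chunks overlap
    return chunks


def blend_overlap(prev_tail, next_head):
    """Linearly cross-fade the overlapping frames of two chunks.

    `prev_tail` and `next_head` are the same temporal positions as
    generated by the previous and next chunk; we ramp from the old
    chunk's values toward the new chunk's values.
    """
    n = len(prev_tail)
    return [
        (1 - (i + 1) / (n + 1)) * a + ((i + 1) / (n + 1)) * b
        for i, (a, b) in enumerate(zip(prev_tail, next_head))
    ]
```

For example, `make_chunks(10, 4, 1)` yields `[[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]`: frame 3 and frame 6 are the knot frames stitching neighboring chunks together. The actual method conditions the next chunk on these frames inside the diffusion model rather than blending pixels, but the overlap bookkeeping is the same.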
Why it matters?
This research is important because it allows for high-quality, real-time facial animation on standard computer hardware. This opens up possibilities for more engaging and responsive virtual experiences, like more natural-looking avatars in video calls or more believable virtual assistants.
Abstract
Real-time portrait animation is essential for interactive applications such as virtual assistants and live avatars, requiring high visual fidelity, temporal coherence, ultra-low latency, and responsive control from dynamic inputs like reference images and driving signals. While diffusion-based models achieve strong quality, their non-causal nature hinders streaming deployment. Causal autoregressive video generation approaches enable efficient frame-by-frame generation but suffer from error accumulation, motion discontinuities at chunk boundaries, and degraded long-term consistency. In this work, we present a novel streaming framework named Knot Forcing for real-time portrait animation that addresses these challenges through three key designs: (1) a chunk-wise generation strategy with global identity preservation via cached KV states of the reference image and local temporal modeling using sliding window attention; (2) a temporal knot module that overlaps adjacent chunks and propagates spatio-temporal cues via image-to-video conditioning to smooth inter-chunk motion transitions; and (3) a "running ahead" mechanism that dynamically updates the reference frame's temporal coordinate during inference, keeping its semantic context ahead of the current rollout frame to support long-term coherence. Knot Forcing enables high-fidelity, temporally consistent, and interactive portrait animation over infinite sequences, achieving real-time performance with strong visual stability on consumer-grade GPUs.
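Design (1) combines a global anchor with local causal attention. The sketch below builds a boolean attention mask illustrating that pattern; it is an assumption-laden toy (the function name, mask layout, and parameters are illustrative, not from the paper): the first key slots hold the cached reference-image KV states and are visible to every frame, while past frames are only attendable inside a causal sliding window.

```python
def attention_mask(num_frames, window, num_ref):
    """Build mask[q][k]: may query frame q attend to key slot k?

    Key layout: the first `num_ref` slots are the cached
    reference-image KV states (always visible -- global identity
    preservation); the remaining `num_frames` slots are generated
    frames, attended causally within a window of size `window`
    (local temporal modeling).
    """
    total = num_ref + num_frames
    mask = [[False] * total for _ in range(num_frames)]
    for q in range(num_frames):
        for r in range(num_ref):
            mask[q][r] = True              # reference KV always visible
        lo = max(0, q - window + 1)
        for k in range(lo, q + 1):         # causal sliding window
            mask[q][num_ref + k] = True
    return mask
```

With `attention_mask(4, 2, 1)`, frame 3 attends to the reference slot plus frames 2 and 3, but not to frames 0 or 1: memory cost stays constant as the sequence grows, which is what makes infinite-length streaming feasible. The "running ahead" mechanism of design (3) would additionally shift the reference slot's temporal position embedding ahead of the current frame, a detail this toy mask omits.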