The core of ACTalker’s architecture is a parallel Mamba structure with multiple branches, each dedicated to a different control signal, such as audio or facial motion. Each branch manipulates feature tokens across both temporal and spatial dimensions, so each input signal can independently influence specific facial regions. A gating mechanism across all branches provides flexible, dynamic control and enables seamless switching between single- and multi-signal modes. To further prevent conflicts between signals and ensure natural coordination, ACTalker introduces a mask-drop strategy that restricts each control signal to its assigned facial areas, keeping audio-driven lip movements and motion-driven expressions distinct yet harmonized in the generated video.
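A minimal PyTorch sketch of this layout may help make it concrete. Everything here is an assumption for illustration, not the paper's implementation: the names `ControlBranch` and `GatedParallelBranches` are hypothetical, a GRU stands in for the actual selective SSM block, and the masks implement the mask-drop idea by restricting each branch's update to its assigned facial region.

```python
import torch
import torch.nn as nn

class ControlBranch(nn.Module):
    """One parallel branch; a GRU stands in for the selective SSM (Mamba)."""
    def __init__(self, dim: int):
        super().__init__()
        self.ssm = nn.GRU(dim, dim, batch_first=True)  # placeholder for a Mamba block
        self.proj = nn.Linear(dim, dim)

    def forward(self, tokens: torch.Tensor, signal: torch.Tensor) -> torch.Tensor:
        # Condition the feature tokens on the control signal, then scan
        # along the flattened spatio-temporal token axis.
        out, _ = self.ssm(tokens + signal)
        return self.proj(out)

class GatedParallelBranches(nn.Module):
    """Parallel control branches combined through learned scalar gates."""
    def __init__(self, dim: int, num_branches: int = 2):
        super().__init__()
        self.branches = nn.ModuleList([ControlBranch(dim) for _ in range(num_branches)])
        self.gates = nn.Parameter(torch.ones(num_branches))

    def forward(self, tokens, signals, masks):
        out = tokens
        for gate, branch, signal, mask in zip(self.gates, self.branches, signals, masks):
            # Mask-drop: each branch only updates tokens in its assigned
            # facial region (mask is 1 where this signal should act).
            out = out + gate * mask * branch(tokens, signal)
        return out

# Usage with toy shapes: audio drives the mouth tokens, motion drives the rest.
B, L, D = 1, 64, 128              # batch, flattened spatio-temporal tokens, channels
tokens = torch.randn(B, L, D)
audio = torch.randn(B, L, D)      # audio signal broadcast over the token grid
motion = torch.randn(B, L, D)     # facial-motion signal
mouth_mask = torch.zeros(B, L, 1)
mouth_mask[:, :16] = 1.0          # pretend the first 16 tokens cover the mouth
model = GatedParallelBranches(D)
out = model(tokens, [audio, motion], [mouth_mask, 1.0 - mouth_mask])
print(out.shape)                  # torch.Size([1, 64, 128])
```

Because each branch writes only into its masked region and contributes through its own gate, the two control signals never compete for the same tokens, which is the intuition behind the conflict-free coordination described above.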
ACTalker incorporates a selective state-space model (SSM) to efficiently aggregate contextual information from the control signals, replacing traditional attention mechanisms for better computational efficiency. The design also preserves the identity of the source subject by blending identity features with diffusion noise, keeping the subject’s appearance consistent across generated frames. The system uses a VAE encoder to embed reference images, Whisper to embed audio, and a motion encoder to extract facial motion cues. Extensive experiments and ablation studies show that ACTalker produces realistic, temporally coherent talking head videos and outperforms existing methods in both single- and multi-signal control scenarios, making it a powerful tool for precise, expressive, and controllable talking head generation.
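The encoding pipeline can be illustrated with a similarly hedged sketch. The `StubEncoder` class and every dimension below are placeholders, not the real VAE, Whisper, or motion-encoder interfaces; the point is only the data flow, identity features blended with noise plus per-signal conditioning embeddings.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the three encoders named in the text; in the
# real system these would be a pretrained VAE, Whisper, and a motion encoder.
class StubEncoder(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.net = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

vae_encoder = StubEncoder(3 * 64 * 64, 128)  # reference image -> identity features
whisper_embed = StubEncoder(80, 128)         # audio features -> audio embedding
motion_encoder = StubEncoder(68 * 2, 128)    # facial landmarks -> motion embedding

ref_image = torch.randn(1, 3 * 64 * 64)      # flattened source frame (toy size)
audio = torch.randn(1, 80)                   # one window of audio features
motion = torch.randn(1, 68 * 2)              # one frame of landmark coordinates

identity = vae_encoder(ref_image)
noise = torch.randn_like(identity)
# Blend identity features with diffusion noise so the subject's appearance
# stays consistent across generated frames.
latent = torch.cat([identity, noise], dim=-1)

cond = {"audio": whisper_embed(audio), "motion": motion_encoder(motion)}
print(latent.shape, cond["audio"].shape)  # torch.Size([1, 256]) torch.Size([1, 128])
```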
Key features include:
- End-to-end video diffusion framework for talking head generation
- Supports both multi-signal (audio and facial motion) and single-signal control
- Parallel Mamba structure with region-specific control branches
- Gating mechanism for flexible signal management (see the sketch after this list)
- Mask-drop strategy to prevent control conflicts and enhance realism
- Selective state-space model for efficient contextual aggregation
- Identity preservation for consistent subject appearance
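Continuing the toy example above, switching between multi-signal and single-signal control then amounts to zeroing a branch's gate; again, all names and tensors here are hypothetical illustrations rather than the released code.

```python
import torch

# Toy illustration of gated mode switching: a zeroed gate silences one
# control branch, turning multi-signal control into single-signal control.
def combine(gates: torch.Tensor, outs: list[torch.Tensor]) -> torch.Tensor:
    # Weighted sum of per-branch outputs.
    return sum(g * o for g, o in zip(gates, outs))

audio_out = torch.randn(1, 64, 128)   # hypothetical audio-branch output
motion_out = torch.randn(1, 64, 128)  # hypothetical motion-branch output

multi = combine(torch.tensor([1.0, 1.0]), [audio_out, motion_out])       # both signals
audio_only = combine(torch.tensor([1.0, 0.0]), [audio_out, motion_out])  # audio-driven only
```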