A key innovation of Uni3C is its global 3D world guidance, which places scene geometry and human characters in a single coordinate space at inference time. Because both live in the same world frame, camera trajectories and human motions can be controlled jointly and the generated video stays 3D-consistent. The system uses scene point clouds for camera control and SMPL-X characters for human animation, and relates the two through 2D keypoints and rigid transformations. This makes complex motion transfer possible, including cases where the reference motion comes from a different video or domain, such as animation or real-world footage. Uni3C generalizes well and remains robust in challenging scenarios, including control of detailed hand movements and adaptation to dynamic camera viewpoints.

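
To make the role of a rigid transformation concrete, here is a minimal sketch of aligning points from a character's local frame to a scene's world frame. It assumes that 3D correspondences are already available (for example, body joints matched to 2D keypoints that have been lifted to 3D with depth); that step, the function name `estimate_rigid_transform`, and the synthetic data are illustrative assumptions, not Uni3C's documented procedure. The solver is the standard Kabsch least-squares method.

```python
import numpy as np

def estimate_rigid_transform(src, dst):
    """Least-squares rigid transform (R, t) such that dst ~= src @ R.T + t.

    Standard Kabsch algorithm; a hypothetical helper, not part of Uni3C.
    src, dst: (N, 3) arrays of corresponding 3D points, N >= 3.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

# Synthetic example: joints in a character-local frame (assumed data).
rng = np.random.default_rng(0)
human_local = rng.normal(size=(10, 3))

# Ground-truth placement of the character in the scene's world frame.
angle = np.deg2rad(30.0)
R_true = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle),  np.cos(angle), 0.0],
    [0.0,            0.0,           1.0],
])
t_true = np.array([0.5, -1.0, 2.0])
human_world = human_local @ R_true.T + t_true

# Recover the transform and move the character into the world frame.
R_est, t_est = estimate_rigid_transform(human_local, human_world)
aligned = human_local @ R_est.T + t_est
print("max alignment error:", np.abs(aligned - human_world).max())
```

Once the character and the scene point cloud share one frame in this way, a single camera pose can be applied to both, which is the property the unified guidance relies on.
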
Extensive benchmarking shows that Uni3C significantly outperforms existing methods in both camera controllability and human motion quality. The framework has been validated on newly developed datasets featuring challenging camera movements and intricate human actions, as well as on out-of-distribution test sets with diverse camera trajectories. Its modular design makes it compatible with various foundational video models, which supports flexible integration and downstream applications. Its main limitation appears when a human motion conflicts with environmental constraints, which can produce visual artifacts. Even so, Uni3C is a significant advance in controllable video generation and a basis for more sophisticated, multi-modal content creation.

Key features include:
- Plug-and-play PCDController for precise 3D camera control using point clouds
- Unified global 3D world guidance for consistent camera and human motion control
- Complex motion transfer across different domains and video types
- Compatible with foundational video diffusion models for flexible integration
- Superior performance on challenging benchmarks and new datasets
- Modular design for independent or joint training of camera and human motion modules
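
As a rough illustration of how a point cloud can carry camera information, the sketch below projects a fixed set of world-space points into two camera poses with a pinhole model. It is only a geometric illustration: the function `project_points`, the intrinsics, and the poses are assumptions made for this example, and the source does not describe how PCDController consumes such projections.

```python
import numpy as np

def project_points(points_world, K, R_wc, t_wc, width, height):
    """Project world-space points to pixels with a pinhole camera.

    Hypothetical helper for illustration. R_wc and t_wc map world to
    camera coordinates: x_cam = R_wc @ x_world + t_wc.
    Returns pixel coordinates (M, 2) and depths (M,) of visible points.
    """
    pts_cam = np.asarray(points_world, dtype=float) @ R_wc.T + t_wc
    z = pts_cam[:, 2]
    in_front = z > 1e-6                      # drop points behind the camera
    uvw = pts_cam[in_front] @ K.T
    uv = uvw[:, :2] / uvw[:, 2:3]            # perspective divide
    inside = (
        (uv[:, 0] >= 0) & (uv[:, 0] < width) &
        (uv[:, 1] >= 0) & (uv[:, 1] < height)
    )
    return uv[inside], z[in_front][inside]

# Assumed intrinsics: focal length 100 px, principal point (50, 50).
K = np.array([
    [100.0,   0.0, 50.0],
    [  0.0, 100.0, 50.0],
    [  0.0,   0.0,  1.0],
])

# A tiny "scene": two points in front of the camera, one behind it.
scene = np.array([
    [0.0, 0.0,  2.0],
    [0.5, 0.0,  2.0],
    [0.0, 0.0, -1.0],
])

# Pose A: camera at the world origin. Pose B: camera shifted 0.5 along +x.
R_identity = np.eye(3)
uv_a, depth_a = project_points(scene, K, R_identity, np.zeros(3), 100, 100)
uv_b, depth_b = project_points(scene, K, R_identity, np.array([-0.5, 0.0, 0.0]), 100, 100)

print(uv_a)  # pixel positions under pose A
print(uv_b)  # the same points shift left in the image under pose B
```

The same 3D points land at different pixels as the camera moves, so a sequence of such projections encodes a camera trajectory in a form that is tied to the scene geometry.
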

