VerseCrafter consists of a frozen Wan2.1 backbone and a lightweight GeoAdapter that encodes the rendered 4D control maps and injects them into selected diffusion blocks. Because the backbone stays frozen, this design adds precise camera and multi-object motion control while keeping the generated videos sharp and geometrically coherent. The model is trained on the VerseControl4D dataset, which contains 35,000 training clips and 1,000 validation/test clips with complete geometric supervision.
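The frozen-backbone-plus-adapter pattern described above can be illustrated with a toy sketch in plain Python. This is not the VerseCrafter implementation: `FrozenBlock`, `ToyAdapter`, `run_backbone`, and the `inject_at` parameter are hypothetical names, the "blocks" are simple scale-and-shift functions rather than diffusion blocks, and initialising the adapter gain to zero is an assumption borrowed from common adapter practice, not something the source states.

```python
from dataclasses import dataclass
from typing import List, Sequence, Set

Vector = List[float]


@dataclass(frozen=True)
class FrozenBlock:
    """Hypothetical stand-in for one frozen backbone block (fixed scale and shift)."""
    scale: float
    shift: float

    def __call__(self, hidden: Vector) -> Vector:
        return [self.scale * h + self.shift for h in hidden]


@dataclass
class ToyAdapter:
    """Hypothetical stand-in for the adapter: maps control features to a residual."""
    gain: float = 0.0  # assumed zero init, so the adapter starts as a no-op

    def __call__(self, control: Vector) -> Vector:
        return [self.gain * c for c in control]


def run_backbone(
    hidden: Vector,
    control: Vector,
    blocks: Sequence[FrozenBlock],
    adapter: ToyAdapter,
    inject_at: Set[int],
) -> Vector:
    """Run all blocks, adding the adapter residual before each selected block."""
    residual = adapter(control)
    for index, block in enumerate(blocks):
        if index in inject_at:
            hidden = [h + r for h, r in zip(hidden, residual)]
        hidden = block(hidden)
    return hidden


blocks = [FrozenBlock(2.0, 0.0), FrozenBlock(1.0, 1.0), FrozenBlock(0.5, 0.0)]
hidden, control = [1.0, 2.0], [1.0, -1.0]

# With gain 0.0 the output equals that of the unmodified backbone.
baseline = run_backbone(hidden, control, blocks, ToyAdapter(0.0), inject_at={1})
# With a non-zero gain the control signal steers the result.
steered = run_backbone(hidden, control, blocks, ToyAdapter(1.0), inject_at={1})
```

The point of the sketch is the separation of concerns: the blocks never change, and only the adapter's parameters would be trained, so control is added without disturbing what the backbone already does well.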
VerseCrafter offers flexible 4D geometric control with three modes: camera-only, object-only, and joint control. It also provides an interactive 4D control interface in Blender, where users can design custom camera trajectories and 3D Gaussian object trajectories. These trajectories are exported as control maps, which VerseCrafter uses for geometry-consistent, controllable video generation. The resulting videos show consistent multi-view world dynamics in which camera and object motions stay aligned.
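As a minimal sketch of how the three control modes relate to the inputs a user supplies, the snippet below infers the mode from which trajectories are present. The names `ControlMode`, `ControlRequest`, and `resolve_mode` are hypothetical and chosen for illustration; they are not part of any published VerseCrafter API, and the trajectory contents are left as opaque placeholders.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class ControlMode(Enum):
    CAMERA_ONLY = "camera_only"
    OBJECT_ONLY = "object_only"
    JOINT = "joint"


@dataclass(frozen=True)
class ControlRequest:
    """Hypothetical container for user-designed trajectories."""
    camera_trajectory: Optional[Sequence] = None    # e.g. per-frame camera poses
    object_trajectories: Optional[Sequence] = None  # e.g. per-object 3D Gaussian paths


def resolve_mode(request: ControlRequest) -> ControlMode:
    """Pick the control mode implied by the trajectories that were provided."""
    has_camera = bool(request.camera_trajectory)
    has_objects = bool(request.object_trajectories)
    if has_camera and has_objects:
        return ControlMode.JOINT
    if has_camera:
        return ControlMode.CAMERA_ONLY
    if has_objects:
        return ControlMode.OBJECT_ONLY
    raise ValueError("Provide a camera trajectory, object trajectories, or both.")


mode = resolve_mode(ControlRequest(camera_trajectory=["pose_0", "pose_1"]))
```

Deriving the mode from the inputs, rather than asking for it separately, avoids inconsistent requests such as "joint" control with no object trajectories.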


