The method represents camera motion as a camera grid: a grid rendered in an otherwise empty 3D space from the camera poses of the reference video. During training, this camera grid is injected into an MMDiT alongside other control signals. At inference, a hierarchical prompt-expansion agent integrates the multimodal signals.
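The idea of rendering a grid from a sequence of camera poses can be illustrated with a small sketch. This is not the authors' renderer: the lattice layout, the pinhole projection, the pose convention (world-to-camera rotation `R` plus camera center `c`), and the names `make_lattice`, `render_camera_grid`, and `render_sequence` are all assumptions made for illustration. The source does not specify how the grid is built or encoded before it enters the MMDiT.

```python
import numpy as np

def make_lattice(extent=1.0, near=3.0, far=7.0, steps=5):
    """Hypothetical grid: a lattice of 3D points floating in empty space."""
    xs = np.linspace(-extent, extent, steps)
    ys = np.linspace(-extent, extent, steps)
    zs = np.linspace(near, far, steps)
    gx, gy, gz = np.meshgrid(xs, ys, zs, indexing="ij")
    return np.stack([gx.ravel(), gy.ravel(), gz.ravel()], axis=1)

def render_camera_grid(points, R, c, focal, size):
    """Project the points through an assumed pinhole camera into a binary image.

    R is a 3x3 world-to-camera rotation, c the camera center in world
    coordinates, focal the focal length in pixels, size = (height, width).
    """
    h, w = size
    cam = (points - c) @ R.T              # world -> camera coordinates
    cam = cam[cam[:, 2] > 1e-6]           # keep points in front of the camera
    u = focal * cam[:, 0] / cam[:, 2] + w / 2.0
    v = focal * cam[:, 1] / cam[:, 2] + h / 2.0
    ui = np.round(u).astype(int)
    vi = np.round(v).astype(int)
    inside = (ui >= 0) & (ui < w) & (vi >= 0) & (vi < h)
    frame = np.zeros((h, w), dtype=np.uint8)
    frame[vi[inside], ui[inside]] = 255   # mark each visible grid point
    return frame

def render_sequence(points, poses, focal, size):
    """Render one grid frame per (R, c) pose; returns an array of shape (T, H, W)."""
    return np.stack([render_camera_grid(points, R, c, focal, size) for R, c in poses])

# Example trajectory (an assumption): the camera dollies forward along +z.
lattice = make_lattice()
poses = [(np.eye(3), np.array([0.0, 0.0, z])) for z in np.linspace(0.0, 2.0, 8)]
frames = render_sequence(lattice, poses, focal=40.0, size=(64, 64))
```

Because the scene contains nothing but the grid, the way the projected points shift from frame to frame depends only on the camera trajectory, which is what makes such a rendering usable as a content-independent motion signal.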
OmniDirector is useful for video generation workflows that need to reproduce the cinematic camera language of a reference video, not just its object motion. It can reproduce aerial fly-throughs, descents, dolly zooms, bullet-time effects, and lens-distortion-like camera behavior while preserving the generated content.
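To see why a dolly zoom counts as camera language rather than object motion, the standard pinhole relation is enough: the projected width of an object is focal length times real width divided by distance, so scaling the focal length in proportion to the subject distance keeps the subject the same size on screen while the background changes size. The sketch below is a worked illustration of that relation only; the function names are hypothetical, and the source does not describe how OmniDirector handles focal-length changes.

```python
def projected_width(focal, width, distance):
    """Pinhole model: on-screen width of an object of real width `width`."""
    return focal * width / distance

def dolly_zoom_focal(f0, d0, d):
    """Focal length that keeps a subject at distance d the same on-screen
    size it had at distance d0 with focal length f0."""
    return f0 * d / d0

# Illustrative numbers (assumptions): the camera moves from 4 units to
# 2 units away from the subject; the background is 6 units behind it.
f_start, d_start, d_end, gap = 50.0, 4.0, 2.0, 6.0
f_end = dolly_zoom_focal(f_start, d_start, d_end)

subject_before = projected_width(f_start, 1.0, d_start)
subject_after = projected_width(f_end, 1.0, d_end)
background_before = projected_width(f_start, 1.0, d_start + gap)
background_after = projected_width(f_end, 1.0, d_end + gap)
```

With these numbers the subject keeps its on-screen width while the background object shrinks, which is the characteristic stretch of a dolly-in combined with a zoom-out.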


