Captain Safari: A World Engine

Yu-Cheng Chou, Xingrui Wang, Yitong Li, Jiahao Wang, Hanting Liu, Cihang Xie, Alan Yuille, Junfei Xiao

2025-12-01

Summary

This paper introduces a new system called Captain Safari that creates realistic, long videos of 3D scenes in which you can freely move the camera around, as in a video game or drone footage.

What's the problem?

Existing systems for creating these kinds of videos struggle when you ask them to follow complicated camera paths, especially in large outdoor environments. They often create videos where things don't line up correctly in 3D, the camera doesn't actually follow the path you want, or the movements are very limited and cautious to avoid errors.

What's the solution?

Captain Safari solves this by using a 'world memory': it stores pieces of the scene and, as the camera moves, quickly retrieves the parts that match the current viewpoint. Those retrieved pieces then condition the video generation along the trajectory, keeping the 3D world consistent and the camera on the requested path. The authors also created a new dataset of drone videos, called OpenSafari, to test these kinds of systems.
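To make the idea concrete, here is a minimal toy sketch of pose-keyed memory retrieval. This is not the paper's implementation; the `WorldMemory` class, the 3-D position keys, and the nearest-pose lookup are all illustrative assumptions standing in for the paper's learned retriever over pose-aligned world tokens.

```python
import numpy as np

class WorldMemory:
    """Toy pose-keyed memory (illustrative, not the paper's method):
    stores feature tokens tagged with camera positions and retrieves
    the k entries whose poses are nearest to a query pose."""

    def __init__(self):
        self.poses = []    # camera positions, each shape (3,)
        self.tokens = []   # feature tokens, each shape (d,)

    def add(self, pose, token):
        self.poses.append(np.asarray(pose, dtype=float))
        self.tokens.append(np.asarray(token, dtype=float))

    def retrieve(self, query_pose, k=2):
        # Rank stored entries by Euclidean distance to the query pose
        # and return the k closest tokens for conditioning generation.
        dists = [np.linalg.norm(p - query_pose) for p in self.poses]
        order = np.argsort(dists)[:k]
        return [self.tokens[i] for i in order]

# Populate memory along a straight camera path, then query nearby.
memory = WorldMemory()
for x in range(5):
    memory.add([x, 0.0, 0.0], [float(x)] * 4)  # dummy 4-d token per pose

conditioning = memory.retrieve(np.array([2.2, 0.0, 0.0]), k=2)
print([t[0] for t in conditioning])  # -> [2.0, 3.0], the two nearest poses
```

In the actual system the "poses" would be full 6-DoF camera parameters and the retrieval would be learned rather than a plain distance lookup, but the control flow is the same: memory in, pose query, conditioning tokens out.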

Why it matters?

This work is important because it makes creating realistic and interactive 3D videos much more achievable. It opens the door for better virtual reality experiences, more realistic simulations, and improved ways to explore environments remotely, like using drones. The new dataset also provides a standard way to compare and improve future video generation systems.

Abstract

World engines aim to synthesize long, 3D-consistent videos that support interactive exploration of a scene under user-controlled camera motion. However, existing systems struggle under aggressive 6-DoF trajectories and complex outdoor layouts: they lose long-range geometric coherence, deviate from the target path, or collapse into overly conservative motion. To this end, we introduce Captain Safari, a pose-conditioned world engine that generates videos by retrieving from a persistent world memory. Given a camera path, our method maintains a dynamic local memory and uses a retriever to fetch pose-aligned world tokens, which then condition video generation along the trajectory. This design enables the model to maintain stable 3D structure while accurately executing challenging camera maneuvers. To evaluate this setting, we curate OpenSafari, a new in-the-wild FPV dataset containing high-dynamic drone videos with verified camera trajectories, constructed through a multi-stage geometric and kinematic validation pipeline. Across video quality, 3D consistency, and trajectory following, Captain Safari substantially outperforms state-of-the-art camera-controlled generators. It reduces MEt3R from 0.3703 to 0.3690, improves AUC@30 from 0.181 to 0.200, and yields substantially lower FVD than all camera-controlled baselines. More importantly, in a 50-participant, 5-way human study where annotators select the best result among five anonymized models, 67.6% of preferences favor our method across all axes. Our results demonstrate that pose-conditioned world memory is a powerful mechanism for long-horizon, controllable video generation and provide OpenSafari as a challenging new benchmark for future world-engine research.