Animate Any Character in Any World
Yitong Wang, Fangyun Wei, Hongyang Zhang, Bo Dai, Yan Lu
2025-12-22
Summary
This paper introduces a new system called AniX that creates realistic videos of characters interacting with 3D environments based on simple text instructions.
What's the problem?
Creating realistic simulations of characters in 3D worlds is currently difficult. Existing methods either build static environments with no one moving around, or allow only very limited control of a single character. There has been no easy way to tell a character to do a wide variety of things in a detailed 3D scene and have the result look believable over time.
What's the solution?
The researchers developed AniX, which combines the strengths of both existing approaches. It starts with a detailed 3D scene and a character, then uses natural language commands to direct the character's actions. AniX generates a video of the character performing those actions, ensuring the movements look natural and the video stays consistent with the original scene and character. The team fine-tuned a pre-trained video generator to make the character's movements more dynamic while keeping them adaptable to different actions and characters.
Why does it matter?
This work is important because it makes it easier to create realistic and interactive simulations. This could be useful for things like creating training simulations, designing video games, or even developing virtual assistants that can operate in complex environments. It's a step towards more natural and intuitive ways to interact with virtual worlds.
Abstract
Recent advances in world models have greatly enhanced interactive environment simulation. Existing methods mainly fall into two categories: (1) static world generation models, which construct 3D environments without active agents, and (2) controllable-entity models, which allow a single entity to perform limited actions in an otherwise uncontrollable environment. In this work, we introduce AniX, which leverages the realism and structural grounding of static world generation while extending controllable-entity models to support user-specified characters capable of performing open-ended actions. Users can provide a 3DGS scene and a character, then direct the character through natural language to perform diverse behaviors, from basic locomotion to object-centric interactions, while freely exploring the environment. AniX synthesizes temporally coherent video clips that preserve visual fidelity with the provided scene and character, a task we formulate as a conditional autoregressive video generation problem. Built upon a pre-trained video generator, our training strategy significantly enhances motion dynamics while maintaining generalization across actions and characters. Our evaluation covers a broad range of aspects, including visual quality, character consistency, action controllability, and long-horizon coherence.
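The conditional autoregressive formulation in the abstract can be sketched as a simple loop: each new clip is generated conditioned on the scene, the character, the current instruction, and the tail frames of the previous clip, so consecutive clips stay temporally coherent. The sketch below is illustrative only; every name (`generate_clip`, `rollout`, the string-valued "frames") is a hypothetical stand-in, not the paper's actual model or API.

```python
# Minimal sketch of conditional autoregressive clip generation.
# All names here are hypothetical stand-ins for illustration,
# not AniX's real interface.
from dataclasses import dataclass


@dataclass
class Clip:
    frames: list  # placeholder for decoded video frames


def generate_clip(scene, character, instruction, context_frames):
    """Stand-in for the video generator: returns a new clip conditioned
    on the scene, character, instruction, and the last frames of the
    previously generated clip (the autoregressive context)."""
    # Real frames would be images; strings just record the conditioning.
    frames = [f"{character}@{scene}:{instruction}#{i}" for i in range(4)]
    return Clip(frames=context_frames[-1:] + frames)


def rollout(scene, character, instructions, context_len=1):
    """Autoregressive rollout: each clip begins from the previous clip's
    final frames, chaining short clips into one long coherent video."""
    video, context = [], ["<initial scene render>"]
    for instr in instructions:
        clip = generate_clip(scene, character, instr, context)
        video.extend(clip.frames)
        context = clip.frames[-context_len:]  # carry context forward
    return video


video = rollout("garden", "knight", ["walk forward", "pick up the sword"])
```

Here the overlap between clips (`context_len` carried-over frames) is what keeps the long-horizon video consistent; in a real system the conditioning would also include the 3DGS scene representation and a character reference, as the abstract describes.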