DIMO: Diverse 3D Motion Generation for Arbitrary Objects
Linzhan Mou, Jiahui Lei, Chen Wang, Lingjie Liu, Kostas Daniilidis
2025-11-11
Summary
This paper introduces DIMO, a new computer graphics technique that can create realistic 3D movements for any object, starting from just a single image.
What's the problem?
Normally, making a 3D object move realistically requires a lot of work, like painstakingly animating it frame by frame. It's hard to get diverse and natural motions, and it often needs a lot of input data, like multiple videos of the object moving in different ways. The challenge is to create believable 3D motion from limited information, like just one picture.
What's the solution?
The researchers used a clever trick: they took advantage of existing video models that have already 'learned' how things generally move. They generated many different possible motions for the object, then condensed these motions into a simpler mathematical representation, a compact latent space of motion codes. From this representation the computer can quickly create new, varied motions. Finally, they used this motion information to drive the shape and appearance of the 3D object, making it look like it's actually moving. Essentially, they're borrowing knowledge from existing videos to make new motions.
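The pipeline described above can be sketched in code. This is an illustrative toy only: the dimensions, the random linear "decoder", and the distance-based skinning weights are all assumptions for demonstration, not DIMO's actual learned components. It shows the interface implied by the text: a latent motion code decodes to key point trajectories, which then displace canonical 3D Gaussians.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (illustrative only, not from the paper):
# latent dim D, key points K, frames T, Gaussians G.
D, K, T, G = 8, 12, 16, 200

# Stand-in "motion decoder": a fixed random linear map from a latent code
# to key point trajectories of shape (T, K, 3). In the paper this is a
# learned shared neural decoder; here it only illustrates the interface.
W = 0.01 * rng.normal(size=(D, T * K * 3))

def decode_motion(z):
    """Map a latent motion code to key point trajectories of shape (T, K, 3)."""
    return (z @ W).reshape(T, K, 3)

# Canonical 3D Gaussian centers, plus soft weights tying each Gaussian to
# nearby rest-pose key points (distance-based here; rows sum to 1).
canonical_centers = rng.normal(size=(G, 3))
key_points_rest = rng.normal(size=(K, 3))
dists = np.linalg.norm(canonical_centers[:, None] - key_points_rest[None], axis=-1)
weights = np.exp(-dists)
weights /= weights.sum(axis=1, keepdims=True)          # (G, K)

def drive_gaussians(z):
    """Displace canonical Gaussians by blending decoded key point offsets."""
    traj = decode_motion(z)                            # (T, K, 3)
    offsets = np.einsum('gk,tkd->tgd', weights, traj)  # (T, G, 3)
    return canonical_centers[None] + offsets           # animated centers per frame

# Sampling a new latent code yields a new 3D motion in one forward pass.
frames = drive_gaussians(rng.normal(size=D))           # (T, G, 3)
```

The key design point the sketch mirrors: once the decoder and latent space exist, generating a new motion is just drawing a new `z`, with no per-motion optimization.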
Why it matters?
This is important because it makes creating 3D animations much easier and faster. Instead of needing tons of data or hours of manual animation, you can get realistic movement from a single image. This could be useful for things like video games, special effects in movies, or even creating virtual reality experiences where objects need to move naturally.
Abstract
We present DIMO, a generative approach capable of producing diverse 3D motions for arbitrary objects from a single image. The core idea of our work is to leverage the rich priors in well-trained video models to extract common motion patterns and embed them into a shared low-dimensional latent space. Specifically, we first generate multiple videos of the same object with diverse motions. We then embed each motion into a latent vector and train a shared motion decoder to learn the distribution of motions, represented by a structured and compact motion representation, i.e., neural key point trajectories. The canonical 3D Gaussians are driven by these key points and fused to model the geometry and appearance. At inference time, with the learned latent space, we can instantly sample diverse 3D motions in a single forward pass and support several interesting applications, including 3D motion interpolation and language-guided motion generation. Our project page is available at https://linzhanm.github.io/dimo.
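One application the abstract mentions, 3D motion interpolation, follows directly from having a shared latent space: intermediate latent codes decode to in-between motions. A minimal sketch, assuming simple linear interpolation between two latent codes (the latent dimension and interpolation scheme here are illustrative, not taken from the paper):

```python
import numpy as np

# Hypothetical latent motion codes for two previously sampled motions.
z_a = np.full(8, -1.0)
z_b = np.full(8, 1.0)

# Linearly interpolate in the learned latent space; each intermediate code
# would be passed through the shared motion decoder to obtain an
# in-between key point trajectory.
alphas = np.linspace(0.0, 1.0, 5)
z_interp = np.stack([(1 - a) * z_a + a * z_b for a in alphas])  # (5, 8)
```

The endpoints of the interpolation recover the original two codes, so the interpolated motions smoothly bridge the two sampled motions.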