Bringing Objects to Life: 4D generation from 3D objects
Ohad Rahamim, Ori Malca, Dvir Samuel, Gal Chechik
2024-12-31
Summary
This paper presents a method for creating 4D animations from user-provided 3D objects using text prompts, letting users control how the objects move while preserving their original appearance.
What's the problem?
Although recent methods can generate 4D content (moving 3D objects) from static 3D models, they offer limited control over how the objects look and move. This often results in repetitive motion or visual artifacts, making it difficult to produce unique, realistic animations of a specific object.
What's the solution?
To solve these problems, the authors first convert the input 3D object into a "static" 4D Neural Radiance Field (NeRF) that preserves its appearance. They then animate the object according to a text prompt using an Image-to-Video diffusion model. To make the motion more realistic, they sample camera viewpoints incrementally during optimization and use a masked Score Distillation Sampling (SDS) loss, which relies on attention maps to focus the optimization on the regions of the object the prompt refers to. The result is a custom animation that preserves the identity of the original object.
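A rough sketch of how such a two-stage pipeline could be organized is shown below. It is only an illustration of the steps described above, not the authors' implementation: every name in it (fit_static_4d_nerf, animate_with_masked_sds, the renderer and i2v_model objects, viewpoint_schedule) is a hypothetical placeholder.

```python
# Hypothetical sketch of the two-stage "3D object -> animated 4D NeRF" pipeline.
# None of these helpers come from the paper's codebase; they stand in for the
# described steps: (1) fit a static 4D NeRF to the input mesh, (2) animate it
# with text-conditioned Image-to-Video guidance using a masked SDS loss.

import torch

def fit_static_4d_nerf(mesh, nerf, optimizer, renderer, cameras, num_steps=2000):
    """Stage 1: fit a time-constant 4D NeRF to renders of the input mesh so the
    object's appearance and geometry are preserved before any animation."""
    for step in range(num_steps):
        cam = cameras[step % len(cameras)]
        target = renderer.render_mesh(mesh, cam)        # ground-truth RGB render
        pred = renderer.render_nerf(nerf, cam, t=0.0)   # NeRF held static in time
        loss = torch.nn.functional.mse_loss(pred, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

def animate_with_masked_sds(nerf, optimizer, renderer, i2v_model, prompt,
                            viewpoint_schedule, num_steps=5000):
    """Stage 2: optimize the NeRF's temporal behavior with a masked SDS loss
    from an Image-to-Video diffusion model, sampling viewpoints incrementally."""
    for step in range(num_steps):
        cam = viewpoint_schedule(step)                  # incremental viewpoint selection
        frames = renderer.render_video(nerf, cam)       # tensor of shape (T, 3, H, W)
        # Assume the diffusion wrapper returns an SDS-style gradient plus
        # attention maps indicating which regions the prompt refers to.
        sds_grad, attn_maps = i2v_model.score_distillation(frames, prompt)
        mask = (attn_maps > attn_maps.mean()).float()   # keep only prompt-relevant regions
        frames.backward(gradient=mask * sds_grad)       # masked SDS update
        optimizer.step()
        optimizer.zero_grad()
```

The point of the sketch is the division of labor: the first stage locks in the object's identity, and the second stage optimizes only its motion under text-driven video guidance.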
Why it matters?
This research matters because it makes it easier for creators in fields like gaming, virtual reality, and media to produce high-quality animations of existing 3D assets. By giving users more control over both appearance and motion, the method can lead to more engaging and realistic experiences in digital environments.
Abstract
Recent advancements in generative modeling now enable the creation of 4D content (moving 3D objects) controlled with text prompts. 4D generation has large potential in applications like virtual worlds, media, and gaming, but existing methods provide limited control over the appearance and geometry of generated content. In this work, we introduce a method for animating user-provided 3D objects by conditioning on textual prompts to guide 4D generation, enabling custom animations while maintaining the identity of the original object. We first convert a 3D mesh into a "static" 4D Neural Radiance Field (NeRF) that preserves the visual attributes of the input object. Then, we animate the object using an Image-to-Video diffusion model driven by text. To improve motion realism, we introduce an incremental viewpoint selection protocol for sampling perspectives to promote lifelike movement and a masked Score Distillation Sampling (SDS) loss, which leverages attention maps to focus optimization on relevant regions. We evaluate our model in terms of temporal coherence, prompt adherence, and visual fidelity and find that our method outperforms baselines that are based on other approaches, achieving up to threefold improvements in identity preservation measured using LPIPS scores, and effectively balancing visual quality with dynamic content.
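For readers who want the masked SDS term in symbols, one plausible form is obtained by adding an attention-derived mask to the standard SDS gradient of DreamFusion (Poole et al., 2022). This is a reconstruction from the abstract's description, not the paper's exact formulation; the weighting w(t) and the mask construction M are assumptions.

```latex
\nabla_\theta \mathcal{L}_{\text{masked-SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[\, w(t)\; M \odot
      \big(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\big)\,
      \frac{\partial x}{\partial \theta} \right]
```

Here x denotes the video frames rendered from the 4D NeRF with parameters theta, x_t is their noised version at diffusion timestep t, \hat{\epsilon}_\phi is the noise predicted by the text-conditioned Image-to-Video diffusion model given prompt y, w(t) is a timestep weighting, and M is a mask derived from the model's attention maps so that gradients concentrate on prompt-relevant regions; setting M to all ones recovers the standard SDS loss.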