Dynamic 3D Gaussian Tracking for Graph-Based Neural Dynamics Modeling
Mingtong Zhang, Kaifeng Zhang, Yunzhu Li
2024-10-28

Summary
This paper introduces a method for tracking and predicting how objects move in multi-view videos of robot-object interactions, using 3D Gaussian Splatting to represent and track the scene in 3D and a learned dynamics model to predict how objects respond to robot actions.
What's the problem?
Current video prediction methods often ignore the 3D information contained in videos, such as the robot's action trajectory and how objects move and deform during interactions. This limits their usefulness in real-world robotics, where understanding the full dynamics of objects is crucial for tasks like manipulation and control.
What's the solution?
The authors propose a framework that learns object dynamics directly from videos taken from multiple angles (multi-view RGB videos) while explicitly accounting for the robot's actions. They use 3D Gaussian Splatting to reconstruct and densely track the scene in 3D, downsample the tracked Gaussians into a sparse set of control particles, and train a graph neural network that predicts how these particles move under different initial configurations and robot actions. Because the motion of the dense Gaussians can be recovered from the control particles, the model can also render predicted future object states. They tested the method on various deformable materials, such as ropes and clothes, to demonstrate its effectiveness.
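To make the pipeline concrete: track dense Gaussians, downsample their centers to sparse control particles, connect nearby particles into a graph, and let a learned message-passing network predict each particle's motion given the robot action. The PyTorch sketch below is only a minimal illustration of that idea under simplifying assumptions; the farthest-point sampling, radius graph, and the ParticleGNN module are toy stand-ins, not the authors' actual architecture.

```python
import torch
import torch.nn as nn

def farthest_point_sample(points, k):
    """Greedy farthest-point sampling: pick k well-spread control particles
    from the dense set of tracked Gaussian centers (N, 3)."""
    n = points.shape[0]
    chosen = [0]
    dist = torch.full((n,), float("inf"))
    for _ in range(k - 1):
        dist = torch.minimum(dist, torch.norm(points - points[chosen[-1]], dim=1))
        chosen.append(int(torch.argmax(dist)))
    return points[chosen]                          # (k, 3)

def radius_graph(points, radius):
    """Connect control particles that lie within `radius` of each other."""
    d = torch.cdist(points, points)
    src, dst = torch.nonzero((d < radius) & (d > 0), as_tuple=True)
    return torch.stack([src, dst])                 # (2, E) edge list

class ParticleGNN(nn.Module):
    """One round of edge-to-node message passing that maps particle positions
    and the robot action to per-particle displacements (toy stand-in)."""
    def __init__(self, hidden=64):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(6, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.node_mlp = nn.Sequential(nn.Linear(hidden + 6, hidden), nn.ReLU(), nn.Linear(hidden, 3))

    def forward(self, pos, edges, action):
        src, dst = edges
        # Messages encode sender position and relative displacement along each edge.
        msg = self.edge_mlp(torch.cat([pos[src], pos[dst] - pos[src]], dim=-1))
        agg = torch.zeros(pos.shape[0], msg.shape[-1]).index_add_(0, dst, msg)
        act = action.expand(pos.shape[0], -1)      # broadcast end-effector motion to all nodes
        return pos + self.node_mlp(torch.cat([agg, pos, act], dim=-1))

# Toy rollout: dense Gaussian centers -> sparse control particles -> one predicted step.
gaussians = torch.rand(5000, 3)                    # placeholder for tracked Gaussian centers
particles = farthest_point_sample(gaussians, 100)
edges = radius_graph(particles, radius=0.2)
model = ParticleGNN()
next_particles = model(particles, edges, action=torch.tensor([[0.01, 0.0, 0.0]]))
```

In the real system the network would be trained on offline robot interaction data so that predicted particle motions match the tracked Gaussian motions; the sketch only shows the data flow of one forward step.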
Why it matters?
This research is important because it improves our ability to predict how objects respond to a robot's actions in a physically grounded way. More accurate, action-conditioned prediction of object dynamics can lead to robotic systems that plan and perform manipulation tasks more effectively, with applications across robotics, automation, and artificial intelligence.
Abstract
Videos of robots interacting with objects encode rich information about the objects' dynamics. However, existing video prediction approaches typically do not explicitly account for the 3D information from videos, such as robot actions and objects' 3D states, limiting their use in real-world robotic applications. In this work, we introduce a framework to learn object dynamics directly from multi-view RGB videos by explicitly considering the robot's action trajectories and their effects on scene dynamics. We utilize the 3D Gaussian representation of 3D Gaussian Splatting (3DGS) to train a particle-based dynamics model using Graph Neural Networks. This model operates on sparse control particles downsampled from the densely tracked 3D Gaussian reconstructions. By learning the neural dynamics model on offline robot interaction data, our method can predict object motions under varying initial configurations and unseen robot actions. The 3D transformations of Gaussians can be interpolated from the motions of control particles, enabling the rendering of predicted future object states and achieving action-conditioned video prediction. The dynamics model can also be applied to model-based planning frameworks for object manipulation tasks. We conduct experiments on various kinds of deformable materials, including ropes, clothes, and stuffed animals, demonstrating our framework's ability to model complex shapes and dynamics. Our project page is available at https://gs-dynamics.github.io.
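The abstract notes that the 3D transformations of the dense Gaussians can be interpolated from the motions of the sparse control particles. A simple way to picture this is a distance-weighted blend: each Gaussian inherits a displacement from its nearest control particles. The snippet below is a hedged sketch of that idea using inverse-distance weighting and a hypothetical interpolate_gaussian_motion helper; the paper's actual scheme may also interpolate rotations and scales, which are omitted here.

```python
import torch

def interpolate_gaussian_motion(gauss_pos, particles_t0, particles_t1, k=4, eps=1e-8):
    """Move each dense Gaussian center by a distance-weighted blend of the
    displacements of its k nearest control particles (translation only).
    gauss_pos:    (N, 3) Gaussian centers at time t0
    particles_t0: (M, 3) control particles at time t0
    particles_t1: (M, 3) control particles predicted at time t1
    """
    d = torch.cdist(gauss_pos, particles_t0)            # (N, M) distances
    dist, idx = d.topk(k, dim=1, largest=False)         # k nearest particles per Gaussian
    w = 1.0 / (dist + eps)
    w = w / w.sum(dim=1, keepdim=True)                  # normalized inverse-distance weights
    disp = particles_t1 - particles_t0                  # (M, 3) predicted particle motions
    return gauss_pos + (w.unsqueeze(-1) * disp[idx]).sum(dim=1)

# Usage: propagate the dense Gaussians with the motion predicted for the sparse particles,
# then render the updated Gaussians to obtain action-conditioned video prediction.
gaussians = torch.rand(5000, 3)
p0 = torch.rand(100, 3)
p1 = p0 + 0.01 * torch.randn_like(p0)                   # e.g. output of the dynamics model
gaussians_next = interpolate_gaussian_motion(gaussians, p0, p1)
```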