Agent-to-Sim: Learning Interactive Behavior Models from Casual Longitudinal Videos

Gengshan Yang, Andrea Bajcsy, Shunsuke Saito, Angjoo Kanazawa

2024-10-22

Summary

This paper presents Agent-to-Sim (ATS), a framework that learns how 3D agents, such as humans and animals, behave by analyzing casual videos recorded in a single environment over a long time span (e.g., a month).

What's the problem?

Understanding and simulating the behavior of agents in 3D environments is challenging because traditional methods rely on complex setups with multiple cameras or physical tracking markers. Such setups are invasive and impractical for studying natural behavior over long periods, so there is a need for a more straightforward way to learn these behaviors from ordinary video recordings without extra equipment.

What's the solution?

To tackle this issue, the authors developed ATS, which learns from casual videos collected over a long period (e.g., a month). A coarse-to-fine registration method tracks both the agent and the camera through a shared canonical 3D space, producing a complete and persistent 4D (space plus time) reconstruction of the agent and its environment. From this reconstruction, the framework queries paired examples of what the agent perceives and how it moves, and uses them to train a generative behavior model, enabling real-to-sim transfer of real-world actions into an interactive virtual simulation.
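To make the registration stage concrete, here is a minimal sketch, assuming simple point-cloud inputs, of the coarse-to-fine idea: a coarse rigid Procrustes/Kabsch fit followed by a few ICP-style refinement iterations against the canonical shape. The function names and toy data are illustrative only; ATS itself performs this alignment with learned neural representations rather than this classical scheme.

```python
# A minimal sketch (not the authors' implementation) of coarse-to-fine
# registration: a coarse rigid Procrustes/Kabsch fit on corresponded
# point clouds, refined by a few ICP iterations against the canonical shape.
import numpy as np
from scipy.spatial import cKDTree

def kabsch(src, dst):
    """Best-fit rotation R and translation t such that dst ~= src @ R.T + t."""
    src_c, dst_c = src.mean(0), dst.mean(0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))        # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, dst_c - R @ src_c

def register_coarse_to_fine(obs, canonical, n_icp=10):
    # Coarse: one rigid fit, assuming rough point correspondences are
    # available (e.g., from a detector); ATS instead learns this alignment.
    R, t = kabsch(obs, canonical)
    aligned = obs @ R.T + t
    # Fine: alternate nearest-neighbor matching and re-fitting (ICP).
    tree = cKDTree(canonical)
    for _ in range(n_icp):
        _, idx = tree.query(aligned)
        R_d, t_d = kabsch(aligned, canonical[idx])
        aligned = aligned @ R_d.T + t_d
    return aligned

# Toy check: recover a known rigid transform of a random cloud.
rng = np.random.default_rng(0)
canonical = rng.normal(size=(500, 3))
c, s = np.cos(0.3), np.sin(0.3)
Rz = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
obs = canonical @ Rz.T + np.array([0.5, -0.2, 0.1])
print(np.abs(register_coarse_to_fine(obs, canonical) - canonical).max())
```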

Why it matters?

This research is significant because it allows for realistic simulations of agent behavior using easily obtainable video data. By making it possible to study and replicate how pets or people interact in their environments, this framework can be applied in various fields such as robotics, gaming, and virtual reality. It opens up new possibilities for creating interactive experiences that closely mimic real-life behavior.

Abstract

We present Agent-to-Sim (ATS), a framework for learning interactive behavior models of 3D agents from casual longitudinal video collections. Different from prior works that rely on marker-based tracking and multiview cameras, ATS learns natural behaviors of animal and human agents non-invasively through video observations recorded over a long time-span (e.g., a month) in a single environment. Modeling 3D behavior of an agent requires persistent 3D tracking (e.g., knowing which point corresponds to which) over a long time period. To obtain such data, we develop a coarse-to-fine registration method that tracks the agent and the camera over time through a canonical 3D space, resulting in a complete and persistent spacetime 4D representation. We then train a generative model of agent behaviors using paired data of perception and motion of an agent queried from the 4D reconstruction. ATS enables real-to-sim transfer from video recordings of an agent to an interactive behavior simulator. We demonstrate results on pets (e.g., cat, dog, bunny) and humans, given monocular RGBD videos captured by a smartphone.
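As a rough illustration of the second stage described in the abstract, the sketch below trains a small conditional generative model on synthetic stand-ins for the paired perception and motion data queried from the 4D reconstruction, then samples new motion conditioned on perception. The conditional VAE, its dimensions, and the loss weights are assumptions for illustration; the abstract does not specify the generative model's architecture.

```python
# A minimal sketch of the second stage: a conditional generative model
# trained on paired (perception, motion) data. The conditional VAE below,
# its dimensions, and the synthetic tensors are assumptions for
# illustration; the paper does not commit to this architecture here.
import torch
import torch.nn as nn

PERC_DIM, MOTION_DIM, LATENT = 64, 30, 8      # illustrative sizes

class ConditionalVAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Linear(PERC_DIM + MOTION_DIM, 128), nn.ReLU(),
            nn.Linear(128, 2 * LATENT))        # outputs (mu, logvar)
        self.dec = nn.Sequential(
            nn.Linear(PERC_DIM + LATENT, 128), nn.ReLU(),
            nn.Linear(128, MOTION_DIM))

    def forward(self, perception, motion):
        mu, logvar = self.enc(torch.cat([perception, motion], -1)).chunk(2, -1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        return self.dec(torch.cat([perception, z], -1)), mu, logvar

    @torch.no_grad()
    def sample(self, perception):
        z = torch.randn(perception.shape[0], LATENT)
        return self.dec(torch.cat([perception, z], -1))

model = ConditionalVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
perception = torch.randn(256, PERC_DIM)       # stand-in for queried features
motion = torch.randn(256, MOTION_DIM)         # stand-in for agent motion clips
for step in range(100):
    recon, mu, logvar = model(perception, motion)
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
    loss = (recon - motion).pow(2).sum(-1).mean() + 1e-3 * kl
    opt.zero_grad(); loss.backward(); opt.step()
```

At simulation time, repeatedly calling model.sample(perception) would yield varied, perception-conditioned motions, which is the kind of interactive-simulator interface that real-to-sim transfer implies.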