The N-Body Problem: Parallel Execution from Single-Person Egocentric Video

Zhifan Zhu, Yifei Huang, Yoichi Sato, Dima Damen

2025-12-15

Summary

This paper explores whether an AI can watch a video of one person performing a set of tasks and figure out how to split those tasks among multiple people, so they could be done at the same time and finished faster.

What's the problem?

Imagine watching someone make a sandwich. It seems simple, but if you wanted two people to make the same sandwich *at the same time*, you'd run into problems. Both people can't grab the bread at once, or use the knife simultaneously. The challenge is to take a single video of a task and figure out how to divide it among multiple people without creating impossible situations like collisions or conflicts over objects.

What's the solution?

The researchers created a way to ask a powerful AI model (a Vision-Language Model) to think through the video like a 3D planner. They gave the AI specific instructions to consider the environment, how objects are used, and the order things need to happen in. This helps the AI create a realistic plan for multiple people to work together, avoiding those impossible scenarios.
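To make the idea of "avoiding impossible scenarios" concrete, here is a minimal sketch of how one could check a proposed parallel plan for the kind of object conflicts the paper's metrics measure. The `Segment` class, field names, and `object_conflicts` function are illustrative assumptions, not the authors' actual implementation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Segment:
    start: float        # scheduled start time in seconds
    end: float          # scheduled end time in seconds
    objects: frozenset  # objects this action uses (e.g. "knife")
    person: int         # which of the N people performs it

def overlaps(a: Segment, b: Segment) -> bool:
    """True if the two segments run at the same time."""
    return a.start < b.end and b.start < a.end

def object_conflicts(plan: list[Segment]) -> int:
    """Count pairs of simultaneous segments where two different
    people would need the same object at once."""
    conflicts = 0
    for i, a in enumerate(plan):
        for b in plan[i + 1:]:
            if a.person != b.person and overlaps(a, b) and (a.objects & b.objects):
                conflicts += 1
    return conflicts

# Two people both need the knife from t=3 to t=5: one conflict.
plan = [
    Segment(0, 5, frozenset({"knife", "bread"}), person=0),
    Segment(3, 8, frozenset({"knife"}), person=1),
    Segment(3, 8, frozenset({"pan"}), person=0),
]
print(object_conflicts(plan))  # -> 1
```

A feasible plan drives counts like this (and analogous checks for spatial collisions and causal ordering) to zero while still finishing sooner than the original single-person video.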

Why it matters?

This research is important because humans are naturally good at understanding how to work together and split up tasks. If we can teach AI to do the same, it could lead to robots or virtual assistants that can collaborate with people more effectively, making things like cooking, building, or even performing surgery much faster and more efficient.

Abstract

Humans can intuitively parallelise complex activities, but can a model learn this from observing a single person? Given one egocentric video, we introduce the N-Body Problem: how N individuals can hypothetically perform the same set of tasks observed in this video. The goal is to maximise speed-up, but naive assignment of video segments to individuals often violates real-world constraints, leading to physically impossible scenarios like two people using the same object or occupying the same space. To address this, we formalise the N-Body Problem and propose a suite of metrics to evaluate both performance (speed-up, task coverage) and feasibility (spatial collisions, object conflicts and causal constraints). We then introduce a structured prompting strategy that guides a Vision-Language Model (VLM) to reason about the 3D environment, object usage, and temporal dependencies to produce a viable parallel execution. On 100 videos from EPIC-Kitchens and HD-EPIC, our method for N = 2 boosts action coverage by 45% over a baseline prompt for Gemini 2.5 Pro, while simultaneously slashing collision rates, object conflicts and causal conflicts by 55%, 45% and 55% respectively.