MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos
Kehong Gong, Zhengyu Wen, Weixia He, Mingxi Xu, Qi Wang, Ning Zhang, Zhengyu Li, Dongze Lian, Wei Zhao, Xiaoyu He, Mingyuan Zhang
2025-12-12
Summary
This paper introduces MoCapAnything, a method that creates 3D skeletal animations from ordinary monocular videos, with a key difference: it can drive *any* rigged 3D character you give it, not just the ones it was specifically trained on.
What's the problem?
Most motion capture pipelines either require specialized hardware or are built for one particular kind of character. If you have a unique 3D model, getting it to realistically mimic the motion in a video is difficult because existing systems are tied to a specific species or body template, typically humans, and don't easily adapt to new 'bodies'.
What's the solution?
MoCapAnything tackles this by factorizing the process into stages. First, it analyzes the video to understand the movement, extracting dense visual features and a coarse estimate of the deforming 3D shape over time. Then it uses information about the 3D character you provide, namely its skeleton, mesh, and rendered images, to decide how that movement should map onto the rig. It predicts where each joint should be over time and then uses inverse kinematics to recover the rotations of the character's bones, producing a natural-looking animation (a rough, illustrative sketch of this flow is shown below). The authors also curate a dataset called Truebones Zoo, containing 1038 motion clips, to train and evaluate the system.
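To make the factorized design concrete, here is a minimal, illustrative PyTorch sketch of how per-joint queries derived from the asset could be fused with per-frame video features to predict 3D joint trajectories. All class names, shapes, and layer choices here are assumptions for illustration, not the authors' actual implementation.

```python
# Hypothetical sketch of the reference-guided, factorized idea described above.
# ReferencePromptEncoder / UnifiedMotionDecoder are placeholder names; the real
# modules and feature dimensions in the paper may differ substantially.

import torch
import torch.nn as nn


class ReferencePromptEncoder(nn.Module):
    """Projects per-joint asset features (pooled from skeleton/mesh/renders) into queries."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(feat_dim, feat_dim)

    def forward(self, joint_feats: torch.Tensor) -> torch.Tensor:
        # joint_feats: (num_joints, feat_dim), a stand-in for the asset's per-joint cues
        return self.proj(joint_feats)                        # (num_joints, feat_dim)


class UnifiedMotionDecoder(nn.Module):
    """Cross-attends per-joint queries to per-frame video features and predicts 3D joints."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(feat_dim, 3)

    def forward(self, joint_queries: torch.Tensor, video_feats: torch.Tensor) -> torch.Tensor:
        # joint_queries: (num_joints, feat_dim); video_feats: (num_frames, tokens, feat_dim)
        num_frames = video_feats.shape[0]
        q = joint_queries.unsqueeze(0).expand(num_frames, -1, -1)
        fused, _ = self.attn(q, video_feats, video_feats)    # (num_frames, num_joints, feat_dim)
        return self.head(fused)                              # (num_frames, num_joints, 3)


if __name__ == "__main__":
    num_joints, num_frames, feat_dim = 24, 16, 256
    joint_feats = torch.randn(num_joints, feat_dim)          # stand-in for skeleton/mesh/render cues
    video_feats = torch.randn(num_frames, 196, feat_dim)     # stand-in for dense visual descriptors
    queries = ReferencePromptEncoder(feat_dim)(joint_feats)
    trajectories = UnifiedMotionDecoder(feat_dim)(queries, video_feats)
    print(trajectories.shape)  # torch.Size([16, 24, 3]); rotations would then come from IK
```

In the real system, the reference encoder also ingests the asset's mesh and rendered images and the decoder enforces temporal coherence; this sketch only shows the query-to-video cross-attention idea.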
Why it matters?
This research is important because it makes 3D animation much more accessible and flexible. Instead of needing expensive motion capture suits or being limited to pre-made animations, creators can now potentially animate any 3D character simply by providing a video. This opens up possibilities for game development, filmmaking, and other areas where custom animation is crucial, and it allows for easy transfer of motion between very different types of characters.
Abstract
Motion capture now underpins content creation far beyond digital humans, yet most existing pipelines remain species- or template-specific. We formalize this gap as Category-Agnostic Motion Capture (CAMoCap): given a monocular video and an arbitrary rigged 3D asset as a prompt, the goal is to reconstruct a rotation-based animation such as BVH that directly drives the specific asset. We present MoCapAnything, a reference-guided, factorized framework that first predicts 3D joint trajectories and then recovers asset-specific rotations via constraint-aware inverse kinematics. The system contains three learnable modules and a lightweight IK stage: (1) a Reference Prompt Encoder that extracts per-joint queries from the asset's skeleton, mesh, and rendered images; (2) a Video Feature Extractor that computes dense visual descriptors and reconstructs a coarse 4D deforming mesh to bridge the gap between video and joint space; and (3) a Unified Motion Decoder that fuses these cues to produce temporally coherent trajectories. We also curate Truebones Zoo with 1038 motion clips, each providing a standardized skeleton-mesh-render triad. Experiments on both in-domain benchmarks and in-the-wild videos show that MoCapAnything delivers high-quality skeletal animations and exhibits meaningful cross-species retargeting across heterogeneous rigs, enabling scalable, prompt-driven 3D motion capture for arbitrary assets. Project page: https://animotionlab.github.io/MoCapAnything/
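As a toy illustration of the final inverse-kinematics stage (not the paper's constraint-aware solver), the snippet below recovers, for each bone, a rotation that aligns its rest-pose direction with the direction implied by the predicted joint positions. Function and variable names are hypothetical.

```python
# Toy per-bone rotation recovery from predicted joint trajectories (one frame).
# This is NOT the paper's constraint-aware IK; it only shows the basic idea of
# turning predicted joint positions into rotations that drive a rig.

import numpy as np


def rotation_between(u: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Rodrigues' formula: 3x3 rotation matrix taking direction u onto direction v."""
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    axis = np.cross(u, v)
    s, c = np.linalg.norm(axis), float(np.dot(u, v))
    if s < 1e-8:
        if c > 0:
            return np.eye(3)                                  # already aligned
        # anti-parallel: 180-degree rotation about any axis perpendicular to u
        perp = np.cross(u, [1.0, 0.0, 0.0])
        if np.linalg.norm(perp) < 1e-8:
            perp = np.cross(u, [0.0, 1.0, 0.0])
        perp = perp / np.linalg.norm(perp)
        return 2.0 * np.outer(perp, perp) - np.eye(3)
    k = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])                  # skew matrix of u x v
    return np.eye(3) + k + k @ k * ((1.0 - c) / s**2)


def bone_rotations(rest_joints: np.ndarray, pred_joints: np.ndarray, parents: list) -> list:
    """Per-child rotation mapping the rest-pose bone direction to the predicted one."""
    rots = []
    for child, parent in enumerate(parents):
        if parent < 0:                                        # root joint has no parent bone
            rots.append(np.eye(3))
            continue
        rest_dir = rest_joints[child] - rest_joints[parent]
        pred_dir = pred_joints[child] - pred_joints[parent]
        rots.append(rotation_between(rest_dir, pred_dir))
    return rots


if __name__ == "__main__":
    parents = [-1, 0, 1]                                      # simple 3-joint chain
    rest = np.array([[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 2.0, 0.0]])
    pred = np.array([[0.0, 0.0, 0.0], [0.7, 0.7, 0.0], [1.4, 1.4, 0.0]])  # chain bent 45 degrees
    for j, rot in enumerate(bone_rotations(rest, pred, parents)):
        angle = np.degrees(np.arccos(np.clip((np.trace(rot) - 1.0) / 2.0, -1.0, 1.0)))
        print(f"joint {j}: {angle:.1f} deg")                  # root 0.0, others 45.0
```

The paper's constraint-aware IK presumably also respects the skeleton hierarchy and per-asset joint constraints when fitting whole sequences; this snippet only illustrates the basic position-to-rotation step.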