
MonoArt: Progressive Structural Reasoning for Monocular Articulated 3D Reconstruction

Haitian Li, Haozhe Xie, Junxiang Xu, Beichen Wen, Fangzhou Hong, Ziwei Liu

2026-03-20


Summary

This paper introduces a new method, MonoArt, for recovering both the 3D shape of an articulated object (something with moving parts, like a robot or a toy) and how those parts move, using only a single picture.

What's the problem?

It's really hard to understand how a 3D object is put together and how its parts move from just one image. Motion cues and structural cues are entangled: something that moves can be mistaken for a fixed part of the object's structure, and vice versa, making it difficult for computers to determine both accurately. Previous solutions often needed many images from different angles, relied on retrieving similar objects to copy from, or generated auxiliary videos to help, but these approaches weren't always practical or fast.

What's the solution?

MonoArt solves this by taking a picture and gradually breaking it down into its core components. Instead of directly guessing how the object moves, it first transforms the image into a standard geometric shape, then identifies the individual parts, and finally creates a representation that understands how those parts relate to movement. This step-by-step process makes the system more stable and easier to understand, without needing extra information like motion examples or complicated multi-step processes.
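To make the step-by-step idea concrete, here is a minimal toy sketch of that kind of progressive pipeline: image features are first decoded into canonical geometry, the geometry is then split into parts, and each part finally gets an embedding that couples its shape with a motion axis. Every function name, shape, and operation here is an illustrative assumption, not the paper's actual architecture.

```python
import numpy as np

def encode_image(image):
    """Stand-in image encoder: mean-pool pixels into a feature vector.
    (Assumption: the real model uses a learned backbone.)"""
    return image.reshape(-1, image.shape[-1]).mean(axis=0)

def to_canonical_geometry(features, n_points=128):
    """Stage 1 (assumed): decode features into a canonical 3D point set
    via a toy fixed linear map."""
    rng = np.random.default_rng(0)
    W = rng.standard_normal((features.shape[0], n_points * 3))
    return (features @ W).reshape(n_points, 3)

def segment_parts(points):
    """Stage 2 (assumed): assign each point to one of two parts.
    Toy rule: split at the median x coordinate."""
    threshold = np.median(points[:, 0])
    return (points[:, 0] > threshold).astype(int)

def motion_aware_embedding(points, part_ids):
    """Stage 3 (assumed): per-part embedding combining the part centroid
    with its principal direction as a crude articulation-axis proxy."""
    embeddings = {}
    for part in np.unique(part_ids):
        pts = points[part_ids == part]
        centroid = pts.mean(axis=0)
        axis = np.linalg.svd(pts - centroid)[2][0]  # first right-singular vector
        embeddings[int(part)] = np.concatenate([centroid, axis])
    return embeddings

# Run the progressive pipeline on a dummy single image.
image = np.full((8, 8, 3), 0.5)
features = encode_image(image)
geometry = to_canonical_geometry(features)   # canonical geometry
parts = segment_parts(geometry)              # structured part representation
embeddings = motion_aware_embedding(geometry, parts)  # motion-aware embeddings
```

The point of the sketch is the ordering: articulation information (the motion axis) is only computed at the end, from already-structured parts, rather than regressed directly from raw image features.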

Why it matters?

This work is important because it allows computers to understand and reconstruct complex, moving objects more accurately and quickly than before. This has potential applications in areas like robotics, where robots need to understand how to manipulate objects, and in creating realistic 3D scenes for virtual reality or computer graphics.

Abstract

Reconstructing articulated 3D objects from a single image requires jointly inferring object geometry, part structure, and motion parameters from limited visual evidence. A key difficulty lies in the entanglement between motion cues and object structure, which makes direct articulation regression unstable. Existing methods address this challenge through multi-view supervision, retrieval-based assembly, or auxiliary video generation, often sacrificing scalability or efficiency. We present MonoArt, a unified framework grounded in progressive structural reasoning. Rather than predicting articulation directly from image features, MonoArt progressively transforms visual observations into canonical geometry, structured part representations, and motion-aware embeddings within a single architecture. This structured reasoning process enables stable and interpretable articulation inference without external motion templates or multi-stage pipelines. Extensive experiments on PartNet-Mobility demonstrate that MonoArt achieves state-of-the-art performance in both reconstruction accuracy and inference speed. The framework further generalizes to robotic manipulation and articulated scene reconstruction.