SIMART: Decomposing Monolithic Meshes into Sim-ready Articulated Assets via MLLM

Chuanrui Zhang, Minghan Qin, Yuang Wang, Baifeng Xie, Hang Li, Ziwei Wang

2026-03-25


Summary

This paper introduces SIMART, a system that takes a static, single-piece 3D mesh and turns it into an interactive 3D model with separately moving parts, such as furniture with doors and drawers, ready for use in physics simulation. It's designed to make it easier to build virtual worlds for artificial intelligence and simulations.

What's the problem?

Currently, creating these kinds of 3D models is difficult. Most 3D generation still produces static meshes with no moving parts. Existing methods for articulated objects often chain several separate modules together, and errors from each stage accumulate, leading to a final product that isn't quite right. Another approach uses a single unified model (a multimodal large language model, or MLLM), but it represents 3D space as a dense voxel grid, which produces very long token sequences and requires a lot of memory, making it hard to scale to complex objects. Essentially, it's hard to make detailed, interactive 3D models efficiently.

What's the solution?

SIMART addresses this with a single unified MLLM that breaks an object down into its individual parts and, in the same step, predicts how those parts connect and move. It also uses a more compact way to represent 3D space, called a Sparse 3D VQ-VAE, which cuts the number of 3D tokens by about 70% compared with dense voxel tokens and so lowers the memory overhead. This allows SIMART to handle more complex, multi-part objects at high fidelity.
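To give a feel for why a sparse representation can shorten token sequences, here is a minimal sketch that compares token counts for a toy shape. It is not SIMART's tokenizer: the spherical-shell shape, the grid resolution, the patch size, and all function names are assumptions chosen for illustration. The idea is only that a surface occupies a small fraction of a voxel grid, so emitting tokens for occupied patches alone produces far fewer tokens than emitting one per patch of the full grid.

```python
from itertools import product


def occupied_voxels(resolution):
    """Return voxel coordinates lying on a thin spherical shell (a toy surface)."""
    center = (resolution - 1) / 2.0
    radius = resolution * 0.4
    shell = set()
    for x, y, z in product(range(resolution), repeat=3):
        dist = ((x - center) ** 2 + (y - center) ** 2 + (z - center) ** 2) ** 0.5
        if abs(dist - radius) <= 0.5:
            shell.add((x, y, z))
    return shell


def dense_token_count(resolution, patch):
    """Dense scheme (assumed): one token per patch of the full grid."""
    return (resolution // patch) ** 3


def sparse_token_count(voxels, patch):
    """Sparse scheme (assumed): one token per patch containing an occupied voxel."""
    return len({(x // patch, y // patch, z // patch) for x, y, z in voxels})


def token_reduction(resolution=32, patch=4):
    """Return (dense tokens, sparse tokens, fractional reduction) for the toy shape."""
    voxels = occupied_voxels(resolution)
    dense = dense_token_count(resolution, patch)
    sparse = sparse_token_count(voxels, patch)
    return dense, sparse, 1.0 - sparse / dense


if __name__ == "__main__":
    dense, sparse, saved = token_reduction()
    print(f"dense={dense} sparse={sparse} reduction={saved:.0%}")
```

The reduction this toy prints depends entirely on the chosen shape and patch size, so it should not be read as reproducing the roughly 70% figure reported for SIMART.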

Why does it matter?

This work is important because high-quality 3D models are crucial for training AI to interact with the physical world and for running accurate simulations. By making it easier and more efficient to create these models, SIMART can help advance research in robotics, virtual reality, and other fields that rely on realistic 3D environments.

Abstract

High-quality articulated 3D assets are indispensable for embodied AI and physical simulation, yet 3D generation still focuses on static meshes, leaving a gap in "sim-ready" interactive objects. Most recent articulated object creation methods rely on multi-stage pipelines that accumulate errors across decoupled modules. Alternatively, unified MLLMs offer a single-stage path to joint static asset understanding and sim-ready asset generation. However, dense voxel-based 3D tokenization yields long 3D token sequences and high memory overhead, limiting scalability to complex articulated objects. To address this, we propose SIMART, a unified MLLM framework that jointly performs part-level decomposition and kinematic prediction. By introducing a Sparse 3D VQ-VAE, SIMART reduces token counts by 70% vs. dense voxel tokens, enabling high-fidelity multi-part assemblies. SIMART achieves state-of-the-art performance on PartNet-Mobility and in-the-wild AIGC datasets, and enables physics-based robotic simulation.
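As a rough illustration of what "part-level decomposition and kinematic prediction" could yield, the sketch below models a joint between two parts and moves a point on the child part. The `Joint` fields and the `move_point` helper are hypothetical and are not taken from the paper; they loosely follow the revolute and prismatic joint types common in robot description formats.

```python
import math
from dataclasses import dataclass


@dataclass
class Joint:
    """Hypothetical kinematic link between a parent part and a child part."""
    name: str
    kind: str      # "revolute" (hinge) or "prismatic" (slider)
    parent: str
    child: str
    axis: tuple    # unit vector giving the joint axis direction
    origin: tuple  # a point on the axis (used by revolute joints)
    lower: float   # lower limit: radians for revolute, length units for prismatic
    upper: float   # upper limit, same units as lower


def move_point(joint, point, value):
    """Move a point on the child part by a joint value clamped to the limits."""
    q = min(max(value, joint.lower), joint.upper)
    ax, ay, az = joint.axis
    if joint.kind == "prismatic":
        # Translate along the axis.
        return (point[0] + q * ax, point[1] + q * ay, point[2] + q * az)
    if joint.kind == "revolute":
        # Rotate about the axis through the origin (Rodrigues' rotation formula).
        px, py, pz = (point[i] - joint.origin[i] for i in range(3))
        c, s = math.cos(q), math.sin(q)
        dot = ax * px + ay * py + az * pz
        cross = (ay * pz - az * py, az * px - ax * pz, ax * py - ay * px)
        rotated = (
            px * c + cross[0] * s + ax * dot * (1 - c),
            py * c + cross[1] * s + ay * dot * (1 - c),
            pz * c + cross[2] * s + az * dot * (1 - c),
        )
        return tuple(rotated[i] + joint.origin[i] for i in range(3))
    raise ValueError(f"unsupported joint kind: {joint.kind}")


# Example: a cabinet door hinged about the vertical axis, and a drawer that slides out.
door = Joint("door_hinge", "revolute", "body", "door",
             (0.0, 0.0, 1.0), (0.0, 0.0, 0.0), 0.0, math.pi / 2)
drawer = Joint("drawer_slide", "prismatic", "body", "drawer",
               (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), 0.0, 0.3)
```

In this sketch, a simulator-facing exporter would need one such record per moving part plus the part meshes themselves; how SIMART actually encodes and emits these quantities is not specified in the text above.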
