Part-X-MLLM: Part-aware 3D Multimodal Large Language Model
Chunshi Wang, Junliang Ye, Yunhan Yang, Yang Li, Zizhuo Lin, Jun Zhu, Zhuo Chen, Yawei Luo, Chunchao Guo
2025-11-18
Summary
This paper introduces a new AI model called Part-X-MLLM that can understand and work with 3D objects using both colored 3D point clouds and natural language. It's designed to handle different 3D tasks in a unified way, such as answering questions about 3D objects and their parts, creating new 3D objects, and editing existing ones.
What's the problem?
Currently, working with 3D objects and AI is complicated because different tasks require different approaches. It's hard to build one system that can both *understand* what you want to do with a 3D object (through language) and then *actually do it* (manipulate the 3D shape). Existing methods often tightly link the 'thinking' part with the 'doing' part, which makes them inflexible.
What's the solution?
The researchers created Part-X-MLLM, which first translates a user's instructions (in plain language) and a colored 3D point cloud of an object into a kind of program: a structured plan. This plan spells out where the different parts of the object are (as bounding boxes), what those parts are called, and what changes to make. A separate, geometry-aware 3D tool then reads the plan to actually create or modify the object. By separating the planning from the execution, different 3D tools can be swapped in without changing the core AI model. The model was trained on a large amount of data showing objects broken down into their parts.
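To make the idea of a structured plan concrete, here is a minimal, hypothetical sketch of what such a plan could contain for a chair-editing request. The field names and layout are illustrative assumptions for this summary, not the paper's actual token grammar.

```python
# Hypothetical "structured plan" for editing a chair.
# Field names (parts, bbox, label, edits, ...) are assumptions for
# illustration only; Part-X-MLLM defines its own executable grammar.
plan = {
    "parts": [
        {"id": 0, "label": "seat",     "bbox": [[-0.4, 0.35, -0.4], [0.4, 0.45, 0.4]]},
        {"id": 1, "label": "backrest", "bbox": [[-0.4, 0.45, 0.3],  [0.4, 1.0, 0.4]]},
        {"id": 2, "label": "left leg", "bbox": [[-0.4, 0.0, -0.4],  [-0.3, 0.35, -0.3]]},
    ],
    "edits": [
        # "Make the backrest taller" expressed as an edit command on part 1.
        {"op": "scale", "target": 1, "axis": "y", "factor": 1.3},
    ],
}

# A downstream geometry tool only needs to read this plan; it never has to
# see the language model's internal reasoning.
for edit in plan["edits"]:
    part = plan["parts"][edit["target"]]
    print(f"Apply {edit['op']} to '{part['label']}' within bbox {part['bbox']}")
```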
Why does it matter?
This research matters because it simplifies how we interact with 3D objects using AI, enabling more flexible and powerful 3D creation and editing tools. Because the AI produces an explicit plan, it's easier to understand *why* it is doing something, and the same model can drive different 3D software, making it a versatile tool for designers, engineers, and anyone working with 3D content.
Abstract
We introduce Part-X-MLLM, a native 3D multimodal large language model that unifies diverse 3D tasks by formulating them as programs in a structured, executable grammar. Given an RGB point cloud and a natural language prompt, our model autoregressively generates a single, coherent token sequence encoding part-level bounding boxes, semantic descriptions, and edit commands. This structured output serves as a versatile interface to drive downstream geometry-aware modules for part-based generation and editing. By decoupling the symbolic planning from the geometric synthesis, our approach allows any compatible geometry engine to be controlled through a single, language-native frontend. We pre-train a dual-encoder architecture to disentangle structure from semantics and instruction-tune the model on a large-scale, part-centric dataset. Experiments demonstrate that our model excels at producing high-quality, structured plans, enabling state-of-the-art performance in grounded Q&A, compositional generation, and localized editing through one unified interface. Project page: https://chunshi.wang/Part-X-MLLM/
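As a rough illustration of the decoupling the abstract describes, the sketch below shows how a language-native frontend could hand the same structured plan to interchangeable geometry backends. All class and function names here (GeometryEngine, MeshBackend, execute, run) are hypothetical and assumed for this sketch; they are not the released Part-X-MLLM code or API.

```python
from typing import Protocol

# Hypothetical sketch of the "plan -> geometry engine" decoupling.
# Any backend that can read the plan can be swapped in without
# touching the language model that produced it.

class GeometryEngine(Protocol):
    def execute(self, plan: dict) -> str:
        """Synthesize or edit geometry from a structured, part-level plan."""
        ...

class MeshBackend:
    def execute(self, plan: dict) -> str:
        return f"mesh built from {len(plan['parts'])} parts"

class PointCloudBackend:
    def execute(self, plan: dict) -> str:
        return f"point cloud updated with {len(plan['edits'])} edit command(s)"

def run(frontend_plan: dict, engine: GeometryEngine) -> str:
    # The frontend (the MLLM) only produces the plan; the chosen engine
    # is responsible for the actual geometric synthesis.
    return engine.execute(frontend_plan)

plan = {"parts": [{"label": "seat"}, {"label": "backrest"}], "edits": [{"op": "scale"}]}
print(run(plan, MeshBackend()))        # mesh built from 2 parts
print(run(plan, PointCloudBackend()))  # point cloud updated with 1 edit command(s)
```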