Exploring the Potential of Encoder-free Architectures in 3D LMMs
Yiwen Tang, Zoey Guo, Zhuhao Wang, Ray Zhang, Qizhi Chen, Junli Liu, Delin Qu, Zhigang Wang, Dong Wang, Xuelong Li, Bin Zhao
2025-02-14
Summary
This paper introduces a new way for AI to understand 3D shapes and scenes without using a special component called an encoder. The researchers created a system called ENEL that can do tasks like describing 3D objects and answering questions about them just as well as more complex systems.
What's the problem?
Current AI models that work with 3D data use a component called an encoder to turn the raw data into something the language model can process. However, these encoders struggle with point clouds of different resolutions (3D scans made up of more or fewer points), and the features they produce don't always match what the language model actually needs. This makes it hard for the AI to understand 3D shapes and objects properly.
What's the solution?
The researchers came up with two main strategies to solve this problem. First, they created a way for the language model itself to learn about 3D shapes during pre-training, using a technique called LLM-embedded Semantic Encoding. Second, they developed a method called Hierarchical Geometry Aggregation, which helps the model focus on the local details of 3D point clouds. Using these techniques, they built ENEL, an AI system that can understand 3D shapes without needing an encoder.
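To make the second idea concrete, here is a minimal, hedged sketch of what hierarchical geometry aggregation can look like: spread-out center points are chosen with farthest point sampling, each center gathers its nearest neighbors, and their features are max-pooled into fewer, geometry-aware tokens. The function names and pooling choice are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def farthest_point_sampling(points, n_samples):
    """Pick n_samples well-spread indices from an (N, 3) point cloud."""
    n = points.shape[0]
    selected = np.zeros(n_samples, dtype=int)
    dist = np.full(n, np.inf)
    selected[0] = 0  # start from an arbitrary point
    for i in range(1, n_samples):
        # distance of every point to its nearest already-selected center
        dist = np.minimum(dist, np.linalg.norm(points - points[selected[i - 1]], axis=1))
        selected[i] = int(np.argmax(dist))  # pick the farthest-away point next
    return selected

def hierarchical_aggregate(points, features, n_centers, k):
    """Pool point features over local neighborhoods: (N, D) tokens -> (n_centers, D)."""
    centers = farthest_point_sampling(points, n_centers)
    center_pts = points[centers]                                   # (C, 3)
    d = np.linalg.norm(points[None, :, :] - center_pts[:, None, :], axis=-1)  # (C, N)
    knn = np.argsort(d, axis=1)[:, :k]                             # k nearest per center
    pooled = features[knn].max(axis=1)                             # max-pool each group
    return center_pts, pooled
```

Applied in the early layers of the LLM, this kind of grouping injects a locality bias, so later layers see a shorter sequence of tokens that each summarize a spatial region.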
Why it matters?
This matters because it could make AI systems that work with 3D data simpler and more efficient. ENEL performs just as well as more complex systems on tasks like classifying 3D objects and describing them, even though it's smaller and doesn't use an encoder. This could lead to better AI for things like virtual reality, 3D modeling, and robotics, while using less computing power.
Abstract
Encoder-free architectures have been preliminarily explored in the 2D visual domain, yet it remains an open question whether they can be effectively applied to 3D understanding scenarios. In this paper, we present the first comprehensive investigation into the potential of encoder-free architectures to overcome the challenges of encoder-based 3D Large Multimodal Models (LMMs). These challenges include the failure to adapt to varying point cloud resolutions and the point features from the encoder not meeting the semantic needs of Large Language Models (LLMs). We identify key aspects for 3D LMMs to remove the encoder and enable the LLM to assume the role of the 3D encoder: 1) We propose the LLM-embedded Semantic Encoding strategy in the pre-training stage, exploring the effects of various point cloud self-supervised losses, and we present the Hybrid Semantic Loss to extract high-level semantics. 2) We introduce the Hierarchical Geometry Aggregation strategy in the instruction tuning stage. This incorporates inductive bias into the LLM early layers to focus on the local details of the point clouds. To this end, we present the first Encoder-free 3D LMM, ENEL. Our 7B model rivals the current state-of-the-art model, ShapeLLM-13B, achieving 55.0%, 50.92%, and 42.7% on the classification, captioning, and VQA tasks, respectively. Our results demonstrate that the encoder-free architecture is highly promising for replacing encoder-based architectures in the field of 3D understanding. The code is released at https://github.com/Ivan-Tang-3D/ENEL
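As a rough illustration of the Hybrid Semantic Loss mentioned above, a "hybrid" self-supervised objective can mix a low-level geometry term with a high-level semantic term. The sketch below combines a symmetric Chamfer distance on reconstructed points with a cosine-similarity term on patch features; the weighting, the specific terms, and all names are assumptions for illustration only, not the paper's actual loss.

```python
import numpy as np

def chamfer_distance(pred, target):
    """Symmetric Chamfer distance between (N, 3) and (M, 3) point sets."""
    d = np.linalg.norm(pred[:, None, :] - target[None, :, :], axis=-1)  # (N, M)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def hybrid_semantic_loss(pred_pts, target_pts, pred_feats, target_feats, alpha=0.5):
    """Geometry term (Chamfer on points) plus semantic term
    (1 - mean cosine similarity between patch feature vectors)."""
    geom = chamfer_distance(pred_pts, target_pts)
    cos = np.sum(pred_feats * target_feats, axis=-1) / (
        np.linalg.norm(pred_feats, axis=-1)
        * np.linalg.norm(target_feats, axis=-1) + 1e-8)
    sem = 1.0 - cos.mean()
    return geom + alpha * sem
```

The intuition is that the geometric term keeps the model faithful to raw shape, while the semantic term pushes its internal features toward the higher-level representations an LLM can reason over.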