The model is built on 3D-aware discrete tokens produced by a 3D vector-quantized variational autoencoder (VQVAE) and is trained on 3D-Alpaca, a large-scale continuous training dataset spanning generation, comprehension, and editing. 3D-Alpaca provides a comprehensive foundation for training and evaluating 3D large language models, as well as rich resources for future research. ShapeLLM-Omni inherits Qwen2.5-VL's strong multimodal capabilities and additionally supports text-to-3D, image-to-3D, 3D captioning, and 3D editing via text instructions.
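To make the tokenization idea concrete, here is a minimal, hypothetical sketch (not the official implementation) of how a 3D VQVAE can turn a voxelized asset into discrete tokens that an LLM can consume alongside text. The module names, grid resolution, and codebook size below are illustrative assumptions, not values from the paper.

```python
# Toy sketch of the 3D VQVAE tokenization idea: a voxel occupancy grid is
# encoded to a coarse latent volume, each latent vector is snapped to its
# nearest codebook entry, and the resulting integer indices act as
# "3D-aware discrete tokens". All names and sizes here are assumptions.
import torch
import torch.nn as nn

class Toy3DVQTokenizer(nn.Module):
    def __init__(self, codebook_size=8192, dim=64):
        super().__init__()
        # Downsample a 1-channel occupancy grid to a coarse latent volume.
        self.encoder = nn.Sequential(
            nn.Conv3d(1, dim, kernel_size=4, stride=4),
            nn.ReLU(),
            nn.Conv3d(dim, dim, kernel_size=2, stride=2),
        )
        # Learnable codebook: each row is one discrete "3D token" embedding.
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, voxels):
        # voxels: (B, 1, D, H, W) occupancy grid, e.g. a 64^3 asset
        z = self.encoder(voxels)                         # (B, dim, d, h, w)
        b, c, d, h, w = z.shape
        flat = z.permute(0, 2, 3, 4, 1).reshape(-1, c)   # one vector per cell
        # Nearest-neighbour lookup against the codebook (vector quantization).
        dists = torch.cdist(flat, self.codebook.weight)  # (B*d*h*w, K)
        indices = dists.argmin(dim=1).view(b, d * h * w)
        return indices  # token ids, ready to interleave with text tokens

tokenizer = Toy3DVQTokenizer()
ids = tokenizer(torch.rand(1, 1, 64, 64, 64))
print(ids.shape)  # torch.Size([1, 512]): a short sequence of 3D tokens
```

The key design choice this illustrates is that the quantized indices, not the continuous latents, become the asset's representation, so 3D content shares the same discrete-token interface as language.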


The authors report impressive qualitative results, with examples of both text-to-3D and image-to-3D generation. A public demo showcases the model's image-to-3D, text-to-3D, and 3D understanding capabilities, letting users try them out firsthand. Overall, ShapeLLM-Omni is a significant step toward extending multimodal models with basic 3D capabilities and a contribution to future research in 3D-native AI.

Key Features

3D vector-quantized variational autoencoder (VQVAE) for tokenizing 3D assets
3D-aware discrete tokens
Supports text-to-3D, image-to-3D, 3D captioning, and 3D editing (see the prompt sketch after this list)
Trained on the large-scale continuous training dataset 3D-Alpaca
Comprehensive foundation for training and evaluating 3D large language models
Inherits Qwen2.5-VL's strong multimodal capabilities
Demo available for trying out the model's capabilities
Contributes to future research in 3D-native AI
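Since the model handles all four task types through one token interface, the following hypothetical sketch shows how such tasks might be framed as interleaved text-and-3D-token prompts. The `<3d>` tags and prompt wording are assumptions for illustration, not ShapeLLM-Omni's actual chat format.

```python
# Illustrative sketch of framing generation, captioning, and editing as
# token sequences for a unified multimodal LLM. The <3d>...</3d> tags and
# templates are assumptions, not the model's real prompt format.
def build_prompt(task, text=None, shape_tokens=None):
    """Wrap discrete 3D tokens and/or text into a single instruction string."""
    shape = ""
    if shape_tokens is not None:
        shape = "<3d>" + " ".join(str(t) for t in shape_tokens) + "</3d>"
    templates = {
        "text-to-3d": f"Generate a 3D asset: {text}",
        "captioning": f"Describe this 3D asset. {shape}",
        "editing":    f"{shape} Edit instruction: {text}",
    }
    return templates[task]

print(build_prompt("captioning", shape_tokens=[17, 402, 9]))
print(build_prompt("editing", text="make the chair taller",
                   shape_tokens=[17, 402, 9]))
```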
