The model is built on 3D-aware discrete tokens and trained on a large-scale continuous training dataset named 3D-Alpaca, which spans generation, comprehension, and editing tasks and provides a comprehensive foundation for training and evaluating 3D large language models. ShapeLLM-Omni inherits Qwen2.5-VL's strong multimodal capabilities and additionally supports text-to-3D, image-to-3D, 3D captioning, and 3D editing via text instructions. A sketch of the discrete-token idea follows below.
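The following is a minimal sketch, not the authors' implementation, of what "3D-aware discrete tokens" could look like in practice: a VQ-VAE-style codebook quantizes a voxelized shape into a short sequence of integer IDs that an autoregressive language model can treat like ordinary text tokens. All class names, layer sizes, and the codebook size are illustrative assumptions.

```python
# Hypothetical sketch: turning a 3D shape into discrete tokens for an LLM.
import torch
import torch.nn as nn

class Voxel3DTokenizer(nn.Module):
    """Encodes an occupancy voxel grid into discrete codebook indices (illustrative)."""
    def __init__(self, codebook_size=8192, embed_dim=256):
        super().__init__()
        # 3D conv encoder: a 64^3 occupancy grid -> 16^3 latent cells.
        self.encoder = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(64, embed_dim, kernel_size=4, stride=2, padding=1),
        )
        self.codebook = nn.Embedding(codebook_size, embed_dim)

    def forward(self, voxels):                       # voxels: (B, 1, D, H, W)
        z = self.encoder(voxels)                     # (B, C, d, h, w)
        z = z.flatten(2).transpose(1, 2)             # (B, N, C) latent cells
        # Nearest-codebook-entry quantization -> integer token IDs.
        book = self.codebook.weight.unsqueeze(0).expand(z.size(0), -1, -1)
        dist = torch.cdist(z, book)                  # (B, N, K) distances
        return dist.argmin(-1)                       # (B, N) discrete 3D tokens

if __name__ == "__main__":
    tokenizer = Voxel3DTokenizer()
    voxels = (torch.rand(1, 1, 64, 64, 64) > 0.5).float()  # toy occupancy grid
    shape_tokens = tokenizer(voxels)
    # In a unified model, these IDs would be offset into the LLM's vocabulary and
    # concatenated with text tokens so one transformer handles both modalities.
    print(shape_tokens.shape)  # torch.Size([1, 4096])
```

Under this framing, text-to-3D, image-to-3D, captioning, and editing all reduce to next-token prediction over mixed text/shape sequences, which is what lets a single model cover all of the listed tasks.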
The model demonstrates strong qualitative results on text-to-3D and image-to-3D generation, and an interactive demo showcases its image-to-3D, text-to-3D, and 3D understanding capabilities firsthand. Overall, ShapeLLM-Omni is a meaningful step toward extending multimodal models with basic 3D capabilities, laying groundwork for future research in 3D-native AI.