
3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination

Jianing Yang, Xuweiyi Chen, Nikhil Madaan, Madhavan Iyengar, Shengyi Qian, David F. Fouhey, Joyce Chai

2024-06-13

Summary

This paper introduces 3D-GRAND, a new dataset designed to improve how large language models (LLMs) understand and interact with 3D environments. It provides a large collection of household scenes paired with language instructions whose phrases are tied directly to objects in those scenes, which helps these models learn to ground language in 3D.

What's the problem?

Current language models struggle to accurately connect language with 3D objects and scenes. This limitation makes it difficult for robots and AI systems to understand and interact with the physical world. Additionally, many existing datasets lack the dense language-to-object links needed for effective learning, leading to errors or 'hallucinations' where the model describes objects that are not actually present in the scene.

What's the solution?

The authors created the 3D-GRAND dataset, which includes 40,087 household scenes and 6.2 million scene-language instructions that are densely grounded, meaning phrases in the text are explicitly linked to specific objects in those scenes. This dataset helps train models more effectively by providing clear examples of how language relates to 3D environments. They also introduced 3D-POPE, a benchmark that systematically measures how often models hallucinate objects that are not present in a scene, so that grounding and hallucination can be compared fairly across models (see the sketch below).
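
The snippet below is an illustrative sketch only, not the released data format or the official 3D-POPE protocol. It assumes that each densely-grounded instruction links noun phrases to object IDs in a scene, and that hallucination is probed with yes/no object-existence questions scored with standard classification metrics; all names (GroundedInstruction, pope_style_score, the example scene and IDs) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GroundedInstruction:
    """Hypothetical record: language tied to specific objects in a 3D scene."""
    scene_id: str                     # which household scene this refers to
    text: str                         # the instruction or description
    phrase_to_object: dict[str, int]  # noun phrase -> object ID in the scene


example = GroundedInstruction(
    scene_id="scene_0001",
    text="Place the blue mug on the wooden table near the window.",
    phrase_to_object={"the blue mug": 17, "the wooden table": 4, "the window": 9},
)


def pope_style_score(answers: list[tuple[bool, bool]]) -> dict[str, float]:
    """Score yes/no object-existence probes.

    `answers` holds (object_is_in_scene, model_said_yes) pairs.
    Saying "yes" for an absent object counts as a hallucination.
    """
    tp = sum(1 for present, yes in answers if present and yes)
    fp = sum(1 for present, yes in answers if not present and yes)
    fn = sum(1 for present, yes in answers if present and not yes)
    tn = sum(1 for present, yes in answers if not present and not yes)
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "hallucination_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }
```

Under these assumptions, a lower hallucination_rate means the model less often claims that absent objects exist, which is the kind of behavior 3D-POPE is designed to surface.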

Why it matters?

This research is important because it enhances the ability of AI systems to understand and navigate the real world by improving their grounding in 3D contexts. By providing a comprehensive dataset and evaluation tools, the authors aim to advance the development of more reliable AI applications in robotics, virtual assistants, and other areas where understanding physical spaces is crucial.

Abstract

The integration of language and 3D perception is crucial for developing embodied agents and robots that comprehend and interact with the physical world. While large language models (LLMs) have demonstrated impressive language understanding and generation capabilities, their adaptation to 3D environments (3D-LLMs) remains in its early stages. A primary challenge is the absence of large-scale datasets that provide dense grounding between language and 3D scenes. In this paper, we introduce 3D-GRAND, a pioneering large-scale dataset comprising 40,087 household scenes paired with 6.2 million densely-grounded scene-language instructions. Our results show that instruction tuning with 3D-GRAND significantly enhances grounding capabilities and reduces hallucinations in 3D-LLMs. As part of our contributions, we propose a comprehensive benchmark 3D-POPE to systematically evaluate hallucination in 3D-LLMs, enabling fair comparisons among future models. Our experiments highlight a scaling effect between dataset size and 3D-LLM performance, emphasizing the critical role of large-scale 3D-text datasets in advancing embodied AI research. Notably, our results demonstrate early signals for effective sim-to-real transfer, indicating that models trained on large synthetic data can perform well on real-world 3D scans. Through 3D-GRAND and 3D-POPE, we aim to equip the embodied AI community with essential resources and insights, setting the stage for more reliable and better-grounded 3D-LLMs. Project website: https://3d-grand.github.io