
3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding

Tatiana Zemskova, Dmitry Yudin

2024-12-25

Summary

This paper talks about 3DGraphLLM, a new method that combines semantic graphs and large language models (LLMs) to improve understanding of 3D scenes, making it easier for robots to interact with their environment.

What's the problem?

Understanding complex 3D scenes is challenging for AI because traditional methods often only focus on the positions of objects without considering how they relate to each other. This lack of context can lead to misunderstandings when robots need to respond to questions about their surroundings or perform tasks based on visual input.

What's the solution?

The authors propose 3DGraphLLM, which constructs a learnable representation of a 3D scene graph. This graph encodes the objects in the scene and the semantic relationships between them, giving the LLM the context it needs to reason about the scene. With this representation, the model can ground objects mentioned in a text query and answer questions about the scene more accurately. The researchers tested their approach on several datasets and found that it outperformed existing methods that do not consider these relationships.
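To make the idea concrete, here is a minimal sketch of how a scene graph could be flattened into a sequence of embeddings that is fed to an LLM together with the user's question. This is not the authors' implementation; the names (`SceneObject`, `SceneGraph`, `graph_to_llm_prefix`, `encode_object`, `encode_relation`) are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

import torch


@dataclass
class SceneObject:
    object_id: int
    features: torch.Tensor  # e.g. an embedding of the object's point cloud


@dataclass
class SceneGraph:
    objects: List[SceneObject]
    # (subject_id, predicate, object_id), e.g. (3, "standing on", 7)
    relations: List[Tuple[int, str, int]]


def graph_to_llm_prefix(
    graph: SceneGraph,
    encode_object: Callable[[torch.Tensor], torch.Tensor],  # object features -> LLM-space embedding
    encode_relation: Callable[[str], torch.Tensor],          # predicate string -> LLM-space embedding
) -> torch.Tensor:
    """Flatten the scene graph into a sequence of embeddings.

    Each object is followed by the relation triplets it participates in,
    so the language model sees both the objects and how they relate to
    each other, not just their coordinates.
    """
    objects_by_id = {o.object_id: o for o in graph.objects}
    prefix = []
    for obj in graph.objects:
        prefix.append(encode_object(obj.features))
        for subject_id, predicate, target_id in graph.relations:
            if subject_id == obj.object_id:
                prefix.append(encode_relation(predicate))
                prefix.append(encode_object(objects_by_id[target_id].features))
    # Shape: (num_prefix_tokens, llm_hidden_dim); this sequence is prepended
    # to the embeddings of the user's natural-language query.
    return torch.stack(prefix)
```

The key point the sketch illustrates is that relation information is made explicit in the input sequence, rather than leaving the LLM to infer it from object positions alone.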

Why it matters?

This research is important because it enhances how AI systems, especially robots, understand and interact with complex environments. By improving the ability of robots to interpret 3D scenes, 3DGraphLLM could lead to better performance in applications like autonomous vehicles, virtual reality, and robotic assistants, making them more effective in real-world situations.

Abstract

A 3D scene graph represents a compact scene model, storing information about the objects and the semantic relationships between them, making its use promising for robotic tasks. When interacting with a user, an embodied intelligent agent should be capable of responding to various queries about the scene formulated in natural language. Large Language Models (LLMs) are beneficial solutions for user-robot interaction due to their natural language understanding and reasoning abilities. Recent methods for creating learnable representations of 3D scenes have demonstrated the potential to improve the quality of LLM responses by adapting to the 3D world. However, the existing methods do not explicitly utilize information about the semantic relationships between objects, limiting themselves to information about their coordinates. In this work, we propose a method, 3DGraphLLM, for constructing a learnable representation of a 3D scene graph. The learnable representation is used as input for LLMs to perform 3D vision-language tasks. In our experiments on the popular ScanRefer, RIORefer, Multi3DRefer, ScanQA, SQA3D, and Scan2Cap datasets, we demonstrate the advantage of this approach over baseline methods that do not use information about the semantic relationships between objects. The code is publicly available at https://github.com/CognitiveAISystems/3DGraphLLM.