LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences
Hongyan Zhi, Peihao Chen, Junyan Li, Shuailei Ma, Xinyu Sun, Tianhang Xiang, Yinjie Lei, Mingkui Tan, Chuang Gan
2024-12-04

Summary
This paper introduces LSceneLLM, a new framework designed to improve the understanding of large 3D scenes by adaptively focusing on task-relevant details according to the language model's visual preferences.
What's the problem?
Understanding complex 3D scenes is challenging because they contain a large amount of dense visual information. Existing methods often try to segment and analyze every object in a scene, but this produces a lot of redundant, task-irrelevant information while missing the fine-grained details needed for a specific task. As a result, AI systems struggle to accurately locate and understand the parts of a scene that actually matter for the question or instruction at hand.
What's the solution?
LSceneLLM addresses this issue by automatically identifying the most important areas of a 3D scene for the task at hand. A plug-and-play 'scene magnifier' module zooms in on these areas to capture finer details: a dense token selector inspects the language model's attention map to find which regions it focuses on for the current instruction, and an adaptive self-attention module then fuses the coarse scene features with the selected fine-grained ones. Additionally, LSceneLLM introduces a new benchmark called XR-Scene to evaluate how well AI models understand large, cross-room scenes through tasks such as question answering, embodied planning, and scene captioning.
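To make the dense token selector idea concrete, here is a minimal sketch of how attention-guided selection of fine-grained tokens could look. This is not the authors' implementation; the function name, tensor shapes, and the choice of averaging attention over instruction tokens are illustrative assumptions.

```python
# Hypothetical sketch: rank coarse scene regions by the attention the LLM pays
# to them for the current instruction, then gather fine-grained (dense) tokens
# only from the top-ranked regions. Shapes and names are assumptions.
import torch


def select_dense_tokens(attn_map, dense_tokens, top_k=8):
    """
    attn_map:     (num_text_tokens, num_coarse_tokens) attention weights from
                  the instruction tokens over the coarse scene tokens.
    dense_tokens: (num_coarse_tokens, tokens_per_region, dim) fine-grained
                  features grouped by the coarse region they belong to.
    Returns the fine-grained tokens of the top_k most-attended regions.
    """
    # Average attention over the instruction tokens -> one score per region.
    region_scores = attn_map.mean(dim=0)             # (num_coarse_tokens,)
    top_regions = region_scores.topk(top_k).indices  # (top_k,)
    # Gather and flatten the dense tokens of the selected regions.
    selected = dense_tokens[top_regions]             # (top_k, tokens_per_region, dim)
    return selected.flatten(0, 1)                    # (top_k * tokens_per_region, dim)


# Toy usage with random tensors.
attn = torch.rand(12, 64)        # 12 instruction tokens, 64 coarse regions
dense = torch.rand(64, 16, 256)  # 16 dense tokens per region, feature dim 256
fine_tokens = select_dense_tokens(attn, dense, top_k=8)
print(fine_tokens.shape)         # torch.Size([128, 256])
```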
Why it matters?
This research is significant because it enhances how AI systems can interpret and interact with complex 3D environments. By improving scene understanding, LSceneLLM can benefit applications such as robotics, virtual reality, and gaming, where accurate navigation and interaction within 3D spaces are crucial.
Abstract
Research on 3D Vision-Language Models (3D-VLMs) is gaining increasing attention; such models are crucial for developing embodied AI within 3D scenes, for example for visual navigation and embodied question answering. Due to the high density of visual features, especially in large 3D scenes, accurately locating task-relevant visual information is challenging. Existing works attempt to segment all objects and use their features as scene representations. However, these task-agnostic object features contain much redundant information while missing details of the task-relevant areas. To tackle these problems, we propose LSceneLLM, an adaptive framework that automatically identifies task-relevant areas by leveraging the LLM's visual preferences for different tasks, followed by a plug-and-play scene magnifier module that captures fine-grained details in the focused areas. Specifically, a dense token selector examines the attention map of the LLM to identify its visual preferences for the instruction input. It then magnifies fine-grained details of the focused area. An adaptive self-attention module is leveraged to fuse the coarse-grained and selected fine-grained visual information. To comprehensively evaluate the large scene understanding ability of 3D-VLMs, we further introduce a cross-room understanding benchmark, XR-Scene, which contains a series of large scene understanding tasks including XR-QA, XR-EmbodiedPlanning, and XR-SceneCaption. Experiments show that our method surpasses existing methods on both large scene understanding and existing scene understanding benchmarks. Plugging our scene magnifier module into existing 3D-VLMs also brings significant improvements.
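The abstract's adaptive self-attention module, which fuses coarse-grained and selected fine-grained visual information, could be sketched as a joint self-attention over the concatenated token sets. The module below is an illustrative assumption, not the paper's implementation; class and argument names are invented for the example.

```python
# Minimal sketch of fusing coarse scene tokens with the selected fine-grained
# tokens via self-attention. Names, dimensions, and the residual + LayerNorm
# design are assumptions for illustration only.
import torch
import torch.nn as nn


class CoarseFineFusion(nn.Module):
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, coarse_tokens, fine_tokens):
        # coarse_tokens: (batch, num_coarse, dim)
        # fine_tokens:   (batch, num_fine, dim)
        tokens = torch.cat([coarse_tokens, fine_tokens], dim=1)
        # Joint self-attention lets the two granularities exchange information;
        # the residual connection preserves the original features.
        fused, _ = self.attn(tokens, tokens, tokens)
        return self.norm(tokens + fused)


# Toy usage: 64 coarse tokens plus 128 selected fine-grained tokens.
fusion = CoarseFineFusion(dim=256)
coarse = torch.rand(1, 64, 256)
fine = torch.rand(1, 128, 256)
out = fusion(coarse, fine)
print(out.shape)  # torch.Size([1, 192, 256])
```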