3D Question Answering for City Scene Understanding
Penglei Sun, Yaoxian Song, Xiang Liu, Xiaofei Yang, Qiang Wang, Tiefeng Li, Yang Yang, Xiaowen Chu
2024-07-30

Summary
This paper tackles city scene understanding through 3D question answering. It introduces City-3DQA, a new dataset for this task, along with a method that helps AI systems better comprehend and answer questions about urban environments.
What's the problem?
Most research on understanding 3D environments has focused on indoor settings or specific outdoor tasks like autonomous driving. Complex city scenes have received little attention, in part because existing data lacks detailed spatial semantic information about the urban layout and information about how people interact with their surroundings. This makes it hard for AI models to accurately answer questions about city environments.
What's the solution?
To tackle these challenges, the authors created a new dataset called City-3DQA, designed specifically for city-level scene understanding. It includes semantic information about different elements in the city as well as tasks that involve human-environment interactions. They also developed a method called Scene graph enhanced City-level Understanding (Sg-CityU), which uses a scene graph to represent spatial relationships between objects in the city; grounding answers in these relationships improves the accuracy of AI models when they answer questions about the scene, as the sketch below illustrates.
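The paper summary does not include code, so the following is only a minimal sketch of the general scene-graph idea, not the authors' Sg-CityU implementation. The class name, object labels, and relation names (e.g., "in_front_of") are all hypothetical, invented here to show how labeled nodes and spatial-relation edges could support answering a question about a city scene.

```python
# Illustrative sketch of a city scene graph (not the paper's implementation).
# Nodes are city objects with semantic labels; directed, labeled edges
# encode spatial relations between them.
from collections import defaultdict

class SceneGraph:
    def __init__(self):
        # object id -> semantic label (e.g., "office building", "parked car")
        self.nodes = {}
        # (subject id, relation name) -> list of related object ids
        self.edges = defaultdict(list)

    def add_object(self, obj_id, label):
        self.nodes[obj_id] = label

    def add_relation(self, subj, relation, obj):
        self.edges[(subj, relation)].append(obj)

    def query(self, subj, relation):
        """Return labels of objects related to `subj` by `relation`."""
        return [self.nodes[o] for o in self.edges[(subj, relation)]]

# Hypothetical city scene: labels and relations are invented for illustration.
g = SceneGraph()
g.add_object("building_1", "office building")
g.add_object("car_3", "parked car")
g.add_object("tree_7", "street tree")
g.add_relation("car_3", "in_front_of", "building_1")
g.add_relation("tree_7", "next_to", "building_1")

# A QA model could ground "What is the parked car in front of?" by
# looking up the car's spatial relations in the graph:
print(g.query("car_3", "in_front_of"))  # -> ['office building']
```

The design point is that spatial semantics become explicit, queryable structure rather than something the model must infer from raw 3D data alone.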
Why it matters?
This research is important because it enhances the ability of AI systems to understand and interact with complex urban environments. By providing a structured way to analyze city scenes and answer related questions, City-3DQA can lead to advancements in areas like smart city planning, autonomous navigation, and urban research, ultimately helping improve how we live and interact in cities.
Abstract
3D multimodal question answering (MQA) plays a crucial role in scene understanding by enabling intelligent agents to comprehend their surroundings in 3D environments. While existing research has primarily focused on indoor household tasks and outdoor roadside autonomous driving tasks, there has been limited exploration of city-level scene understanding tasks. Furthermore, existing research faces challenges in understanding city scenes due to the absence of spatial semantic information and human-environment interaction information at the city level. To address these challenges, we investigate 3D MQA from both the dataset and method perspectives. From the dataset perspective, we introduce a novel 3D MQA dataset named City-3DQA for city-level scene understanding, which is the first dataset to incorporate scene semantics and human-environment interactive tasks within the city. From the method perspective, we propose a Scene graph enhanced City-level Understanding method (Sg-CityU), which utilizes the scene graph to introduce spatial semantics. A new benchmark is reported, and our proposed Sg-CityU achieves accuracies of 63.94% and 63.76% in different settings of City-3DQA. Compared to indoor 3D MQA methods and zero-shot methods using advanced large language models (LLMs), Sg-CityU demonstrates state-of-the-art (SOTA) performance in robustness and generalization.