
UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios

Baichuan Zhou, Haote Yang, Dairong Chen, Junyan Ye, Tianyi Bai, Jinhua Yu, Songyang Zhang, Dahua Lin, Conghui He, Weijia Li

2024-09-02


Summary

This paper presents UrBench, a new benchmark designed to evaluate large multimodal models in complex urban environments using multiple views.

What's the problem?

Few existing benchmarks for large multimodal models focus on urban environments, and those that do only test basic region-level tasks from a single view. This limits our ability to assess how well these models handle real-world urban scenarios, where reasoning across multiple perspectives of the same scene is crucial.

What's the solution?

UrBench addresses this problem with a comprehensive evaluation framework of 11,600 carefully curated questions spanning four task dimensions (geo-localization, scene reasoning, scene understanding, and object understanding) and 14 task types. The benchmark combines existing datasets with newly collected data from 11 cities, annotated using a cross-view detection-matching method, and includes both region-level and role-level questions, allowing for a more thorough assessment of model performance across different urban scenarios.
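To make the structure concrete, here is a minimal sketch of what a single UrBench-style question record might look like, written as a Python data class. The field names and example values are hypothetical, chosen only to illustrate how a task dimension, a region- or role-level label, and multiple image views could fit together; this is not the benchmark's actual schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class UrbanBenchmarkQuestion:
    """Hypothetical record for one multi-view urban benchmark question."""
    question_id: str
    task_dimension: str      # e.g. "Geo-Localization" or "Scene Understanding"
    level: str               # "region" or "role" (granularity of the task)
    image_paths: List[str]   # one or more views of the same location
    question: str            # the question text shown to the model
    choices: List[str]       # multiple-choice options
    answer: str              # ground-truth choice label

# Illustrative example (invented, not real data from UrBench)
example = UrbanBenchmarkQuestion(
    question_id="demo-0001",
    task_dimension="Scene Understanding",
    level="region",
    image_paths=["views/street_0001.jpg", "views/satellite_0001.jpg"],
    question="Which landmark is visible in both views?",
    choices=["A. Stadium", "B. Bridge", "C. Park", "D. Harbor"],
    answer="B",
)
```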

Why it matters?

This research is important because it helps improve the evaluation of large multimodal models, ensuring they can effectively understand and interact with complex urban environments. By highlighting the strengths and weaknesses of these models, UrBench can guide future developments in AI technologies that rely on accurate urban data interpretation.

Abstract

Recent evaluations of Large Multimodal Models (LMMs) have explored their capabilities in various domains, with only a few benchmarks specifically focusing on urban environments. Moreover, existing urban benchmarks have been limited to evaluating LMMs with basic region-level urban tasks under singular views, leading to incomplete evaluations of LMMs' abilities in urban environments. To address these issues, we present UrBench, a comprehensive benchmark designed for evaluating LMMs in complex multi-view urban scenarios. UrBench contains 11.6K meticulously curated questions at both region-level and role-level that cover 4 task dimensions: Geo-Localization, Scene Reasoning, Scene Understanding, and Object Understanding, totaling 14 task types. In constructing UrBench, we utilize data from existing datasets and additionally collect data from 11 cities, creating new annotations using a cross-view detection-matching method. With these images and annotations, we then integrate LMM-based, rule-based, and human-based methods to construct large-scale high-quality questions. Our evaluations on 21 LMMs show that current LMMs struggle in urban environments in several aspects. Even the best-performing GPT-4o lags behind humans in most tasks, ranging from simple tasks such as counting to complex tasks such as orientation, localization, and object attribute recognition, with an average performance gap of 17.4%. Our benchmark also reveals that LMMs exhibit inconsistent behaviors with different urban views, especially with respect to understanding cross-view relations. UrBench datasets and benchmark results will be publicly available at https://opendatalab.github.io/UrBench/.
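As a rough, self-contained illustration of how multiple-choice accuracy and a human-model gap like the one reported above could be computed once the data is released, the sketch below scores predictions against ground-truth answers. The inline sample questions, the dummy prediction function, and the assumed human baseline are placeholders for illustration, not the authors' evaluation code or real benchmark results.

```python
from typing import Callable, Dict, List

def accuracy(questions: List[Dict], predict: Callable[[Dict], str]) -> float:
    """Fraction of questions where the predicted choice matches the ground-truth answer."""
    correct = sum(1 for q in questions if predict(q) == q["answer"])
    return correct / len(questions)

# Tiny invented sample standing in for the released question files.
questions = [
    {"question": "How many crosswalks are visible?", "choices": ["A", "B", "C", "D"], "answer": "C"},
    {"question": "Which view faces north?", "choices": ["A", "B", "C", "D"], "answer": "A"},
]

model_acc = accuracy(questions, predict=lambda q: "A")  # dummy model that always answers "A"
human_acc = 0.90                                        # assumed human baseline, not a reported number
gap = human_acc - model_acc
print(f"model accuracy: {model_acc:.1%}, gap to humans: {gap:.1%}")
```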