MARS2 2025 Challenge on Multimodal Reasoning: Datasets, Methods, Results, Discussion, and Outlook

Peng Xu, Shengwu Xiong, Jiajun Zhang, Yaxiong Chen, Bowen Zhou, Chen Change Loy, David A. Clifton, Kyoung Mu Lee, Luc Van Gool, Ruiming He, Ruilin Yao, Xinwei Long, Jirui Huang, Kai Tian, Sa Yang, Yihua Shao, Jin Feng, Yue Zhong, Jiakai Zhou, Cheng Tang, Tianyu Zou, Yifang Zhang

2025-09-18

Summary

This paper details the MARS2 2025 Challenge, an effort to push the boundaries of multimodal machine learning, the task of getting computers to understand information from different sources, such as images and text, at the same time. The challenge is designed to be a central place for researchers to track progress in this quickly evolving field.

What's the problem?

The field of multimodal learning, especially when combined with large language models, is moving incredibly fast. It was becoming difficult to compare different approaches and see what truly works best. Existing benchmarks weren’t focused enough on real-world applications or specialized tasks, limiting the development of more practical AI systems.

What's the solution?

To address this, the researchers created the MARS2 challenge, including two new datasets called Lens and AdsQA. Lens tests general reasoning skills across 12 everyday scenarios, while AdsQA focuses on understanding advertisement videos. They then invited teams to compete in three tracks: grounding phrases to objects in real-world images, answering questions about images with spatial awareness, and reasoning about creative advertisement videos. In total, 76 teams registered and more than 40 valid submissions were included in the final rankings; the datasets, code, and results are all publicly available. A rough sketch of how the grounding track might be scored follows below.
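To make the grounding track more concrete, here is a minimal sketch of how a Visual Grounding (VG-RS) prediction might be scored. It assumes the common convention in grounding benchmarks that a predicted bounding box counts as correct when its intersection-over-union (IoU) with the ground-truth box is at least 0.5; the threshold, the function names, and the example boxes are illustrative assumptions, not the challenge's published protocol.

```python
# Minimal sketch of IoU-based scoring for a visual grounding prediction.
# Assumption: a box is "correct" if IoU with the ground truth >= 0.5,
# a common convention in grounding benchmarks (not confirmed for MARS2).

def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    # Corners of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Hypothetical prediction/ground-truth pair for one grounding query.
pred = (48, 30, 210, 180)
gold = (50, 32, 200, 175)
print("correct" if iou(pred, gold) >= 0.5 else "incorrect")  # -> correct
```

Accuracy over a test set would then simply be the fraction of queries whose predicted box clears the IoU threshold.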

Why it matters?

This work is important because it provides a standardized way to evaluate and compare multimodal AI models. By focusing on realistic scenarios and making the resources publicly available, it encourages further research and development in this area, ultimately leading to more capable and useful AI systems that can understand the world around us more like humans do.

Abstract

This paper reviews the MARS2 2025 Challenge on Multimodal Reasoning. We aim to bring together different approaches in multimodal machine learning and LLMs via a large benchmark, and we hope it allows researchers to better follow the state of the art in this very dynamic area. Meanwhile, a growing number of testbeds have boosted the evolution of general-purpose large language models; thus, this year's MARS2 focuses on real-world and specialized scenarios to broaden the multimodal reasoning applications of multimodal large language models (MLLMs). Our organizing team released two tailored datasets, Lens and AdsQA, as test sets, which support general reasoning in 12 daily scenarios and domain-specific reasoning in advertisement videos, respectively. We evaluated 40+ baselines that include both generalist MLLMs and task-specific models, and opened three competition tracks, i.e., Visual Grounding in Real-world Scenarios (VG-RS), Visual Question Answering with Spatial Awareness (VQA-SA), and Visual Reasoning in Creative Advertisement Videos (VR-Ads). Finally, 76 teams from renowned academic and industrial institutions registered, and 40+ valid submissions (out of 1200+) were included in our ranking lists. Our datasets, code sets (40+ baselines and 15+ participants' methods), and rankings are publicly available on the MARS2 workshop website and our GitHub organization page https://github.com/mars2workshop/, where updates and announcements of upcoming events will be continuously posted.