
M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models

Chuhan Li, Ziyao Shangguan, Yilun Zhao, Deyuan Li, Yixin Liu, Arman Cohan

2024-11-08


Summary

This paper presents M3SciQA, a new benchmark designed to evaluate how well foundation models can answer scientific questions that require drawing on multiple documents and multiple types of data, such as text, figures, and tables.

What's the problem?

Most existing benchmarks for testing AI models focus on single documents and text-only tasks, which do not reflect the complexity of real scientific research. In actual research workflows, scientists often need to interpret information from various sources, including figures and tables, across multiple documents. This gap makes it hard to assess how well AI models can handle the diverse and interconnected information found in scientific literature.

What's the solution?

To address this issue, the researchers developed M3SciQA, a benchmark of 1,452 expert-annotated questions spanning 70 clusters of natural language processing papers, where each cluster consists of a primary paper and all the documents it cites. This setup mimics a real research workflow by requiring models to combine textual and visual information, such as figures and tables, from multiple sources. Using the benchmark, the researchers evaluated 18 foundation models on how well they could retrieve the relevant evidence and reason about it across these documents.
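To make the cluster-based setup concrete, here is a minimal sketch of how such a benchmark could be organized and scored. The field names, the `answer` method on the model, and the exact-match scoring are illustrative assumptions, not the paper's actual release format or evaluation protocol.

```python
from dataclasses import dataclass, field

# Hypothetical schema for illustration only -- field names are assumptions,
# not the benchmark's published data format.

@dataclass
class Document:
    paper_id: str
    text: str
    figures: list[str] = field(default_factory=list)   # paths to figure images
    tables: list[str] = field(default_factory=list)     # serialized tables

@dataclass
class PaperCluster:
    anchor: Document        # the primary NLP paper
    cited: list[Document]   # every document the anchor paper cites

@dataclass
class Question:
    cluster_id: str
    question: str
    answer: str             # expert-annotated reference answer

def evaluate(model, clusters: dict[str, PaperCluster],
             questions: list[Question]) -> float:
    """Toy accuracy loop: the model sees the whole cluster (text plus visuals)
    and must locate and reason over the relevant evidence to answer."""
    correct = 0
    for q in questions:
        cluster = clusters[q.cluster_id]
        # `model.answer` is a hypothetical interface standing in for whatever
        # retrieval-plus-reasoning pipeline a foundation model actually uses.
        prediction = model.answer(q.question, cluster)
        correct += int(prediction.strip().lower() == q.answer.strip().lower())
    return correct / len(questions)
```

In practice, answers to open-ended scientific questions would be graded with a more forgiving metric than exact string match, but the loop above captures the key idea: each question is tied to an entire cluster, not a single document.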

Why it matters?

This research is significant because it provides a more realistic way to test AI models that are used in scientific contexts. By focusing on multi-modal and multi-document tasks, M3SciQA helps improve the development of AI systems that can better assist researchers in analyzing complex scientific literature, ultimately leading to more effective tools for knowledge discovery.

Abstract

Existing benchmarks for evaluating foundation models mainly focus on single-document, text-only tasks. However, they often fail to fully capture the complexity of research workflows, which typically involve interpreting non-textual data and gathering information across multiple documents. To address this gap, we introduce M3SciQA, a multi-modal, multi-document scientific question answering benchmark designed for a more comprehensive evaluation of foundation models. M3SciQA consists of 1,452 expert-annotated questions spanning 70 natural language processing paper clusters, where each cluster represents a primary paper along with all its cited documents, mirroring the workflow of comprehending a single paper by requiring multi-modal and multi-document data. With M3SciQA, we conduct a comprehensive evaluation of 18 foundation models. Our results indicate that current foundation models still significantly underperform compared to human experts in multi-modal information retrieval and in reasoning across multiple scientific documents. Additionally, we explore the implications of these findings for the future advancement of applying foundation models in multi-modal scientific literature analysis.