SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers
Shraman Pramanick, Rama Chellappa, Subhashini Venugopalan
2024-07-15

Summary
This paper introduces SPIQA, a new dataset for multimodal question answering on scientific papers, enabling questions about both the text and visual elements such as figures and tables.
What's the problem?
Existing datasets for question answering based on scientific papers mainly focus on text and are limited in size and scope. This makes it difficult for AI systems to understand the full context of research articles, especially when important information is presented in figures and tables. Readers often struggle to find answers efficiently within long and complex documents.
What's the solution?
SPIQA addresses this gap with a large-scale dataset of 270,000 questions grounded in the visual elements of scientific papers, such as plots, charts, tables, and schematic diagrams, split into training, validation, and three evaluation sets. The dataset was built through a combination of automatic curation with multimodal large language models (MLLMs) and human curation to ensure quality. Its questions require reasoning over both the text and the visuals, making it a comprehensive resource for training and evaluating multimodal question-answering systems. The researchers also propose a Chain-of-Thought (CoT) evaluation strategy with in-context retrieval that enables fine-grained, step-by-step assessment of how well these systems reason over the information presented.
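To make the task concrete, below is an illustrative sketch of how a figure-grounded question-answer pair might be represented and turned into a text prompt. The field names, the QARecord class, and the build_prompt helper are hypothetical conveniences for this sketch, not the dataset's actual schema.

```python
# Illustrative only: a hypothetical record layout for a figure-grounded QA pair.
# Field names are assumptions for this sketch, not SPIQA's actual schema.
from dataclasses import dataclass

@dataclass
class QARecord:
    paper_id: str                  # identifier of the source paper
    question: str                  # question about a figure or table
    referenced_figures: list[str]  # image file names the question refers to
    captions: list[str]            # captions of the referenced figures/tables
    answer: str                    # free-form ground-truth answer

example = QARecord(
    paper_id="1234.56789",
    question="Which method achieves the lowest error in the ablation table?",
    referenced_figures=["1234.56789-Table3.png"],
    captions=["Table 3: Ablation of each module on the validation set."],
    answer="The full model with both modules enabled.",
)

def build_prompt(record: QARecord) -> str:
    """Pair the referenced captions with the question in a simple text prompt."""
    caption_block = "\n".join(record.captions)
    return f"Figure/table captions:\n{caption_block}\n\nQuestion: {record.question}\nAnswer:"

print(build_prompt(example))
```

A real pipeline would additionally attach the referenced image files to the prompt so that a multimodal model can inspect the figures themselves, not just their captions.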
Why it matters?
This research matters because it helps AI systems interact with scientific literature more effectively. By allowing them to answer questions based on both text and visual data, SPIQA can help researchers, students, and the general public better understand complex scientific topics, which could lead to improved learning outcomes and more efficient research processes.
Abstract
Seeking answers to questions within long scientific research articles is a crucial area of study that aids readers in quickly addressing their inquiries. However, existing question-answering (QA) datasets based on scientific papers are limited in scale and focus solely on textual content. To address this limitation, we introduce SPIQA (Scientific Paper Image Question Answering), the first large-scale QA dataset specifically designed to interpret complex figures and tables within the context of scientific research articles across various domains of computer science. Leveraging the breadth of expertise and ability of multimodal large language models (MLLMs) to understand figures, we employ automatic and manual curation to create the dataset. We craft an information-seeking task involving multiple images that cover a wide variety of plots, charts, tables, schematic diagrams, and result visualizations. SPIQA comprises 270K questions divided into training, validation, and three different evaluation splits. Through extensive experiments with 12 prominent foundational models, we evaluate the ability of current multimodal systems to comprehend the nuanced aspects of research articles. Additionally, we propose a Chain-of-Thought (CoT) evaluation strategy with in-context retrieval that allows fine-grained, step-by-step assessment and improves model performance. We further explore the upper bounds of performance enhancement with additional textual information, highlighting its promising potential for future research and the dataset's impact on revolutionizing how we interact with scientific literature.
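For readers curious about the evaluation strategy mentioned above, the following is a minimal sketch of how a two-step Chain-of-Thought evaluation with in-context retrieval could be wired up: the model first selects the relevant figures or tables from their captions, then answers step by step conditioned on that selection. The prompts and the query_mllm callable are assumptions for illustration, not the paper's implementation.

```python
# A minimal sketch of two-step CoT evaluation with in-context retrieval.
# The prompts and the `query_mllm` helper are assumptions, not the paper's code.
from typing import Callable

def cot_with_retrieval(question: str,
                       captions: list[str],
                       query_mllm: Callable[[str], str]) -> str:
    # Step 1: ask the model which figures/tables are relevant to the question.
    retrieval_prompt = (
        "You are given captions of all figures and tables in a paper.\n"
        + "\n".join(f"[{i}] {c}" for i, c in enumerate(captions))
        + f"\n\nQuestion: {question}\n"
        "List the indices of the captions needed to answer the question."
    )
    selected = query_mllm(retrieval_prompt)

    # Step 2: answer step by step, conditioned on the selected context.
    answer_prompt = (
        f"Relevant figures/tables: {selected}\n"
        f"Question: {question}\n"
        "Think step by step, then give the final answer."
    )
    return query_mllm(answer_prompt)

if __name__ == "__main__":
    # Stub model standing in for an MLLM, just to show the control flow.
    dummy = lambda p: "[0]" if "List the indices" in p else "The baseline model."
    print(cot_with_retrieval("Which model performs best in Table 1?",
                             ["Table 1: Main results on the test split."], dummy))
```

In practice, the second step would also pass the selected images to the multimodal model; the stub above only demonstrates the retrieve-then-reason control flow.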