MicroVQA++: High-Quality Microscopy Reasoning Dataset with Weakly Supervised Graphs for Multimodal Large Language Model

Manyu Li, Ruian He, Chenxi Ma, Weimin Tan, Bo Yan

2025-11-18

Summary

This paper introduces a new, large dataset called MicroVQA++ designed to help AI models better understand and reason about images from microscopes, a crucial area in biomedical research.

What's the problem?

Current AI models, specifically large language models that can process both text and images, struggle with complex reasoning tasks when analyzing microscope images. This is largely because there isn't enough high-quality, labeled data available to train these models effectively. Existing datasets are often too small or don't have enough reliable information to teach the AI how to truly *understand* what it's seeing.

What's the solution?

The researchers created MicroVQA++ in three steps. First, they started with existing, reliable image descriptions drawn from peer-reviewed scientific papers. Then, they used a system called HiCQA-Graph, a network connecting images, descriptions, and questions, to automatically check for inconsistencies and filter out bad data. Finally, they had an AI model generate multiple-choice questions about the images, which were then reviewed by humans to ensure quality. This resulted in a large, carefully checked dataset.
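The paper does not spell out the exact scoring used by HiCQA-Graph, but the filtering idea can be sketched as fusing three consistency signals per (image, caption, QA) sample and dropping samples below a threshold. In this sketch the NLI, CLIP, and agent scores are assumed to be precomputed floats in [0, 1], and the fusion weights and threshold are illustrative, not the paper's values:

```python
from dataclasses import dataclass

@dataclass
class Sample:
    # One (image, caption, QA) triple with precomputed consistency scores.
    # In HiCQA-Graph these would come from an NLI model (caption vs. QA text),
    # a CLIP model (image vs. text), and an MLLM agent; here they are assumed
    # to be given as floats in [0, 1].
    sample_id: str
    nli_entailment: float   # does the caption entail the QA text?
    clip_alignment: float   # does the image match the text?
    agent_score: float      # agent's judgment of the QA quality

def consistency_score(s: Sample, weights=(0.4, 0.4, 0.2)) -> float:
    """Fuse the three signals into a single score (weights are illustrative)."""
    w_nli, w_clip, w_agent = weights
    return w_nli * s.nli_entailment + w_clip * s.clip_alignment + w_agent * s.agent_score

def filter_samples(samples, threshold=0.6):
    """Keep only samples whose fused consistency score clears the threshold."""
    return [s for s in samples if consistency_score(s) >= threshold]

samples = [
    Sample("consistent", nli_entailment=0.9, clip_alignment=0.8, agent_score=0.9),
    Sample("mismatched", nli_entailment=0.2, clip_alignment=0.3, agent_score=0.1),
]
kept = filter_samples(samples)
print([s.sample_id for s in kept])  # the mismatched sample is filtered out
```

The key design point is that no single signal decides alone: a caption that reads well but does not match the image (low CLIP score) or a question the caption does not actually support (low NLI score) both pull the fused score down.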

Why it matters?

This work is important because it provides a much-needed resource for training AI models to perform scientific reasoning on microscope images. By creating a high-quality dataset and a method for verifying data consistency, the researchers have shown that relatively small AI models can achieve impressive results, even matching the performance of some of the most advanced closed-source models. This could lead to faster discoveries in biology and medicine by helping researchers analyze images more efficiently and accurately.

Abstract

Multimodal Large Language Models are increasingly applied to biomedical imaging, yet scientific reasoning for microscopy remains limited by the scarcity of large-scale, high-quality training data. We introduce MicroVQA++, a three-stage, large-scale and high-quality microscopy VQA corpus derived from the BIOMEDICA archive. Stage one bootstraps supervision from expert-validated figure-caption pairs sourced from peer-reviewed articles. Stage two applies HiCQA-Graph, a novel heterogeneous graph over images, captions, and QAs that fuses NLI-based textual entailment, CLIP-based vision-language alignment, and agent signals to identify and filter inconsistent samples. Stage three uses a MultiModal Large Language Model (MLLM) agent to generate multiple-choice questions (MCQ) followed by human screening. The resulting release comprises a large training split and a human-checked test split whose Bloom's level hard-sample distribution exceeds the MicroVQA benchmark. Our work delivers (i) a quality-controlled dataset that couples expert literature with graph-based filtering and human refinement; (ii) HiCQA-Graph, the first graph that jointly models (image, caption, QA) for cross-modal consistency filtering; (iii) evidence that careful data construction enables 4B-scale MLLMs to reach competitive microscopy reasoning performance (e.g., GPT-5) and achieve state-of-the-art performance among open-source MLLMs. Code and dataset will be released after the review process concludes.