From Macro to Micro: Benchmarking Microscopic Spatial Intelligence on Molecules via Vision-Language Models

Zongzhao Li, Xiangzhe Kong, Jiahui Su, Zongyang Ma, Mingze Li, Songyou Li, Yuelin Zhang, Yu Rong, Tingyang Xu, Deli Zhao, Wenbing Huang

2025-12-12

Summary

This paper explores whether artificial intelligence, specifically models that can understand both images and language, can 'understand' the spatial arrangements of tiny things like molecules, a skill humans draw on naturally when making scientific discoveries.

What's the problem?

Understanding how things are positioned in space is crucial for science, especially when dealing with things too small to see directly. Current AI models are really good at recognizing objects in pictures and answering questions about them, but it's unclear if they can apply that ability to the microscopic world and reason about the relationships between atoms and molecules. There wasn't a good way to *test* this ability systematically.

What's the solution?

The researchers created a large and challenging test, called MiSI-Bench, with over 163,000 questions and about 587,000 images derived from roughly 4,000 molecular structures. The questions span nine tasks that probe different levels of spatial understanding, from simple rotations to identifying complex connections like hydrogen bonds. They then evaluated existing AI models on this benchmark to see how well they performed, and also experimented with improving a smaller model by fine-tuning it specifically on this type of data.

Why it matters?

This work is important because it highlights a gap in current AI capabilities: models struggle with the kind of spatial reasoning needed for scientific breakthroughs. While AI can be powerful, it needs more than general image and language skills; it needs specific scientific knowledge to truly assist in discovery. The results show that with focused training, AI can even *exceed* human performance in certain spatial transformation tasks, but it still needs improvement in areas requiring deeper scientific understanding, such as recognizing hydrogen bonds.

Abstract

This paper introduces the concept of Microscopic Spatial Intelligence (MiSI), the capability to perceive and reason about the spatial relationships of invisible microscopic entities, which is fundamental to scientific discovery. To assess the potential of Vision-Language Models (VLMs) in this domain, we propose a systematic benchmark framework MiSI-Bench. This framework features over 163,000 question-answer pairs and 587,000 images derived from approximately 4,000 molecular structures, covering nine complementary tasks that evaluate abilities ranging from elementary spatial transformations to complex relational identifications. Experimental results reveal that current state-of-the-art VLMs perform significantly below human level on this benchmark. However, a fine-tuned 7B model demonstrates substantial potential, even surpassing humans in spatial transformation tasks, while its poor performance in scientifically-grounded tasks like hydrogen bond recognition underscores the necessity of integrating explicit domain knowledge for progress toward scientific AGI. The datasets are available at https://huggingface.co/datasets/zongzhao/MiSI-bench.
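The benchmark data is hosted on the Hugging Face Hub at the repository named in the abstract. A minimal sketch of fetching it with the third-party `datasets` library follows; the `split` name is an assumption (the dataset card may define different splits or configurations), and only the repository id comes from the paper.

```python
def load_misi_bench(split: str = "train"):
    """Fetch MiSI-Bench question-answer pairs from the Hugging Face Hub.

    Assumptions: the `datasets` package is installed (pip install datasets)
    and the split name "train" exists; check the dataset card for the
    actual splits/configs. Repo id is taken from the paper's abstract.
    """
    # Imported lazily so this module loads even without `datasets` installed.
    from datasets import load_dataset

    # Downloads (and locally caches) the requested split of the benchmark.
    return load_dataset("zongzhao/MiSI-bench", split=split)
```

Calling `load_misi_bench()` would download the benchmark to the local Hugging Face cache; iterate over the returned dataset to inspect individual question-answer pairs.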