ComProScanner: A multi-agent based framework for composition-property structured data extraction from scientific literature
Aritra Roy, Enrico Grisan, John Buckeridge, Chiara Gattinoni
2025-10-24
Summary
This paper introduces ComProScanner, a tool that automatically extracts chemical compositions and their associated properties from scientific papers, with a focus on materials science.
What's the problem?
While powerful language models now exist, it's still difficult for researchers to easily build organized datasets from the huge amount of information locked away in scientific publications. Existing tools aren't user-friendly or don't handle complex data well, and there's a particular shortage of good datasets for certain materials like ceramic piezoelectrics, hindering the development of new machine learning models.
What's the solution?
The researchers created ComProScanner, which uses multiple 'agents' powered by large language models to extract, validate, classify, and visualise data from journal articles. They tested it on 100 articles with ten different language models, both open-source and proprietary, extracting the compositions of ceramic piezoelectric materials and their piezoelectric strain coefficients (d33). DeepSeek-V3-0324 performed best, with an overall accuracy of 0.82.
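To make the idea of "composition-property structured data" concrete, here is a minimal Python sketch of what one extracted record and a simple accuracy check could look like. It is not ComProScanner's actual API or evaluation metric: the `Record` class, the `validate` and `exact_match_accuracy` functions, the article keys, and all values are hypothetical placeholders chosen for illustration, and exact-match scoring is an assumption that may differ from the metric used in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical structured record: one composition and its d33 value."""
    composition: str  # machine-readable formula string
    d33: float        # piezoelectric strain coefficient, assumed here in pC/N


def validate(record: Record) -> bool:
    """Toy validation step: non-empty composition and a positive d33."""
    return bool(record.composition.strip()) and record.d33 > 0


def exact_match_accuracy(extracted: dict, reference: dict) -> float:
    """Fraction of reference articles whose extracted record matches exactly.

    Assumption: exact equality of both fields counts as correct.
    """
    if not reference:
        return 0.0
    hits = sum(1 for key, ref in reference.items() if extracted.get(key) == ref)
    return hits / len(reference)


# Placeholder data for illustration only; not taken from the paper.
reference = {
    "article-1": Record("BaTiO3", 190.0),
    "article-2": Record("K0.5Na0.5NbO3", 80.0),
}
extracted = {
    "article-1": Record("BaTiO3", 190.0),
    "article-2": Record("K0.5Na0.5NbO3", 95.0),  # wrong d33 on purpose
}

accuracy = exact_match_accuracy(extracted, reference)
all_valid = all(validate(r) for r in extracted.values())
```

In this toy setup one of the two articles matches, so the score is 0.5. A real evaluation would also need to handle several compositions per article and formulas that are equivalent but written differently.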
Why does it matter?
ComProScanner offers a simple way for scientists to quickly create datasets from research papers, even if they aren't experts in programming or artificial intelligence. This makes it easier to train machine learning models and accelerate discoveries in materials science and other fields where data extraction from literature is a bottleneck.
Abstract
Since the advent of various pre-trained large language models, extracting structured knowledge from scientific text has experienced a revolutionary change compared with traditional machine learning or natural language processing techniques. Despite these advances, accessible automated tools that allow users to construct, validate, and visualise datasets from scientific literature extraction remain scarce. We therefore developed ComProScanner, an autonomous multi-agent platform that facilitates the extraction, validation, classification, and visualisation of machine-readable chemical compositions and properties, integrated with synthesis data from journal articles for comprehensive database creation. We evaluated our framework using 100 journal articles against 10 different LLMs, including both open-source and proprietary models, to extract highly complex compositions associated with ceramic piezoelectric materials and corresponding piezoelectric strain coefficients (d33), motivated by the lack of a large dataset for such materials. DeepSeek-V3-0324 outperformed all models with a significant overall accuracy of 0.82. This framework provides a simple, user-friendly, readily-usable package for extracting highly complex experimental data buried in the literature to build machine learning or deep learning datasets.