Towards Data-Efficient Pretraining for Atomic Property Prediction
Yasir Ghunaim, Hasan Abed Al Kader Hammoud, Bernard Ghanem
2025-02-18
Summary
This paper introduces a more data-efficient way to pretrain AI models for atomic property prediction, prioritizing the relevance and quality of the training data over sheer volume.
What's the problem?
Current methods for training AI models in atomic property prediction rely on massive datasets and substantial computational power. This approach is expensive and inefficient, and adding more data does not always improve performance, especially when the extra data is only loosely related to the target task.
What's the solution?
The researchers developed a method for selecting smaller, more relevant pretraining datasets using a new metric called the Chemical Similarity Index (CSI), which measures how well an upstream dataset aligns with the downstream task. By pretraining on the dataset with the smallest CSI distance, they matched or exceeded the performance of models trained on much larger datasets while using only 1/24th of the compute. They also showed that adding poorly aligned data can actively degrade performance.
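The exact CSI formulation is not reproduced in this summary, but since the metric is described as inspired by the Fréchet Inception Distance, the core computation can be sketched as a Fréchet distance between two sets of feature vectors (e.g., pooled molecular-graph embeddings from upstream and downstream datasets). The function names and the embedding-matrix setup below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def _sqrtm_psd(m):
    # Matrix square root of a symmetric positive semi-definite matrix
    # via eigendecomposition (clipping tiny negative eigenvalues from
    # numerical noise).
    vals, vecs = np.linalg.eigh(m)
    vals = np.clip(vals, 0.0, None)
    return (vecs * np.sqrt(vals)) @ vecs.T

def frechet_distance(feats_a, feats_b):
    """Squared Fréchet distance between two feature sets, each modeled
    as a multivariate Gaussian (rows = samples, columns = features).

    d^2 = ||mu_a - mu_b||^2 + Tr(S_a + S_b - 2 (S_a S_b)^{1/2})
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((S_a S_b)^{1/2}) computed via the symmetric equivalent
    # (S_a^{1/2} S_b S_a^{1/2})^{1/2}, which stays PSD numerically.
    s = _sqrtm_psd(cov_a)
    cross = _sqrtm_psd(s @ cov_b @ s)
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * cross))

# Hypothetical usage: rows are per-molecule embedding vectors.
rng = np.random.default_rng(0)
upstream = rng.normal(size=(500, 4))
downstream = rng.normal(loc=0.5, size=(500, 4))
print(frechet_distance(upstream, downstream))
```

Under this reading, dataset selection amounts to computing this distance between each candidate upstream dataset and the downstream task, then pretraining on the candidate with the smallest value.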
Why it matters?
This research matters because it shows that focusing on high-quality, task-specific data can save time and resources while still improving AI performance. This approach could make AI tools for chemistry and materials science more accessible and efficient, helping scientists make discoveries faster without needing massive computational power.
Abstract
This paper challenges the recent paradigm in atomic property prediction that links progress to growing dataset sizes and computational resources. We show that pretraining on a carefully selected, task-relevant dataset can match or even surpass large-scale pretraining, while using as little as 1/24th of the computational cost. We introduce the Chemical Similarity Index (CSI), a novel metric for molecular graphs inspired by computer vision's Fréchet Inception Distance, which quantifies the alignment between upstream pretraining datasets and downstream tasks. By selecting the most relevant dataset with minimal CSI distance, we show that models pretrained on a smaller, focused dataset consistently outperform those pretrained on massive, mixed datasets such as JMP, even when those larger datasets include the relevant dataset. Counterintuitively, we also find that indiscriminately adding more data can degrade model performance when the additional data poorly aligns with the task at hand. Our findings highlight that quality often outperforms quantity in pretraining for atomic property prediction.