A Large Encoder-Decoder Family of Foundation Models For Chemical Language
Eduardo Soares, Victor Shirasuna, Emilio Vital Brazil, Renato Cerqueira, Dmitry Zubarev, Kristin Schmidt
2024-07-31

Summary
This paper discusses the development of a new family of large encoder-decoder models designed for understanding and generating chemical language. These models are trained to perform various tasks in chemistry, such as predicting properties of molecules and generating new molecular structures.
What's the problem?
Deep learning models often struggle with tasks in cheminformatics (the study of chemical data) due to a lack of sufficient training data specifically designed for chemical language. Most existing models rely heavily on labeled datasets, which are time-consuming and expensive to create. This limits their ability to generalize well to new, unseen chemical data.
What's the solution?
To address these challenges, the authors built a large encoder-decoder foundation model trained on a curated dataset of 91 million SMILES strings (a text-based way to represent chemical structures) sourced from PubChem. Pre-training on this data lets the model learn the relationships and patterns in chemical language without needing extensive labeled examples. The model can handle complex tasks like predicting quantum properties and can be fine-tuned for specific applications, making it versatile for different chemistry-related problems.
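To make the "pre-train on SMILES, then fine-tune for a property" workflow concrete, here is a minimal, hypothetical sketch in PyTorch. The `SmilesEncoder` class, the character-level tokenizer, the dimensions, and the toy data are all illustrative placeholders, not the paper's released model or checkpoints; the point is only the pattern of freezing a pre-trained encoder and training a small task head on a few labeled molecules.

```python
# Illustrative sketch of the pre-train-then-fine-tune workflow (toy placeholders,
# not the paper's actual architecture or data).
import torch
import torch.nn as nn

# Toy character-level SMILES vocabulary (a real model would use a learned tokenizer).
VOCAB = {ch: i + 1 for i, ch in enumerate("()[]=#@+-.\\/123456789%BCNOPSFIclnosr")}
PAD_ID = 0

def tokenize(smiles: str, max_len: int = 64) -> torch.Tensor:
    ids = [VOCAB.get(ch, PAD_ID) for ch in smiles][:max_len]
    ids += [PAD_ID] * (max_len - len(ids))
    return torch.tensor(ids)

class SmilesEncoder(nn.Module):
    """Stand-in for a large pre-trained SMILES encoder (kept frozen below)."""
    def __init__(self, vocab_size: int = 64, d_model: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model, padding_idx=PAD_ID)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        pad_mask = token_ids == PAD_ID
        hidden = self.encoder(self.embed(token_ids), src_key_padding_mask=pad_mask)
        return hidden.mean(dim=1)          # pooled molecule embedding

encoder = SmilesEncoder()                  # imagine loading pre-trained weights here
for p in encoder.parameters():
    p.requires_grad = False                # freeze the foundation model

head = nn.Linear(128, 1)                   # small task-specific regression head
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

# Toy labeled examples (SMILES string, arbitrary property value).
data = [("CCO", 0.31), ("c1ccccc1", 1.69), ("CC(=O)O", -0.17)]
tokens = torch.stack([tokenize(s) for s, _ in data])
targets = torch.tensor([[y] for _, y in data])

for _ in range(100):                       # brief fine-tuning loop
    pred = head(encoder(tokens))
    loss = nn.functional.mse_loss(pred, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because only the small head is updated, this kind of fine-tuning needs far fewer labeled examples than training a property predictor from scratch, which is the practical payoff of the large-scale pre-training described above.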
Why it matters?
This research is important because it advances the field of cheminformatics by providing a powerful tool for chemists and researchers. By improving the ability of models to understand and generate chemical language, this work can accelerate discoveries in drug development, materials science, and other areas that rely on chemical information. It also demonstrates the potential of using large-scale pre-training methods to enhance machine learning models in specialized fields.
Abstract
Large-scale pre-training methodologies for chemical language models represent a breakthrough in cheminformatics. These methods excel in tasks such as property prediction and molecule generation by learning contextualized representations of input tokens through self-supervised learning on large unlabeled corpora. Typically, this involves pre-training on unlabeled data followed by fine-tuning on specific tasks, reducing dependence on annotated datasets and broadening chemical language representation understanding. This paper introduces a family of large encoder-decoder chemical foundation models pre-trained on a curated dataset of 91 million SMILES samples sourced from PubChem, equivalent to 4 billion molecular tokens. The proposed foundation model supports different complex tasks, including quantum property prediction, and offers flexibility with two main variants (289M and 8×289M). Our experiments across multiple benchmark datasets validate the capacity of the proposed model to provide state-of-the-art results for different tasks. We also provide a preliminary assessment of the compositionality of the embedding space as a prerequisite for reasoning tasks. We demonstrate that the produced latent space is separable compared to the state-of-the-art, with few-shot learning capabilities.
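As a rough illustration of the kind of few-shot probe the abstract alludes to, the sketch below fits a simple k-nearest-neighbors classifier in a frozen molecule-embedding space using only a handful of labeled examples per class. The `embed_smiles` function, the labels, and the molecules are placeholders introduced here for illustration; in practice the embeddings would come from the pre-trained encoder, and strong few-shot accuracy would indicate a separable latent space.

```python
# Hypothetical few-shot probe over frozen molecule embeddings.
# `embed_smiles` is a placeholder standing in for the pre-trained encoder.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def embed_smiles(smiles: str, dim: int = 128) -> np.ndarray:
    # Placeholder: pseudo-embedding derived from the string's hash.
    rng = np.random.default_rng(abs(hash(smiles)) % (2**32))
    return rng.normal(size=dim)

# A few labeled molecules per class (toy labels, e.g. active vs. inactive).
support_smiles = ["CCO", "CCN", "CCC", "c1ccccc1", "c1ccncc1", "c1ccco1"]
support_labels = [0, 0, 0, 1, 1, 1]
query_smiles = ["CCCl", "c1ccsc1"]

X_support = np.stack([embed_smiles(s) for s in support_smiles])
X_query = np.stack([embed_smiles(s) for s in query_smiles])

# Few-shot classification: a simple k-NN probe in the frozen embedding space.
probe = KNeighborsClassifier(n_neighbors=3)
probe.fit(X_support, support_labels)
print(probe.predict(X_query))
```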