Learning Molecular Representation in a Cell
Gang Liu, Srijit Seal, John Arevalo, Zhenwen Liang, Anne E. Carpenter, Meng Jiang, Shantanu Singh
2024-06-24

Summary
This paper presents Information Alignment (InfoAlign), a method for learning molecular representations that reflect how cells respond to small molecules. It aims to improve predictions of drug efficacy and safety by giving models a more complete view of cell states under molecular perturbation.
What's the problem?
Current molecular representation learning methods often fail to give a complete picture of how cells behave when exposed to small molecules. They also struggle to filter out irrelevant information (noise), which makes it hard for models to generalize and to accurately predict drug effects in living organisms.
What's the solution?
InfoAlign addresses this issue with the information bottleneck method. It builds a 'context graph' whose nodes are molecules and cellular response data (such as cell morphology and gene expression), connected by weighted edges based on chemical, biological, and computational criteria. For each molecule, a minimality objective compresses the learned representation so that redundant structural information is discarded, while a sufficiency objective decodes that representation to align with the feature spaces of the molecule's neighbors in the context graph. The paper reports that this approach outperforms existing methods in predicting molecular properties and in matching molecules with their morphological effects on cells.
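The context graph idea described above can be illustrated with a small sketch. This is not the authors' implementation: the node names, edge weights, and the helper functions `build_context_graph` and `neighborhood` are hypothetical, chosen only to show how molecules and cellular responses might sit in one weighted graph and how a molecule's neighborhood could be read off.

```python
from collections import defaultdict


def build_context_graph(weighted_edges):
    """Build an undirected weighted graph as a dict of dicts."""
    graph = defaultdict(dict)
    for u, v, w in weighted_edges:
        graph[u][v] = w
        graph[v][u] = w
    return dict(graph)


def neighborhood(graph, node, min_weight=0.0):
    """Return (neighbor, weight) pairs with weight >= min_weight, strongest first."""
    nbrs = [(n, w) for n, w in graph.get(node, {}).items() if w >= min_weight]
    return sorted(nbrs, key=lambda pair: pair[1], reverse=True)


# Hypothetical nodes and weights, for illustration only.
edges = [
    ("mol:A", "mol:B", 0.8),      # e.g. a chemical similarity criterion
    ("mol:A", "morph:A", 1.0),    # molecule linked to a cell morphology profile
    ("mol:A", "gene:A", 1.0),     # molecule linked to a gene expression profile
    ("morph:A", "morph:B", 0.3),  # e.g. a computed feature similarity
]
graph = build_context_graph(edges)

# Neighbors of one molecule whose edge weight passes an assumed threshold.
strong = neighborhood(graph, "mol:A", min_weight=0.5)
```

In this toy setup, the neighbors returned for a molecule are the nodes whose features a decoder would be asked to predict from the molecule's representation.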
Why does it matter?
This research matters because predicting how drugs will act in living organisms is crucial for developing safe and effective medications. By improving molecular representation learning, it could give scientists better insight into how molecules affect cells, supporting advances in drug discovery and personalized medicine.
Abstract
Predicting drug efficacy and safety in vivo requires information on biological responses (e.g., cell morphology and gene expression) to small molecule perturbations. However, current molecular representation learning methods do not provide a comprehensive view of cell states under these perturbations and struggle to remove noise, hindering model generalization. We introduce the Information Alignment (InfoAlign) approach to learn molecular representations through the information bottleneck method in cells. We integrate molecules and cellular response data as nodes into a context graph, connecting them with weighted edges based on chemical, biological, and computational criteria. For each molecule in a training batch, InfoAlign optimizes the encoder's latent representation with a minimality objective to discard redundant structural information. A sufficiency objective decodes the representation to align with different feature spaces from the molecule's neighborhood in the context graph. We demonstrate that the proposed sufficiency objective for alignment is tighter than existing encoder-based contrastive methods. Empirically, we validate representations from InfoAlign in two downstream tasks: molecular property prediction against up to 19 baseline methods across four datasets, plus zero-shot molecule-morphology matching.
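To make the two objectives in the abstract concrete, here is a minimal sketch of one common way to instantiate an information bottleneck loss (a variational form with a Gaussian encoder). It is an assumption for illustration, not the paper's exact formulation: the function names, the use of a KL term to a standard normal prior as the minimality term, the mean squared error as the sufficiency term, and the `beta` trade-off parameter are all hypothetical choices.

```python
import math


def gaussian_kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def bottleneck_loss(mu, log_var, decoded, targets, weights, beta=0.1):
    """Weighted sufficiency term plus a beta-scaled minimality term.

    decoded / targets: dicts mapping a feature space name to a vector.
    weights: dict mapping the same names to (assumed) context-graph edge weights.
    """
    sufficiency = sum(w * mse(decoded[k], targets[k]) for k, w in weights.items())
    minimality = gaussian_kl_to_standard_normal(mu, log_var)
    return sufficiency + beta * minimality


# Toy numbers, for illustration only.
loss = bottleneck_loss(
    mu=[0.0, 0.0],
    log_var=[0.0, 0.0],                  # encoder equals the prior, so KL is 0
    decoded={"morphology": [1.0, 2.0]},
    targets={"morphology": [1.0, 0.0]},
    weights={"morphology": 0.5},
)
```

Here the minimality term penalizes latent codes that carry more information than the prior allows, and the sufficiency term rewards codes from which neighboring feature spaces can be decoded; `beta` controls the balance between the two.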