Energy Efficient Protein Language Models: Leveraging Small Language Models with LoRA for Controllable Protein Generation
Aayush Shah, Shankar Jayaratnam
2024-11-12

Summary
This paper shows how to build energy-efficient protein language models by fine-tuning small language models with Low-Rank Adaptation (LoRA) to generate proteins with user-specified properties.
What's the problem?
Large language models (LLMs) are great for many tasks, but they often require a lot of computational power and resources, especially when generating proteins. Most existing protein models are large and specialized for specific tasks, making them less efficient and harder to use for broader applications. This limits their effectiveness in fields like drug development and protein engineering, where quick and accurate generation of protein sequences is crucial.
What's the solution?
The authors introduce two smaller protein language models based on the Llama-3-8B and Phi-3-mini architectures. These models support two modes: uncontrollable generation, where the model creates proteins without specific instructions, and controllable generation, where users specify the desired properties of the protein in the prompt. The authors use Low-Rank Adaptation (LoRA) to cut the number of trainable parameters to just 4% of the original model size, which lowers computational requirements; combined with training on a subset of the UniRef50 dataset, this reduces overall training time by 70%. The smaller Phi-3-mini model further cuts trainable parameters by 60% and training costs by 30% compared to Llama 3, while still achieving high performance in generating proteins.
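To make the LoRA idea concrete, here is a minimal sketch of how a small causal language model can be wrapped with LoRA adapters using the Hugging Face transformers and peft libraries. The base checkpoint name, LoRA rank, scaling factor, and target modules below are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: LoRA fine-tuning setup for a small causal LM on protein data.
# Base checkpoint, rank, alpha, and target modules are assumptions for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

base_model_name = "microsoft/Phi-3-mini-4k-instruct"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name)

# LoRA injects small low-rank matrices into the attention projections;
# only these matrices are trained while the base weights stay frozen.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                      # low-rank dimension (assumed)
    lora_alpha=32,             # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
)
model = get_peft_model(model, lora_config)

# Shows how small the trainable fraction is relative to the full model.
model.print_trainable_parameters()
```

Because only the adapter weights receive gradients, optimizer state and gradient memory shrink accordingly, which is where most of the reported savings in compute and training cost come from.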
Why it matters?
This research is important because it shows that smaller models can generate proteins about as well as much larger ones, making protein language models more accessible to researchers and developers. By reducing energy consumption and training costs, these models can help advance protein engineering, leading to faster discoveries in medicine and biotechnology and, ultimately, to effective drugs and treatments being developed more quickly.
Abstract
Large language models (LLMs) have demonstrated significant success in natural language processing (NLP) tasks and have shown promising results in other domains such as protein sequence generation. However, there remain salient differences between LLMs used for NLP, which effectively handle multiple tasks and are available in small sizes, and protein language models that are often specialized for specific tasks and only exist in larger sizes. In this work, we introduce two small protein language models, based on Llama-3-8B and Phi-3-mini, that are capable of both uncontrollable and controllable protein generation. For the uncontrollable generation task, our best model achieves an average pLDDT score of 69.75, demonstrating robust performance in generating viable protein structures. For the controllable generation task, in which the model generates proteins according to properties specified in the prompt, we achieve a remarkable average TM-Score of 0.84, indicating high structural similarity to target proteins. We chose 10 properties, including six classes of enzymes, to extend the capabilities of prior protein language models. Our approach utilizes the Low-Rank Adaptation (LoRA) technique, reducing trainable parameters to just 4% of the original model size, lowering computational requirements. By using a subset of the UniRef50 dataset and small models, we reduced the overall training time by 70% without compromising performance. Notably, Phi-3-mini reduced trainable parameters by 60%, decreasing training cost by 30% compared to Llama 3. Nevertheless, Phi-3-mini achieved a comparable TM-Score of 0.81, demonstrating that smaller models can match the performance of larger ones, like Llama 3. We also demonstrate the deployment of our models on the energy efficient ET-SoC-1 chip, significantly improving the TPS/W by a factor of 3.
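For intuition on the controllable generation task, the sketch below continues from the LoRA setup above (reusing its `model` and `tokenizer`) and samples a sequence conditioned on a property tag in the prompt. The prompt template, property name, and sampling settings are assumptions; the paper's exact conditioning format is not reproduced here.

```python
# Illustrative sketch of controllable generation: condition the fine-tuned model
# on a desired property (e.g., an enzyme class) and sample a protein sequence.
# Reuses `model` and `tokenizer` from the LoRA setup sketch above.
import torch

prompt = "[Generate protein] property=hydrolase sequence="  # hypothetical template
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=256,   # cap on generated residues (assumed)
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
    )

generated = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(generated)
```

In practice, generated sequences would then be folded (e.g., with a structure predictor) to compute pLDDT or compared against target structures with TM-Score, which is how the paper evaluates quality.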