Gumbel-Softmax Flow Matching with Straight-Through Guidance for Controllable Biological Sequence Generation
Sophia Tang, Yinuo Zhang, Alexander Tong, Pranam Chatterjee
2025-03-26
Summary
This paper introduces a generative AI framework for designing DNA, peptide, and protein sequences, together with a guidance method that steers the generated sequences toward desired properties.
What's the problem?
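Flow matching on the continuous simplex works well for DNA, which has a four-letter alphabet, but it struggles to scale to the higher-dimensional simplices needed for peptides and proteins, which are built from 20 amino acids. Steering generation toward specific functions is a further limitation of existing methods.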
What's the solution?
The researchers developed Gumbel-Softmax Flow and Score Matching, a generative framework built on a Gumbel-Softmax interpolant with a time-dependent temperature, which produces high-quality, diverse sequences. They pair it with Straight-Through Guided Flows (STGFlow), a training-free guidance method that uses classifiers pre-trained on clean sequences to steer generation toward desired properties.
Why does it matter?
The framework can accelerate the design of biological sequences with specific functions, with applications in medicine and biotechnology. The authors demonstrate it on conditional DNA promoter design, sequence-only protein generation, and the design of target-binding peptides for rare disease treatment.
Abstract
Flow matching in the continuous simplex has emerged as a promising strategy for DNA sequence design, but struggles to scale to higher simplex dimensions required for peptide and protein generation. We introduce Gumbel-Softmax Flow and Score Matching, a generative framework on the simplex based on a novel Gumbel-Softmax interpolant with a time-dependent temperature. Using this interpolant, we introduce Gumbel-Softmax Flow Matching by deriving a parameterized velocity field that transports from smooth categorical distributions to distributions concentrated at a single vertex of the simplex. We alternatively present Gumbel-Softmax Score Matching, which learns to regress the gradient of the probability density. Our framework enables high-quality, diverse generation and scales efficiently to higher-dimensional simplices. To enable training-free guidance, we propose Straight-Through Guided Flows (STGFlow), a classifier-based guidance method that leverages straight-through estimators to steer the unconditional velocity field toward optimal vertices of the simplex. STGFlow enables efficient inference-time guidance using classifiers pre-trained on clean sequences, and can be used with any discrete flow method. Together, these components form a robust framework for controllable de novo sequence generation. We demonstrate state-of-the-art performance in conditional DNA promoter design, sequence-only protein generation, and target-binding peptide design for rare disease treatment.
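Illustrative sketch
The abstract names two ingredients without giving formulas: a Gumbel-Softmax interpolant with a time-dependent temperature, and a straight-through estimator for classifier guidance. The Python sketch below shows the general idea behind each and is not the authors' implementation. The exponential temperature schedule, the parameters tau_max, decay, and noise_scale, the linear classifier, and all function names are assumptions made for illustration.

```python
import math
import random


def temperature(t, tau_max=10.0, decay=3.0):
    # Hypothetical schedule (an assumption): the temperature shrinks
    # exponentially as t runs from 0 (smooth) to 1 (peaked).
    return tau_max * math.exp(-decay * t)


def gumbel_softmax_interpolant(one_hot, t, rng, noise_scale=1.0):
    """Return a point on the simplex for a clean one-hot token at time t.

    Gumbel(0, 1) noise is added to the one-hot vector, and the result is
    divided by the temperature before a softmax. High temperature gives a
    smooth, near-uniform distribution; low temperature concentrates the
    mass near a single vertex.
    """
    tau = temperature(t)
    logits = []
    for x in one_hot:
        u = max(rng.random(), 1e-12)  # keep the inner log finite
        gumbel = -math.log(-math.log(u))
        logits.append((x + noise_scale * gumbel) / tau)
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def straight_through_grad(probs, weights):
    """Score a hard token with a linear classifier and return a
    straight-through gradient with respect to the softmax logits.

    Forward pass: the classifier sees the hard one-hot argmax of probs.
    Backward pass: the argmax is treated as the identity on probs, so the
    classifier gradient (here just weights) flows through the softmax
    Jacobian, d p_i / d z_j = p_i * (delta_ij - p_j).
    """
    k = max(range(len(probs)), key=probs.__getitem__)
    hard = [1.0 if i == k else 0.0 for i in range(len(probs))]
    score = sum(w * h for w, h in zip(weights, hard))
    mean_w = sum(w * p for w, p in zip(weights, probs))
    grad = [p * (w - mean_w) for p, w in zip(probs, weights)]
    return score, grad


if __name__ == "__main__":
    rng = random.Random(0)
    clean = [0.0, 1.0, 0.0, 0.0]  # e.g. one DNA base as a one-hot vector
    for t in (0.0, 0.5, 1.0):
        print(t, gumbel_softmax_interpolant(clean, t, rng))
```

The straight-through step is what allows a classifier trained only on clean, discrete sequences to be applied to a soft point on the simplex: the forward pass discretizes the point, while the backward pass still returns a usable gradient. How that gradient is combined with the unconditional velocity field is specific to STGFlow and is not reproduced here.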