Learning to Discover Regulatory Elements for Gene Expression Prediction
Xingyu Su, Haiyang Yu, Degui Zhi, Shuiwang Ji
2025-02-24
Summary
This paper introduces Seq2Exp, a neural network that predicts how active a gene will be from its DNA sequence by finding the control regions, called regulatory elements, that drive that gene's activity.
What's the problem?
Scientists have a hard time figuring out which parts of DNA control gene activity. This makes it difficult to predict how active a gene will be just by looking at its DNA sequence. Current methods aren't very accurate at finding these control regions or predicting gene activity.
What's the solution?
The researchers created Seq2Exp, a model that finds regulatory elements in DNA and uses them to predict gene activity. It looks at both the DNA sequence and epigenomic signals measured in the cell to work out which regions actually control the target gene. To keep only the useful information, Seq2Exp applies an information bottleneck based on the Beta distribution: it passes the regions that appear to drive gene activity and filters out the rest, which makes its predictions more accurate.
Why does it matter?
This matters because understanding how genes are controlled is crucial for biology and medicine. Better predictions of gene activity could help scientists understand diseases, develop new treatments, and even design custom genes. Seq2Exp's ability to find important DNA regions more accurately than other methods could speed up research and lead to new discoveries about how our genes work.
Abstract
We consider the problem of predicting gene expressions from DNA sequences. A key challenge of this task is to find the regulatory elements that control gene expressions. Here, we introduce Seq2Exp, a Sequence to Expression network explicitly designed to discover and extract regulatory elements that drive target gene expression, enhancing the accuracy of the gene expression prediction. Our approach captures the causal relationship between epigenomic signals, DNA sequences and their associated regulatory elements. Specifically, we propose to decompose the epigenomic signals and the DNA sequence conditioned on the causal active regulatory elements, and apply an information bottleneck with the Beta distribution to combine their effects while filtering out non-causal components. Our experiments demonstrate that Seq2Exp outperforms existing baselines in gene expression prediction tasks and discovers influential regions compared to commonly used statistical methods for peak detection such as MACS3. The source code is released as part of the AIRS library (https://github.com/divelab/AIRS/).
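To make the idea of an information bottleneck with a Beta distribution more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation and does not use the AIRS code. It assumes each sequence position gets a soft mask value drawn from Beta(alpha_i, beta_i), and that a KL-divergence penalty against a sparse prior discourages the mask from letting every position through. The function names, the Beta(1, 9) prior, and the toy numbers are all hypothetical; in the paper the mask parameters would be derived from the DNA sequence and epigenomic signals, which this sketch does not model.

```python
import math
import random


def digamma(x):
    """Approximate the digamma function for x > 0.

    Uses the recurrence psi(x) = psi(x + 1) - 1/x to shift x above 6,
    then a standard asymptotic series.
    """
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series


def log_beta_fn(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def kl_beta(a1, b1, a2, b2):
    """Closed-form KL( Beta(a1, b1) || Beta(a2, b2) )."""
    return (
        log_beta_fn(a2, b2) - log_beta_fn(a1, b1)
        + (a1 - a2) * digamma(a1)
        + (b1 - b2) * digamma(b1)
        + (a2 - a1 + b2 - b1) * digamma(a1 + b1)
    )


def masked_features(features, alphas, betas, rng):
    """Sample a soft mask m_i ~ Beta(alpha_i, beta_i) and scale each position by it."""
    mask = [rng.betavariate(a, b) for a, b in zip(alphas, betas)]
    return [m * f for m, f in zip(mask, features)], mask


def bottleneck_penalty(alphas, betas, prior_a=1.0, prior_b=9.0):
    """Mean KL from each position's mask distribution to a sparse prior.

    The Beta(1, 9) prior (mean 0.1) is an assumption made for illustration.
    """
    total = sum(kl_beta(a, b, prior_a, prior_b) for a, b in zip(alphas, betas))
    return total / len(alphas)


# Toy example: four positions, only the second is treated as "regulatory".
rng = random.Random(0)
features = [0.2, 1.5, 0.7, 0.1]   # hypothetical per-position features
alphas = [1.0, 8.0, 1.0, 1.0]     # hypothetical mask parameters
betas = [9.0, 2.0, 9.0, 9.0]

masked, mask = masked_features(features, alphas, betas, rng)
penalty = bottleneck_penalty(alphas, betas)
```

In a trained model, a penalty of this kind would be added to the prediction loss with a weighting coefficient, so the model pays a cost for every position whose mask departs from the sparse prior. Positions whose mask matches the prior contribute zero penalty, which is why only the second position adds to `penalty` here.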