MergeDNA: Context-aware Genome Modeling with Dynamic Tokenization through Token Merging
Siyuan Li, Kai Yu, Anna Wang, Zicheng Liu, Chang Yu, Jingbo Zhou, Qirong Yang, Yucheng Guo, Xiaoming Zhang, Stan Z. Li
2025-11-24
Summary
This paper introduces MergeDNA, a method for modeling DNA sequences that improves how the building blocks of DNA are represented for computational models.
What's the problem?
Analyzing DNA is hard because its information isn't evenly spread out: some regions carry far more meaning than others. Scientists also haven't agreed on the best way to break DNA into meaningful units for computers to process. Existing methods, which treat DNA like plain text or rely on pre-defined 'words', struggle with this uneven complexity and don't adapt well to different types of DNA sequences.
What's the solution?
MergeDNA uses a two-part system. First, a tokenization module automatically groups DNA bases (A, T, C, G) into 'words' of varying lengths by learning to merge adjacent bases, focusing on informative patterns. Second, a 'Latent Encoder' captures the bigger picture: the global context and relationships between these 'words'. The model is then pre-trained with two tasks: one that reconstructs the merged tokens (which trains the tokenizer and identifies which 'words' are important), and another that masks and predicts those important 'words'. Together, this lets the model learn a dynamic vocabulary tailored to each DNA sequence.
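To make the first step concrete, here is a minimal sketch of merging adjacent base embeddings under a local-window constraint. The paper's actual merging blocks are differentiable and learned end-to-end; this greedy, non-differentiable version (with an assumed window rule that simply forbids merges across fixed index boundaries, and the function name `merge_adjacent_tokens` invented for illustration) only shows the basic idea of collapsing the most similar neighbouring tokens:

```python
import numpy as np

def merge_adjacent_tokens(x, n_merges, window=4):
    """Greedy sketch of local-window token merging.

    x: (n, d) array of token embeddings (initially one per DNA base).
    n_merges: number of merges; each merge averages one adjacent pair.
    window: merges may not cross boundaries between consecutive index
        windows of this size (a simplified locality constraint).
    """
    x = np.asarray(x, dtype=float)
    for _ in range(n_merges):
        # Cosine similarity of every adjacent token pair.
        a, b = x[:-1], x[1:]
        sims = np.sum(a * b, axis=1) / (
            np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8
        )
        # Disallow pairs that straddle a window boundary.
        idx = np.arange(len(sims))
        sims[(idx + 1) % window == 0] = -np.inf
        i = int(np.argmax(sims))          # most similar allowed pair
        merged = (x[i] + x[i + 1]) / 2.0  # merge by averaging
        x = np.concatenate([x[:i], merged[None, :], x[i + 2:]])
    return x
```

Each merge shrinks the sequence by one token, so repeated merging turns a long base-level sequence into a shorter sequence of variable-length 'words', with highly similar (low-information) neighbours collapsed first.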
Why it matters?
This research is important because it significantly improves the accuracy of analyzing DNA. MergeDNA outperforms existing methods on several tests, including predicting DNA function and understanding how different biological factors interact. This could lead to better understanding of diseases, development of new treatments, and advancements in fields like personalized medicine.
Abstract
Modeling genomic sequences faces two unsolved challenges: information density varies widely across different regions, and there is no clearly defined minimum vocabulary unit. Relying on either the four primitive bases or independently designed DNA tokenizers, existing approaches with naive masked language modeling pre-training often fail to adapt to the varying complexities of genomic sequences. Leveraging Token Merging techniques, this paper introduces a hierarchical architecture that jointly optimizes a dynamic genomic tokenizer and latent Transformers with context-aware pre-training tasks. In terms of network structure, the tokenization module automatically chunks adjacent bases into words by stacking multiple layers of differentiable token merging blocks with local-window constraints; a Latent Encoder then captures the global context of these merged words with full-attention blocks. Symmetrically employing a Latent Decoder and a Local Decoder, MergeDNA learns with two pre-training tasks: Merged Token Reconstruction simultaneously trains the dynamic tokenization module and adaptively filters important tokens, while Adaptive Masked Token Modeling learns to predict these filtered tokens to capture informative content. Extensive experiments show that MergeDNA achieves superior performance on three popular DNA benchmarks and several multi-omics tasks under fine-tuning or zero-shot evaluation, outperforming typical tokenization methods and large-scale DNA foundation models.
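The abstract does not spell out how Adaptive Masked Token Modeling selects which tokens to mask. A minimal sketch of the adaptive-masking idea, assuming per-token importance scores are available (for instance, derived from the reconstruction task) and sampling mask positions biased toward high-importance tokens; the function `adaptive_mask` and the softmax-weighted sampling are illustrative assumptions, not the paper's exact policy:

```python
import numpy as np

def adaptive_mask(scores, mask_ratio=0.3, rng=None):
    """Pick positions to mask, biased toward important tokens.

    scores: (n,) importance scores per merged token (assumed given,
        e.g. from the Merged Token Reconstruction task).
    mask_ratio: fraction of tokens to mask.
    Returns a boolean mask over the n tokens.
    """
    rng = np.random.default_rng(rng)
    n = len(scores)
    k = max(1, int(round(mask_ratio * n)))
    # Sample without replacement, proportional to softmax(scores),
    # so informative tokens are masked (and must be predicted) more often
    # than under uniform random masking.
    p = np.exp(scores - np.max(scores))
    p /= p.sum()
    chosen = rng.choice(n, size=k, replace=False, p=p)
    mask = np.zeros(n, dtype=bool)
    mask[chosen] = True
    return mask
```

Compared with the uniform masking of naive masked language modeling, concentrating the prediction targets on high-information tokens focuses the training signal where the sequence is hardest to model.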