Genomic Next-Token Predictors are In-Context Learners
Nathan Breslow, Aayush Mishra, Mahler Revsine, Michael C. Schatz, Anqi Liu, Daniel Khashabi
2025-11-18
Summary
This research investigates whether 'in-context learning' – the ability of a model to pick up a pattern from examples given directly in its input – is unique to language, or whether it can also emerge in other kinds of sequential data, such as genetic code.
What's the problem?
Scientists have observed that large language models can learn 'in-context', meaning they can figure out a pattern from a few examples provided within the text itself. It was thought this ability might be unique to human language because of the way we structure sentences and ideas. The question this paper addresses is: is this learning ability actually a result of simply training a model to predict what comes next in *any* sequence, or is it something special about language?
What's the solution?
The researchers took a large model (Evo2) trained to predict the next building block (nucleotide) in a DNA sequence, much as language models predict the next word. They then gave it tasks that required recognizing abstract patterns, instantiated both as DNA sequences and in a language-like format. Comparing performance across different numbers of in-context examples, they found that the genomic model improved as more examples were provided, just as language models do. The experiments were tightly controlled so that genomic and linguistic models could be compared on equal footing.
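The evaluation described above can be sketched in two pieces: building a few-shot prompt over the nucleotide alphabet, and fitting the log-linear trend (accuracy improving linearly in the logarithm of the number of demonstrations) that the paper reports. The reversal task, the `>` separator, and the accuracy values below are illustrative assumptions, not the paper's actual tasks or results:

```python
import math

# Hypothetical symbolic task: map each input string to its reversal,
# instantiated over the nucleotide alphabet A/T/C/G. The paper's real
# tasks differ; this only illustrates the few-shot prompt format.
def build_prompt(demos, query):
    """Concatenate k input->output demonstrations, then the unanswered query."""
    lines = [f"{x}>{y}" for x, y in demos]
    lines.append(f"{query}>")
    return "\n".join(lines)

def fit_log_linear(shots, accuracies):
    """Least-squares fit of accuracy = a + b * ln(k): the log-linear
    gain with shot count k. Real accuracies would come from scoring
    the model's completions of prompts like the one above."""
    xs = [math.log(k) for k in shots]
    n = len(xs)
    mx, my = sum(xs) / n, sum(accuracies) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, accuracies))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

demos = [("ATCG", "GCTA"), ("GGAT", "TAGG")]
print(build_prompt(demos, "CCAT"))
```

A positive fitted slope `b` on held-out accuracies is what "log-linear gains in pattern induction" amounts to: each doubling of the number of demonstrations buys a roughly constant accuracy increment.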
Why it matters?
This is important because it suggests that in-context learning isn't limited to language. It's likely a more general phenomenon that arises when a model is trained to predict patterns in any kind of sequential data. This means we might be able to apply the same learning techniques to areas like genetics, protein folding, or other fields dealing with complex sequences, and it supports the idea that there's a universal way these models learn regardless of the data type.
Abstract
In-context learning (ICL) -- the capacity of a model to infer and apply abstract patterns from examples provided within its input -- has been extensively studied in large language models trained for next-token prediction on human text. In fact, prior work often attributes this emergent behavior to distinctive statistical properties in human language. This raises a fundamental question: can ICL arise organically in other sequence domains purely through large-scale predictive training? To explore this, we turn to genomic sequences, an alternative symbolic domain rich in statistical structure. Specifically, we study the Evo2 genomic model, trained predominantly on next-nucleotide (A/T/C/G) prediction, at a scale comparable to mid-sized LLMs. We develop a controlled experimental framework comprising symbolic reasoning tasks instantiated in both linguistic and genomic forms, enabling direct comparison of ICL across genomic and linguistic models. Our results show that genomic models, like their linguistic counterparts, exhibit log-linear gains in pattern induction as the number of in-context demonstrations increases. To the best of our knowledge, this is the first evidence of organically emergent ICL in genomic sequences, supporting the hypothesis that ICL arises as a consequence of large-scale predictive modeling over rich data. These findings extend emergent meta-learning beyond language, pointing toward a unified, modality-agnostic view of in-context learning.