Adam's Law: Textual Frequency Law on Large Language Models
Hongyuan Adam Lu, Z. L., Victor Wei, Zefan Zhang, Zhao Hong, Qiqi Xiang, Bowen Cao, Wai Lam
2026-04-07
Summary
This paper studies how the frequency of words and phrases in text relates to the performance of Large Language Models, or LLMs. It proposes that LLMs work better when they're exposed to more commonly used language, both when given instructions and when being trained.
What's the problem?
LLMs are powerful, but we don't fully understand how the frequency of words and phrases in their training data affects their performance. Most LLMs are trained on secret datasets, so it's hard to know what language they've seen and how often. This makes it difficult to improve them systematically, because we can't easily control the language they learn from.
What's the solution?
The researchers developed a three-part framework. First, they proposed a 'Textual Frequency Law': the idea that LLMs should prefer common language. Since training data is hidden, they devised a way to estimate how often sentences appear online, and refined that estimate using text that LLMs themselves generate by extending the dataset's sentences. They also used a paraphrasing tool to rewrite input questions or prompts into more common phrasing. Finally, they fine-tuned LLMs on sentences ordered by frequency, working through the curriculum in increasing order of sentence-level frequency. They tested this on a dataset they created, covering tasks like math reasoning, translation, commonsense reasoning, and tool calling.
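The "rewrite prompts into more common phrasing" step can be sketched as follows. This is a minimal illustration, not the paper's method: the word-frequency table and the mean-log-frequency scoring rule are hypothetical stand-ins for the paper's online-resource estimate of sentence-level frequency.

```python
# Sketch: given candidate paraphrases of a prompt, score each by how
# common its words are and keep the most frequently phrased one.
# WORD_FREQ values are illustrative, not real corpus statistics.
import math

# Toy per-million word frequencies (hypothetical values).
WORD_FREQ = {
    "what": 3500.0, "is": 9000.0, "the": 25000.0, "sum": 40.0,
    "of": 12000.0, "and": 11000.0, "compute": 8.0, "aggregate": 3.0,
    "2": 500.0, "3": 400.0,
}

def sentence_frequency_score(sentence: str) -> float:
    """Mean log-frequency of the sentence's words (unknown words get a small floor)."""
    words = sentence.lower().rstrip("?.").split()
    logs = [math.log(WORD_FREQ.get(w, 0.5)) for w in words]
    return sum(logs) / len(logs)

def pick_most_frequent(paraphrases: list[str]) -> str:
    """Return the candidate phrased in the most common language."""
    return max(paraphrases, key=sentence_frequency_score)

candidates = [
    "Compute the aggregate of 2 and 3",
    "What is the sum of 2 and 3?",
]
print(pick_most_frequent(candidates))  # → "What is the sum of 2 and 3?"
```

In practice the paper estimates frequency at the sentence level from online resources rather than averaging word counts, but the selection logic is the same: among semantically equivalent phrasings, prefer the one the model has most likely seen often.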
Why it matters?
This research is important because it suggests a new way to improve LLMs without needing access to their original training data. By focusing on language frequency, we can potentially make these models more reliable and effective, even as they become more complex. It provides a practical approach to optimizing LLM performance using readily available resources and a clever training strategy.
Abstract
While textual frequency has been validated as relevant to human cognition in reading speed, its relationship to Large Language Models (LLMs) is seldom studied. We propose a novel research direction centered on textual data frequency, a topic that is, to the best of our knowledge, understudied. Our framework is composed of three units. First, this paper proposes the Textual Frequency Law (TFL), which states that frequent textual data should be preferred for LLMs in both prompting and fine-tuning. Since many LLMs do not disclose their training data, we propose using online resources to estimate sentence-level frequency. We then utilize an input paraphraser to rewrite the input into a more frequent textual expression. Next, we propose Textual Frequency Distillation (TFD), which queries LLMs to conduct story completion by further extending the sentences in the datasets; the resulting corpora are used to adjust the initial estimates. Finally, we propose Curriculum Textual Frequency Training (CTFT), which fine-tunes LLMs in increasing order of sentence-level frequency. Experiments are conducted on our curated Textual Frequency Paired Dataset (TFPD), covering math reasoning, machine translation, commonsense reasoning, and agentic tool calling. Results show the effectiveness of our framework.
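The TFD and CTFT steps from the abstract can be sketched together: adjust an initial frequency estimate using counts from LLM-distilled text, then order fine-tuning data by the adjusted sentence-level frequency. This is a toy illustration under loud assumptions: the blending rule, the weight `alpha`, and all numbers are invented, and the ascending sort simply follows the abstract's wording ("increasing order of sentence-level frequency"), not the paper's actual schedule.

```python
# Sketch of TFD-style adjustment followed by CTFT-style ordering.
# adjust_frequency and its alpha weight are hypothetical; the paper's
# exact estimator and interpolation are not reproduced here.

def adjust_frequency(initial: float, distilled_count: int, alpha: float = 0.5) -> float:
    """Blend the online-resource estimate with a count from distilled corpora."""
    return alpha * initial + (1 - alpha) * distilled_count

def curriculum_order(examples):
    """Sort sentences by adjusted frequency, least frequent first."""
    scored = [(s, adjust_frequency(f0, c)) for s, f0, c in examples]
    return [s for s, _ in sorted(scored, key=lambda p: p[1])]

# (sentence, initial online estimate, count in distilled story-completion text)
training_set = [
    ("the cat sat on the mat", 9.5, 12),
    ("a rare idiomatic construction", 0.2, 1),
    ("an uncommon phrasing of a question", 1.1, 3),
]
for stage, sentence in enumerate(curriculum_order(training_set), start=1):
    print(stage, sentence)
```

Fine-tuning would then proceed stage by stage through this ordering, so the most frequent sentences are seen last.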