
Scaling LLM Pre-training with Vocabulary Curriculum

Fangyuan Yu

2025-02-26


Summary

This paper introduces vocabulary curriculum learning, a more efficient way to teach language models their vocabulary by adding words gradually during pretraining, mimicking how humans acquire language.

What's the problem?

Current AI language models fix their entire vocabulary before training begins, unlike people, who pick up words gradually as they learn. This static approach makes it harder for a model to represent language at different levels of granularity and may not be the most efficient use of training compute.

What's the solution?

The researchers created a method that introduces new tokens gradually as the model trains, much as people learn new words over time. Their approach uses entropy, a measure of how unpredictable the next token is, to decide when to add longer tokens, which frees the model to spend more effort on the harder, less predictable parts of the text. They tested this method on small GPT models and found that it scaled more efficiently than training with a fixed vocabulary.
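The entropy-guided expansion step can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: it treats a token as "predictable" when the conditional entropy of its successor (estimated from simple bigram counts) falls below a threshold, and fuses such pairs into longer tokens. The bigram estimator, the threshold value, and the greedy left-to-right merge are all assumptions made for the demo.

```python
import math
from collections import Counter

def successor_entropy(tokens, t):
    """H(next | t): how unpredictable the token that follows t is,
    estimated from bigram counts over the token sequence."""
    followers = Counter(b for a, b in zip(tokens, tokens[1:]) if a == t)
    total = sum(followers.values())
    if total == 0:
        return float("inf")  # no evidence: treat as unpredictable
    return -sum((c / total) * math.log2(c / total) for c in followers.values())

def expand_vocab(tokens, threshold=0.5):
    """One expansion step: wherever a token's successor is highly
    predictable (low entropy), fuse the pair into a single longer token."""
    predictable = {t for t in set(tokens) if successor_entropy(tokens, t) < threshold}
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] in predictable:
            merged.append(tokens[i] + tokens[i + 1])  # fuse the predictable continuation
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# In this toy corpus "q" is always followed by "u", so "qu" is fused into one
# token, while ambiguous positions (e.g. what follows "u") stay as short tokens.
print(expand_vocab(list("queen quilt quick")))
```

Alternating this expansion step with model training, as the paper's method does, grows the vocabulary so that predictable content is covered by longer tokens, concentrating the model's prediction effort on harder contexts.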

Why it matters?

This matters because it could make language models learn faster and better, meaning smarter AI assistants could be built more quickly and with less computing power. It also brings the way AI acquires language closer to how humans do, which could lead to AI that communicates with people more naturally.

Abstract

Modern language models rely on static vocabularies, fixed before pretraining, in contrast to the adaptive vocabulary acquisition observed in human language learning. To bridge this gap, we introduce vocabulary curriculum learning, an approach that improves pretraining efficiency with log-linear scaling gains relative to vocabulary size. Our method alternates between entropy-guided vocabulary expansion and model optimization, enabling models to learn transferable representations across diverse tokenization granularities. This approach naturally gives rise to an optimal computation allocation pattern: longer tokens capture predictable content, while shorter tokens focus on more complex, harder-to-predict contexts. Experiments on small-scale GPT models demonstrate improved scaling efficiency, reinforcing the effectiveness of dynamic tokenization. We release our code to support further research and plan to extend our experiments to larger models and diverse domains.