
Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies

Chaofan Tao, Qian Liu, Longxu Dou, Niklas Muennighoff, Zhongwei Wan, Ping Luo, Min Lin, Ngai Wong

2024-07-19

Summary

This paper discusses how the size of the vocabulary used in large language models (LLMs) affects their performance. It argues that larger models should use larger vocabularies to improve efficiency and effectiveness.

What's the problem?

Most research on LLMs has focused on the number of parameters and the amount of training data, ignoring how vocabulary size impacts performance. Many existing LLMs use vocabularies that are too small, which can limit their ability to understand and generate language effectively.

What's the solution?

The authors propose three complementary approaches (IsoFLOPs analysis, derivative estimation, and a parametric fit of the loss function) to predict the compute-optimal vocabulary size for an LLM given its compute budget. Training models from 33M to 3B parameters with different vocabulary configurations, they found that the optimal vocabulary size grows with the compute budget and that most current LLMs use vocabularies that are too small; for example, they estimate that Llama2-70B should have used a vocabulary of at least 216K tokens rather than its actual 32K. They validated these predictions empirically, showing that adopting the predicted vocabulary size consistently improves downstream performance over commonly used sizes.
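To make the selection procedure concrete, here is a minimal, hypothetical sketch of an IsoFLOPs-style analysis: at one fixed compute budget, measure the loss for several candidate vocabulary sizes, fit a smooth curve, and take its minimum as the compute-optimal vocabulary. The measurements and the quadratic fit in log vocabulary size are illustrative placeholders, not the paper's data or its actual parametric form.

```python
# Illustrative sketch of IsoFLOPs-style vocabulary selection (not the paper's code).
# Assumption: at one fixed FLOPs budget we have measured losses for a few
# candidate vocabulary sizes; we fit a smooth curve and pick its minimum.
import numpy as np

# Hypothetical (vocab_size, loss) measurements at a single FLOPs budget.
vocab_sizes = np.array([16_000, 32_000, 64_000, 128_000, 256_000])
losses = np.array([2.41, 2.35, 2.31, 2.30, 2.33])  # made-up numbers

# Fit a quadratic in log(vocab size), a simple stand-in for the paper's
# parametric loss fit, and locate its minimum analytically.
x = np.log(vocab_sizes)
a, b, c = np.polyfit(x, losses, deg=2)
optimal_vocab = np.exp(-b / (2 * a))  # vertex of the fitted parabola

print(f"Fitted compute-optimal vocabulary size: {optimal_vocab:,.0f} tokens")
```

Repeating this procedure across several FLOPs budgets traces out how the optimal vocabulary size grows with compute, which is the relationship the paper quantifies.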

Why it matters?

This research highlights the critical role of vocabulary size in enhancing the capabilities of LLMs. By optimizing vocabulary size, developers can create more efficient and powerful language models, which is essential as these technologies become more integrated into various applications.

Abstract

Research on scaling large language models (LLMs) has primarily focused on model parameters and training data size, overlooking the role of vocabulary size. Intuitively, larger vocabularies enable more efficient tokenization by representing sentences with fewer tokens, but they also increase the risk of under-fitting representations for rare tokens. We investigate how vocabulary size impacts LLM scaling laws by training models ranging from 33M to 3B parameters on up to 500B characters with various vocabulary configurations. We propose three complementary approaches for predicting the compute-optimal vocabulary size: IsoFLOPs analysis, derivative estimation, and parametric fit of the loss function. Our approaches converge on the same result that the optimal vocabulary size depends on the available compute budget and that larger models deserve larger vocabularies. However, most LLMs use too small vocabulary sizes. For example, we predict that the optimal vocabulary size of Llama2-70B should have been at least 216K, 7 times larger than its vocabulary of 32K. We validate our predictions empirically by training models with 3B parameters across different FLOPs budgets. Adopting our predicted optimal vocabulary size consistently improves downstream performance over commonly used vocabulary sizes. By increasing the vocabulary size from the conventional 32K to 43K, we improve performance on ARC-Challenge from 29.1 to 32.0 with the same 2.3e21 FLOPs. Our work emphasizes the necessity of jointly considering model parameters and vocabulary size for efficient scaling.
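The abstract's FLOPs accounting is easier to follow with a toy model of where vocabulary size enters the compute budget. The sketch below uses the standard 6·N·D FLOPs approximation and an assumed characters-per-token compression rate; the specific numbers, parameter split, and compression model are illustrative assumptions, not the paper's fitted scaling law.

```python
# Rough sketch of how vocabulary size enters a FLOPs budget (assumptions, not
# the paper's fitted law): total parameters split into non-vocabulary parameters
# and vocabulary (embedding + unembedding) parameters, and a larger vocabulary
# compresses a fixed character corpus into fewer training tokens.
def training_flops(non_vocab_params: float, vocab_size: int, embed_dim: int,
                   num_chars: float, chars_per_token: float) -> float:
    vocab_params = 2 * vocab_size * embed_dim      # embedding + unembedding
    total_params = non_vocab_params + vocab_params
    num_tokens = num_chars / chars_per_token       # fewer tokens if vocab compresses better
    return 6 * total_params * num_tokens           # standard 6*N*D approximation

# Hypothetical comparison at a fixed character budget: doubling the vocabulary
# adds embedding parameters but (illustratively) improves compression.
small = training_flops(3e9, 32_000, 3072, 5e11, chars_per_token=3.8)
large = training_flops(3e9, 64_000, 3072, 5e11, chars_per_token=4.1)
print(f"32K vocab: {small:.3e} FLOPs, 64K vocab: {large:.3e} FLOPs")
```

The trade-off this toy model exposes, more embedding parameters versus fewer tokens needed to cover the same text, is the balance the paper's scaling analysis optimizes when it recommends larger vocabularies for larger compute budgets.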