
Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling

Hongzhi Huang, Defa Zhu, Banggu Wu, Yutao Zeng, Ya Wang, Qiyang Min, Xun Zhou

2025-01-29


Summary

This paper introduces Over-Tokenized Transformers, a new way to improve AI language models. It focuses on how text is broken down into smaller pieces (tokens) before the AI processes it, and shows that giving the model a larger, more detailed vocabulary on the input side can make it work better without making the model itself bigger.

What's the problem?

Current AI language models use the same method to break down words for both input and output, which isn't always the best approach. Smaller models struggle with very detailed word breakdowns, while larger models don't fully benefit from simpler ones. It's like trying to use the same textbook for both beginners and advanced students - it doesn't work well for either group.

What's the solution?

The researchers created Over-Tokenized Transformers, which use different vocabularies for breaking down words when the AI reads them (input) versus when it writes them (output). For input, they use multi-gram tokens that can represent longer chunks of text. They found that using bigger input vocabularies (a larger set of token pieces the model can recognize) consistently improved how well the AI performed, no matter its size. This approach allowed smaller AI models to match the performance of models twice their size.
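To make the idea concrete, below is a minimal sketch of what an "over-tokenized" input embedding could look like, assuming the extra multi-gram (here, 2-gram) IDs are folded into a fixed-size table with a simple modulo hash. The class name, table sizes, and hashing scheme are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn


class OverTokenizedEmbedding(nn.Module):
    """Sketch of an input embedding with an enlarged (multi-gram) input vocabulary.

    The 1-gram table matches the base tokenizer; 2-gram IDs are hashed into a
    separate fixed-size table and added on top. Names and sizes are illustrative.
    """

    def __init__(self, base_vocab: int, d_model: int, ngram_table_size: int = 262144):
        super().__init__()
        self.base_vocab = base_vocab
        self.ngram_table_size = ngram_table_size
        self.unigram = nn.Embedding(base_vocab, d_model)       # standard token embedding
        self.bigram = nn.Embedding(ngram_table_size, d_model)  # extra n-gram table

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) IDs produced by the ordinary tokenizer.
        uni = self.unigram(token_ids)

        # Pair each token with its predecessor to form a 2-gram ID
        # (the first position is paired with a padding ID of 0).
        prev = torch.roll(token_ids, shifts=1, dims=1)
        prev[:, 0] = 0
        bigram_ids = (prev * self.base_vocab + token_ids) % self.ngram_table_size

        # The effective input vocabulary is base_vocab ** 2, but the parameter
        # cost is capped by the hashed table size.
        return uni + self.bigram(bigram_ids)


# Minimal usage example:
emb = OverTokenizedEmbedding(base_vocab=32000, d_model=512)
ids = torch.randint(0, 32000, (2, 16))
print(emb(ids).shape)  # torch.Size([2, 16, 512])
```

The output layer would still predict over the original base vocabulary, so only the reading side of the model is "over-tokenized"; that is what keeps the extra vocabulary essentially free at inference time.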

Why it matters?

This matters because it shows a new way to make AI language models better without just making them bigger, which can be expensive and energy-intensive. By changing how words are broken down for the AI to read, we can create more efficient and powerful language models. This could lead to better AI assistants, more accurate translation tools, and improved language understanding in various applications, all while using less computing power. It's like finding a way to make cars go faster and use less fuel at the same time.

Abstract

Tokenization is a fundamental component of large language models (LLMs), yet its influence on model scaling and performance is not fully explored. In this paper, we introduce Over-Tokenized Transformers, a novel framework that decouples input and output vocabularies to improve language modeling performance. Specifically, our approach scales up input vocabularies to leverage multi-gram tokens. Through extensive experiments, we uncover a log-linear relationship between input vocabulary size and training loss, demonstrating that larger input vocabularies consistently enhance model performance, regardless of model size. Using a large input vocabulary, we achieve performance comparable to double-sized baselines with no additional cost. Our findings highlight the importance of tokenization in scaling laws and provide practical insight for tokenizer design, paving the way for more efficient and powerful LLMs.
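The "log-linear relationship" in the abstract can be written schematically as below; the constants α and β are placeholders to show the shape of the trend, not values reported in the paper.

```latex
% Schematic form of the reported trend: training loss falls roughly
% linearly in the logarithm of the input vocabulary size V_in.
\mathcal{L}(V_{\text{in}}) \approx \alpha - \beta \log V_{\text{in}}, \qquad \beta > 0
```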