Tokenization Falling Short: The Curse of Tokenization
Yekun Chai, Yewei Fang, Qiwei Peng, Xuhong Li
2024-06-19

Summary
This paper discusses the challenges faced by language models due to a process called tokenization, which converts text into smaller pieces (tokens) for processing. The authors highlight how these challenges can affect the performance of large language models (LLMs) and propose ways to improve their robustness.
What's the problem?
Tokenization is a crucial step in preparing text for language models, but it has significant drawbacks. It is sensitive to typos and variations in text length: small errors in the input can change how the text is split into tokens and cause the model to misinterpret it. In addition, current tokenization methods largely ignore the internal structure of words and phrases, making it harder for models to accurately capture the meaning of the text.
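To illustrate this fragility, here is a minimal sketch (not from the paper) showing how a single typo can change a word's subword segmentation under an off-the-shelf GPT-2 tokenizer; the splits shown in the comments are illustrative and may differ by tokenizer version.

```python
# Minimal sketch: a single typo can change the subword segmentation
# that a language model actually sees for a word.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

print(tok.tokenize("tokenization"))   # e.g. ['token', 'ization']
print(tok.tokenize("tokeniaztion"))   # the typo produces a very different split
```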
What's the solution?
The authors conducted a study to explore these issues through three main research questions: how tokenization affects complex problem solving, how well models understand the internal structure of tokens, and how resilient they are to typographical errors. They found that while scaling up the model improves performance, LLMs still struggle with biases caused by typos and formatting differences. To address these problems, they suggest subword regularization techniques such as BPE-dropout, which randomly varies how words are split into subwords during training so that models become less dependent on any single tokenization.
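The sketch below shows the core idea behind BPE-dropout; it is not the authors' released code and assumes the Hugging Face `tokenizers` library, a toy corpus, and a 10% dropout rate. With dropout enabled, some merge operations are randomly skipped at encoding time, so the same word can be segmented differently on different passes.

```python
# Minimal BPE-dropout sketch using the Hugging Face `tokenizers` library.
# With dropout > 0, merge rules are randomly skipped during encoding, so the
# same string can be split into different subword sequences across calls.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Toy corpus purely for illustration; real training would use a large text corpus.
corpus = ["the tokenizer tokenizes tokens", "tokenization can fall short"] * 200

tokenizer = Tokenizer(BPE(unk_token="[UNK]", dropout=0.1))  # 10% merge dropout
tokenizer.pre_tokenizer = Whitespace()
tokenizer.train_from_iterator(corpus, trainer=BpeTrainer(special_tokens=["[UNK]"]))

# Because merges are dropped at random, repeated encodes of the same word
# can yield different segmentations, which is the regularization effect.
for _ in range(3):
    print(tokenizer.encode("tokenization").tokens)
```

Training a model on such varied segmentations exposes it to many plausible splits of each word, which is how this kind of regularization can reduce sensitivity to typos and formatting changes.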
Why it matters?
This research is important because it sheds light on a critical aspect of how language models work and highlights the need for better tokenization methods. By understanding and addressing the 'curse of tokenization,' researchers can develop more robust and reliable language models that perform better in real-world applications, such as chatbots, translation services, and other AI tools that rely on understanding human language.
Abstract
Language models typically tokenize raw text into sequences of subword identifiers from a predefined vocabulary, a process inherently sensitive to typographical errors, length variations, and largely oblivious to the internal structure of tokens; we term these issues the curse of tokenization. In this study, we delve into these drawbacks and demonstrate that large language models (LLMs) remain susceptible to these problems. This study systematically investigates these challenges and their impact on LLMs through three critical research questions: (1) complex problem solving, (2) token structure probing, and (3) resilience to typographical variation. Our findings reveal that scaling model parameters can mitigate the issue of tokenization; however, LLMs still suffer from biases induced by typos and other text format variations. Our experiments show that subword regularization such as BPE-dropout can mitigate this issue. We will release our code and data to facilitate further research.