Counting Ability of Large Language Models and Impact of Tokenization
Xiang Zhang, Juntai Cao, Chenyu You
2024-10-28

Summary
This paper examines how the counting and reasoning abilities of large language models are affected by the way their input data is tokenized.
What's the problem?
Large language models (LLMs) built on the Transformer architecture have inherent limits on their reasoning abilities. They struggle with tasks that demand deeper reasoning as the input grows longer, such as counting. Additionally, the way input text is broken into smaller pieces (tokenization) affects how well these models perform on counting tasks: the tokenization methods in common use may not be well suited to reasoning, leading to errors in the model's output.
What's the solution?
The authors investigate how different tokenization methods affect the counting abilities of LLMs, combining theoretical analysis with experiments to show how tokenization choices lead to substantial variations in performance. They find that character-level tokenization improves counting compared with the byte-level (BPE) tokenization commonly used in LLMs, underscoring the importance of choosing a tokenization strategy that supports reasoning and counting tasks.
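To make the distinction concrete, here is a minimal sketch (not taken from the paper) contrasting byte-level BPE tokenization with character-level tokenization on a simple letter-counting input. It assumes the `tiktoken` package and uses the `cl100k_base` encoding purely for illustration.

```python
# Minimal sketch: how byte-level BPE vs. character-level tokenization
# changes what a model "sees" when asked to count letters.
# Assumes the `tiktoken` package; the input string is illustrative only.
import tiktoken

text = "aaabacaadaaea"   # task: count the occurrences of 'a'
target = "a"

# Byte-level BPE (typical for LLMs): repeated letters may be merged into
# multi-character tokens, so one token no longer corresponds to one item.
bpe = tiktoken.get_encoding("cl100k_base")
bpe_tokens = [bpe.decode([tok_id]) for tok_id in bpe.encode(text)]
print("BPE tokens:       ", bpe_tokens)

# Character-level tokenization (typical for expert counting models):
# every symbol is its own token, so counting is a scan over the sequence.
char_tokens = list(text)
print("Character tokens: ", char_tokens)

print("Ground-truth count of", repr(target), ":",
      sum(tok == target for tok in char_tokens))
```

Under BPE, several target characters can be hidden inside a single merged token, so the model can no longer count by simply attending to one token per item; this is the kind of tokenization-induced burden the paper analyzes.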
Why it matters?
This research is significant because it sheds light on a crucial aspect of language model performance: tokenization. By understanding how different ways of breaking down input data can impact reasoning abilities, researchers can develop better models that are more accurate and efficient at tasks that require counting and complex reasoning. This could lead to improvements in various applications, from natural language processing to AI systems that need to understand numerical information.
Abstract
Transformers, the backbone of modern large language models (LLMs), face inherent architectural limitations that impede their reasoning capabilities. Unlike recurrent networks, Transformers lack recurrent connections, confining them to constant-depth computation. This restriction places them in the complexity class TC^0, making them theoretically incapable of solving tasks that demand increasingly deep reasoning as input length grows. Counting, a fundamental component of many reasoning tasks, also requires reasoning depth to grow linearly to be performed inductively. While previous studies have established the upper limits of counting ability in Transformer-based expert models (i.e., models specifically trained for counting tasks), these findings do not directly extend to general-purpose LLMs due to differences in reasoning mechanisms. Recent work has highlighted how Chain of Thought (CoT) reasoning can help alleviate some of the architectural limitations of Transformers in counting tasks. However, little attention has been paid to the role of tokenization in these models. Unlike expert models that often use character-level tokenization, LLMs typically rely on byte-level (BPE) tokenizers, which fundamentally alters the way reasoning is processed. Our work investigates the impact of tokenization on the counting abilities of LLMs, uncovering substantial performance variations based on input tokenization differences. We provide both theoretical and experimental analyses, offering insights into how tokenization choices can undermine models' theoretical computability, thereby inspiring the design of new tokenization methods to enhance reasoning in LLMs.
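As a rough illustration of the "reasoning depth grows linearly" point (a sketch, not code from the paper), the snippet below counts inductively: the tally at each position depends on the tally at the previous one, so the number of sequential steps scales with input length. This is the kind of step-by-step intermediate state that Chain of Thought lets a model write out explicitly.

```python
# Minimal sketch (not from the paper): inductive counting keeps a running
# tally whose value at step i depends on the tally at step i-1, so the
# number of sequential reasoning steps grows linearly with input length.
def count_inductively(sequence: str, target: str):
    tally = 0
    trace = []  # CoT-like record of intermediate states
    for i, symbol in enumerate(sequence):
        if symbol == target:
            tally += 1
        trace.append(f"step {i}: read '{symbol}', tally = {tally}")
    return tally, trace

count, steps = count_inductively("aabacada", "a")
print("\n".join(steps))
print("final count:", count)
```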