TokSuite: Measuring the Impact of Tokenizer Choice on Language Model Behavior
Gül Sena Altıntaş, Malikeh Ehghaghi, Brian Lester, Fengyuan Liu, Wanru Zhao, Marco Ciccone, Colin Raffel
2025-12-25
Summary
This paper introduces TokSuite, a new suite of models and tests designed to help researchers understand how tokenization, the process of breaking text down into pieces, affects how well language models work.
What's the problem?
Language models rely on tokenizers to process text, but it's been difficult to figure out *how much* of a model's success or failure is actually due to the tokenizer itself, and not just the model's overall design or the data it was trained on. Researchers needed a way to isolate and study the impact of tokenization.
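As a minimal illustration of why this is hard to disentangle, here is a sketch assuming the Hugging Face `transformers` library; `gpt2` and `bert-base-uncased` are just common examples, not the tokenizers studied in the paper. Two off-the-shelf tokenizers can split the exact same sentence into quite different token sequences:

```python
# Minimal sketch: the same sentence tokenized by two popular tokenizers.
# Assumes the Hugging Face `transformers` package is installed and the
# tokenizer files can be downloaded; the two model names are illustrative
# choices, not the tokenizers trained in TokSuite.
from transformers import AutoTokenizer

text = "Tokenization quietly shapes what a language model sees."

for name in ["gpt2", "bert-base-uncased"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    pieces = tokenizer.tokenize(text)
    print(f"{name}: {len(pieces)} tokens -> {pieces}")
```

Because any two publicly released models differ in tokenizer *and* architecture, data, and training recipe, a gap in their scores can't be attributed to tokenization alone.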
What's the solution?
The researchers created TokSuite, which includes fourteen language models that are identical except for the tokenizer they use: each is trained with the same architecture, dataset, training budget, and initialization. They also built a new benchmark that challenges the models with realistically perturbed text, the kind of imperfect input that appears in the real world, to see how different tokenizers handle those situations. By comparing the performance of these otherwise-identical models, they could pinpoint the strengths and weaknesses of various tokenization methods.
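To make the perturbation idea concrete, here is a hedged sketch (again assuming `transformers`; the typo perturbation below is an illustrative guess, not necessarily one of the benchmark's actual perturbation types) showing how a small, realistic edit can change a sentence's tokenization:

```python
# Minimal sketch: a realistic typo changes how the text is tokenized.
# The specific perturbation is illustrative; TokSuite's benchmark defines
# its own set of real-world perturbations.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative tokenizer

clean = "The weather is beautiful today."
perturbed = "Teh weathre is beuatiful today."  # transposed-letter typos

for text in (clean, perturbed):
    pieces = tokenizer.tokenize(text)
    print(f"{text!r}: {len(pieces)} tokens -> {pieces}")
```

Typos like these tend to fragment familiar words into rarer subword pieces, so tokenizers that degrade gracefully under such noise would be expected to yield more robust models.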
Why it matters?
Understanding how tokenizer choice affects model behavior is crucial for building better language models. TokSuite provides a way to systematically study tokenization, which can lead to improvements in model performance, especially when dealing with real-world text that isn't perfectly clean and formatted.
Abstract
Tokenizers provide the fundamental basis through which text is represented and processed by language models (LMs). Despite the importance of tokenization, its role in LM performance and behavior is poorly understood due to the challenge of measuring the impact of tokenization in isolation. To address this need, we present TokSuite, a collection of models and a benchmark that supports research into tokenization's influence on LMs. Specifically, we train fourteen models that use different tokenizers but are otherwise identical using the same architecture, dataset, training budget, and initialization. Additionally, we curate and release a new benchmark that specifically measures model performance subject to real-world perturbations that are likely to influence tokenization. Together, TokSuite allows robust decoupling of the influence of a model's tokenizer, supporting a series of novel findings that elucidate the respective benefits and shortcomings of a wide range of popular tokenizers.