Evaluating Tokenizer Performance of Large Language Models Across Official Indian Languages
S. Tamang, D. J. Bora
2024-11-20

Summary
This paper evaluates how well the tokenizers used by different large language models (LLMs) handle all 22 official languages of India, focusing on how efficiently they process text in these languages.
What's the problem?
While large language models have made significant advancements, there is a lack of effective methods to analyze how well they handle multiple languages, especially in the context of Indian languages. Tokenization, which is the process of breaking down text into smaller parts that models can understand, is critical for optimizing performance in these multilingual settings. However, not all tokenizers perform equally well across different languages.
What's the solution?
The authors conducted a comprehensive evaluation of the tokenizers used by 12 different LLMs across all 22 official Indian languages. They used a metric called Normalized Sequence Length (NSL) to measure the efficiency of each tokenizer. Their findings showed that the SUTRA tokenizer outperformed all others, including several Indic-specific models, in 14 languages. They also found that the newer GPT-4o improves on its predecessor GPT-4 in processing Indian languages, and that some models, such as Project Indus, perform poorly in certain languages.
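This summary does not spell out how NSL is computed; a common formulation is the ratio of the token count a candidate tokenizer produces for a text to the token count a baseline tokenizer produces for the same text, averaged over a corpus, with lower values meaning shorter sequences. The sketch below illustrates that reading using Hugging Face tokenizers; the model identifiers, the baseline choice, and the sample texts are illustrative assumptions, not taken from the study.

```python
# Minimal sketch of a Normalized Sequence Length (NSL) comparison, assuming
# NSL = (candidate tokenizer's token count) / (baseline tokenizer's token count),
# averaged over a set of texts. Model names below are illustrative only.
from transformers import AutoTokenizer


def token_count(tokenizer, text: str) -> int:
    """Number of tokens the tokenizer produces for the text."""
    return len(tokenizer.encode(text, add_special_tokens=False))


def normalized_sequence_length(candidate, baseline, texts) -> float:
    """Average, over the texts, of candidate token count / baseline token count."""
    ratios = []
    for text in texts:
        base_len = token_count(baseline, text)
        if base_len > 0:
            ratios.append(token_count(candidate, text) / base_len)
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    # Hypothetical choices: GPT-2's tokenizer as baseline, MuRIL as candidate.
    baseline = AutoTokenizer.from_pretrained("gpt2")
    candidate = AutoTokenizer.from_pretrained("google/muril-base-cased")
    sample_texts = ["नमस्ते, आप कैसे हैं?", "ভারত একটি বহুভাষিক দেশ।"]
    print(f"NSL = {normalized_sequence_length(candidate, baseline, sample_texts):.3f}")
```

Under this reading, an NSL below 1.0 means the candidate tokenizer encodes the same text in fewer tokens than the baseline, which is the kind of efficiency the study compares across languages.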
Why it matters?
This research is important because it sheds light on the effectiveness of tokenizers for multilingual models, particularly for underrepresented languages like those spoken in India. By identifying which tokenizers work best, this study can help improve language processing technologies, making them more accessible and efficient for diverse linguistic communities.
Abstract
Large Language Models (LLMs) based on transformer architectures have revolutionized a variety of domains, with tokenization playing a pivotal role in their pre-processing and fine-tuning stages. In multilingual models, particularly those tailored for Indic languages, effective tokenization is crucial for optimizing performance. This paper presents a comprehensive evaluation of tokenizers used by 12 LLMs across all 22 official languages of India, with a focus on comparing the efficiency of their tokenization processes. We employed the Normalized Sequence Length (NSL) as a key metric in our analysis. Our findings reveal that the SUTRA tokenizer outperforms all other models, including several Indic-specific models, excelling in 14 languages. Notable insights include the SUTRA tokenizer's superior handling of Indic languages, GPT-4o's advancement over its predecessor GPT-4 in processing Indian languages, and the limited performance of Project Indus in certain languages. This study underscores the critical importance of developing targeted tokenization strategies for multilingual and Indic-centric models, laying the groundwork for future improvements in tokenizer design to enhance linguistic coverage and model efficiency.