Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of LLMs for Low-Resource NLP
François Remy, Pieter Delobelle, Hayastan Avetisyan, Alfiya Khabibullina, Miryam de Lhoneux, Thomas Demeester
2024-08-09

Summary
This paper introduces trans-tokenization, a method that adapts large language models (LLMs) to low-resource languages by transferring token embeddings from a high-resource source language to the vocabulary of the target language.
What's the problem?
Building effective language models for languages with limited training data is challenging. Many low- and mid-resource languages lack enough high-quality data to train models well, so they struggle to benefit from advances in natural language processing (NLP). As a result, speakers of these languages miss out on the advantages of AI technologies.
What's the solution?
The authors propose a strategy called trans-tokenization, which adapts a well-trained LLM from a high-resource source language to a new, low-resource target language. Each target-language token embedding is initialized as a weighted average of the embeddings of semantically similar tokens in the source language, with the mapping derived from a translation resource covering both languages (a sketch of this initialization step follows below). They validated the approach with the Tweeties, a series of trans-tokenized LLMs, which performed competitively on various downstream tasks across a small but diverse set of languages. They also introduced Hydra LLMs, models with multiple swappable language modeling heads and embedding tables, which let a single model switch between languages and enabled state-of-the-art zero-shot machine translation for Tatar without needing extensive parallel data.
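To make the embedding-initialization step concrete, here is a minimal Python sketch of how a trans-tokenized embedding table could be built. The function name, input format, and example weights below are illustrative assumptions, not the authors' actual implementation; in the paper, the token mapping comes from a translation resource covering both the source and target languages.

    import numpy as np

    def trans_tokenize_embeddings(source_emb, mapping):
        """Initialize target-language token embeddings as weighted averages
        of semantically similar source-language token embeddings.

        source_emb: dict of source token id -> embedding vector (np.ndarray)
        mapping:    dict of target token id -> list of (source token id, weight)
                    pairs, e.g. derived from a translation resource
        """
        dim = len(next(iter(source_emb.values())))
        target_emb = {}
        for tgt_id, pairs in mapping.items():
            total = sum(weight for _, weight in pairs)
            vec = np.zeros(dim)
            for src_id, weight in pairs:
                # Weighted average: each aligned source token contributes
                # proportionally to its (normalized) alignment weight.
                vec += (weight / total) * np.asarray(source_emb[src_id])
            target_emb[tgt_id] = vec
        return target_emb

    # Tiny usage example with made-up token ids and weights:
    src = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0])}
    tgt_map = {0: [(0, 0.7), (1, 0.3)]}  # target token 0 aligns to both source tokens
    print(trans_tokenize_embeddings(src, tgt_map)[0])  # -> [0.7 0.3]

In this toy case, a target token aligned to two source tokens with weights 0.7 and 0.3 receives an embedding equal to 0.7 times the first source embedding plus 0.3 times the second, which is then used as its starting point for further adaptation.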
Why it matters?
This research is important because it opens up opportunities for developing effective AI tools in languages that currently lack resources. By making it easier to adapt models to new languages, this work can help empower speakers of low-resource languages and ensure they have access to modern technology and services.
Abstract
The development of monolingual language models for low and mid-resource languages continues to be hindered by the difficulty in sourcing high-quality training data. In this study, we present a novel cross-lingual vocabulary transfer strategy, trans-tokenization, designed to tackle this challenge and enable more efficient language adaptation. Our approach focuses on adapting a high-resource monolingual LLM to an unseen target language by initializing the token embeddings of the target language using a weighted average of semantically similar token embeddings from the source language. For this, we leverage a translation resource covering both the source and target languages. We validate our method with the Tweeties, a series of trans-tokenized LLMs, and demonstrate their competitive performance on various downstream tasks across a small but diverse set of languages. Additionally, we introduce Hydra LLMs, models with multiple swappable language modeling heads and embedding tables, which further extend the capabilities of our trans-tokenization strategy. By designing a Hydra LLM based on the multilingual model TowerInstruct, we developed a state-of-the-art machine translation model for Tatar, in a zero-shot manner, completely bypassing the need for high-quality parallel data. This breakthrough is particularly significant for low-resource languages like Tatar, where high-quality parallel data is hard to come by. By lowering the data and time requirements for training high-quality models, our trans-tokenization strategy allows for the development of LLMs for a wider range of languages, especially those with limited resources. We hope that our work will inspire further research and collaboration in the field of cross-lingual vocabulary transfer and contribute to the empowerment of languages on a global scale.
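As a rough illustration of the Hydra idea of swappable language modeling heads and embedding tables around a shared model body, the following PyTorch sketch attaches one embedding table and one output head per language to a shared transformer. The class and method names (HydraLM, add_language) and the overall structure are assumptions for illustration only; the actual Hydra LLM in the paper is built on TowerInstruct, not on a toy encoder like the one used here.

    import torch
    import torch.nn as nn

    class HydraLM(nn.Module):
        """Toy sketch of a 'Hydra' setup: a shared transformer body with one
        embedding table and one LM head per language, swappable at run time."""

        def __init__(self, body, hidden_size):
            super().__init__()
            self.body = body                    # shared transformer layers (placeholder)
            self.hidden_size = hidden_size
            self.embeddings = nn.ModuleDict()   # per-language embedding tables
            self.lm_heads = nn.ModuleDict()     # per-language output heads

        def add_language(self, lang, vocab_size):
            self.embeddings[lang] = nn.Embedding(vocab_size, self.hidden_size)
            self.lm_heads[lang] = nn.Linear(self.hidden_size, vocab_size, bias=False)

        def forward(self, input_ids, src_lang, tgt_lang):
            # Embed with the source-language table, run the shared body,
            # then project with the target-language head (zero-shot MT setting).
            hidden = self.body(self.embeddings[src_lang](input_ids))
            return self.lm_heads[tgt_lang](hidden)

    # Tiny usage example with a small stand-in body and made-up vocabulary sizes:
    body = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
        num_layers=2,
    )
    model = HydraLM(body, hidden_size=64)
    model.add_language("en", vocab_size=1000)
    model.add_language("tt", vocab_size=1200)
    logits = model(torch.randint(0, 1000, (1, 8)), src_lang="en", tgt_lang="tt")
    print(logits.shape)  # torch.Size([1, 8, 1200])

The design point this sketch tries to capture is that only the language-specific components (embeddings and heads) are swapped per language, while the shared body carries over, which is what makes zero-shot cross-lingual use plausible without parallel training data.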