Facilitating large language model Russian adaptation with Learned Embedding Propagation
Mikhail Tikhomirov, Daniil Chernyshev
2024-12-31
Summary
This paper introduces Learned Embedding Propagation (LEP), a new method for adapting large language models (LLMs) to the Russian language without needing a lot of training data.
What's the problem?
Adapting LLMs to work well in specific languages like Russian can be expensive and time-consuming. Traditional methods often require a lot of data and computational resources, making it hard to create effective language-specific models. Additionally, many existing models do not share their training data, which limits the ability to replicate their success in other languages.
What's the solution?
To address these challenges, the authors propose LEP, which integrates new language knowledge into existing LLMs without extensive retraining. Because LEP has minimal impact on the model's existing knowledge, it needs less new training data and can skip the usual instruction-tuning step: instead, the embeddings learned for the new language are propagated directly into an existing instruction-tuned version of the model. The authors tested the method on two LLMs, LLaMa-3-8B and Mistral-7B, and found that it performed comparably to traditional methods while requiring less data.
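To give a concrete feel for what "propagating" learned embeddings could look like, here is a minimal sketch in PyTorch with Hugging Face transformers. It assumes a setup in which a vocabulary-extended base model has already been continued-pre-trained on Russian and the original token rows keep their positions in the extended vocabulary; the checkpoint names, the local path, and the simple "shared rows get the instruct-minus-base offset" rule are illustrative assumptions, not the authors' exact LEP procedure.

```python
# Hypothetical sketch of embedding propagation between checkpoints.
# Model identifiers, the local path, and the propagation rule are
# illustrative assumptions, not the paper's exact LEP procedure.
import torch
from transformers import AutoModelForCausalLM

BASE = "meta-llama/Meta-Llama-3-8B"               # original base model
INSTRUCT = "meta-llama/Meta-Llama-3-8B-Instruct"  # its instruction-tuned variant
ADAPTED = "path/to/russian-adapted-base"          # base model after vocabulary
                                                  # extension + continued pre-training

base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)
instruct = AutoModelForCausalLM.from_pretrained(INSTRUCT, torch_dtype=torch.bfloat16)
adapted = AutoModelForCausalLM.from_pretrained(ADAPTED, torch_dtype=torch.bfloat16)

with torch.no_grad():
    # Offsets that instruction tuning applied to the original vocabulary rows.
    delta_in = (instruct.get_input_embeddings().weight
                - base.get_input_embeddings().weight)    # [orig_vocab, hidden]
    delta_out = (instruct.get_output_embeddings().weight
                 - base.get_output_embeddings().weight)

    new_emb = adapted.get_input_embeddings().weight.clone()   # [new_vocab, hidden]
    new_head = adapted.get_output_embeddings().weight.clone()
    n_shared = delta_in.shape[0]

    # Re-apply the instruction-tuning offset to rows shared by both vocabularies;
    # rows for newly added Russian tokens keep their continued-pre-training values.
    new_emb[:n_shared] += delta_in
    new_head[:n_shared] += delta_out

    # Resize the instruct model to the extended vocabulary and swap in the
    # propagated embedding matrix and output head.
    instruct.resize_token_embeddings(new_emb.shape[0])
    instruct.get_input_embeddings().weight.copy_(new_emb)
    instruct.get_output_embeddings().weight.copy_(new_head)

# The extended tokenizer would be saved alongside the model in practice.
instruct.save_pretrained("llama-3-8b-instruct-ru-lep-sketch")
```

The point of the sketch is the division of labor: the expensive continued pre-training happens once on the base model, and the instruction-tuned variant is upgraded afterwards by a cheap weight-space operation rather than by re-running instruction tuning.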
Why it matters?
This research is important because it makes it easier and cheaper to adapt powerful AI models for different languages, particularly Russian. By letting LLMs learn new languages more efficiently, LEP can help expand access to advanced AI technologies in non-English-speaking regions, supporting better communication and understanding across languages.
Abstract
Rapid advancements in large language model (LLM) technologies have led to the introduction of powerful open-source instruction-tuned LLMs that match the text generation quality of state-of-the-art counterparts such as GPT-4. While the emergence of such models accelerates the adoption of LLM technologies in sensitive-information environments, the authors of these models do not disclose the training data necessary to replicate their results, making the achievements model-exclusive. Since these open-source models are also multilingual, this in turn reduces the benefits of training language-specific LLMs, as improved inference computation efficiency becomes the only guaranteed advantage of such a costly procedure. More cost-efficient options, such as vocabulary extension and subsequent continued pre-training, are also inhibited by the lack of access to high-quality instruction-tuning data, since it is the major factor behind the resulting LLM's task-solving capabilities. To address these limitations and cut the costs of the language adaptation pipeline, we propose Learned Embedding Propagation (LEP). Unlike existing approaches, our method has lower training data requirements due to its minimal impact on existing LLM knowledge, which we reinforce using a novel ad-hoc embedding propagation procedure that allows skipping the instruction-tuning step and instead implants the new language knowledge directly into any existing instruct-tuned variant. We evaluated four Russian vocabulary adaptations for LLaMa-3-8B and Mistral-7B, showing that LEP is competitive with traditional instruction-tuning methods, achieving performance comparable to OpenChat 3.5 and LLaMa-3-8B-Instruct, with further improvements via self-calibration and continued tuning enhancing task-solving capabilities.