Multilingual Encoder Knows more than You Realize: Shared Weights Pretraining for Extremely Low-Resource Languages

Zeli Su, Ziyin Zhang, Guixian Xu, Jianing Liu, Xu Han, Ting Zhang, Yushuang Dong

2025-02-19

Summary

This paper presents a new way to make AI language models work better with languages that don't have many written resources. It's like teaching a smart computer to understand and write in rare languages by building on what it already knows about more common ones.

What's the problem?

Current AI language models struggle with languages that have little written material available. This is especially true for the newest, most advanced AI models, which actually support fewer languages than some older multilingual models. As a result, many of the world's languages have no AI tools at all that can generate text in them.

What's the solution?

The researchers created a system that takes an AI model that already understands many languages (a multilingual encoder) and teaches it to generate text in rare languages. They do this by reusing the encoder's weights in the decoder, so the part of the model that generates language starts from what the part that understands language has already learned. This shared starting point lets the model learn new languages much more efficiently. They tested their idea on four minority languages of China and produced a model called XLM-SWCM.
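The core trick — initializing decoder layers from the encoder's weights so both sides share one learned semantic space — can be sketched in plain Python. This is an illustrative toy, not the paper's actual implementation; the class and function names are invented for the example, and real models would do this with deep-learning framework parameters.

```python
# Illustrative sketch (plain Python, no ML framework) of shared-weights
# initialization: decoder layers are built so their parameters alias the
# corresponding encoder layers' parameters. All names are hypothetical.

class TransformerLayer:
    def __init__(self, weights):
        # `weights` stands in for the layer's parameter matrices
        # (attention projections, feed-forward weights, etc.).
        self.weights = weights

def build_decoder_from_encoder(encoder_layers):
    """Create decoder layers whose parameters are shared with (not copied
    from) the paired encoder layers, so the decoder starts from the
    encoder's learned representation space."""
    return [TransformerLayer(layer.weights) for layer in encoder_layers]

# A tiny pretend "pretrained encoder" with two layers.
encoder = [TransformerLayer({"attn": [[0.1, 0.2]]}) for _ in range(2)]
decoder = build_decoder_from_encoder(encoder)

# Because the parameters are shared, an update to an encoder layer is
# immediately visible in the paired decoder layer.
encoder[0].weights["attn"][0][0] = 0.9
assert decoder[0].weights["attn"][0][0] == 0.9
```

In a real transformer this would mean tying the decoder's parameter tensors to the encoder's (or initializing them as copies before fine-tuning), so that far less low-resource data is needed to get a working generator.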

Why it matters?

This matters because it could help preserve and support languages that are at risk of being left behind in the digital age. By making AI tools that can work with rare languages, we can help communities keep their languages alive and make information more accessible in these languages. It also shows that we can create powerful AI tools for less common languages without needing huge amounts of data or computing power, which could make language technology more inclusive and diverse.

Abstract

While multilingual language models like XLM-R have advanced multilingualism in NLP, they still perform poorly in extremely low-resource languages. This situation is exacerbated by the fact that modern LLMs such as LLaMA and Qwen support far fewer languages than XLM-R, making text generation models non-existent for many languages in the world. To tackle this challenge, we propose a novel framework for adapting multilingual encoders to text generation in extremely low-resource languages. By reusing the weights between the encoder and the decoder, our framework allows the model to leverage the learned semantic space of the encoder, enabling efficient learning and effective generalization in low-resource languages. Applying this framework to four Chinese minority languages, we present XLM-SWCM, and demonstrate its superior performance on various downstream tasks even when compared with much larger models.