Small Models, Big Impact: Efficient Corpus and Graph-Based Adaptation of Small Multilingual Language Models for Low-Resource Languages
Daniil Gurgurov, Ivan Vykopal, Josef van Genabith, Simon Ostermann
2025-02-17
Summary
This paper explains how to make AI language models work better for languages that don't have much digital text available. It focuses on using smaller, more efficient models and clever ways to adapt them, instead of relying on huge AI systems.
What's the problem?
Many languages around the world don't have enough digital text for big AI models to learn from. This makes it hard for these languages to benefit from modern language technology, which can lead to inequality in access to information and services.
What's the solution?
The researchers tried different ways to adapt smaller multilingual AI models to work well with low-resource languages. They used techniques called adapters, which are like add-ons that help the AI learn new languages without changing the whole model. They tested these adapters with both plain text and structured knowledge from a knowledge graph, finding that even small amounts of data could improve how well the AI understood and worked with these languages.
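The "add-on" idea behind bottleneck adapters can be sketched in a few lines: a small down-projection and up-projection are inserted into each layer, and only those small matrices are trained while the base model stays frozen. The NumPy sketch below is illustrative only; the dimensions are hypothetical and this is not the paper's code.

```python
import numpy as np

def bottleneck_adapter(hidden, W_down, W_up):
    """Sequential bottleneck adapter: project the hidden state down to a
    small dimension, apply a nonlinearity, project back up, and add a
    residual connection so the base model's signal is preserved."""
    z = np.maximum(hidden @ W_down, 0.0)  # down-projection + ReLU
    return hidden + z @ W_up              # up-projection + residual

rng = np.random.default_rng(0)
d_model, d_bottleneck = 768, 48           # hypothetical sizes for illustration

# Only these two small matrices would be trained; the base model is frozen.
W_down = rng.normal(0, 0.02, (d_model, d_bottleneck))
W_up = rng.normal(0, 0.02, (d_bottleneck, d_model))

h = rng.normal(size=(4, d_model))         # a batch of 4 token representations
out = bottleneck_adapter(h, W_down, W_up)
print(out.shape)                          # (4, 768)
```

With these example sizes, each adapter adds only 2 × 768 × 48 ≈ 74k trainable parameters per layer, a tiny fraction of a full transformer layer, which is why adapter training is so data- and compute-efficient.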
Why it matters?
This matters because it shows we can make AI language tools work for more languages without needing massive amounts of data or super powerful computers. It could help bring useful language technology to more people around the world, even if their language isn't widely used online. This approach could make things like translation, voice assistants, and text analysis more accessible to speakers of less common languages.
Abstract
Low-resource languages (LRLs) face significant challenges in natural language processing (NLP) due to limited data. While current state-of-the-art large language models (LLMs) still struggle with LRLs, smaller multilingual models (mLMs) such as mBERT and XLM-R offer greater promise due to a better fit of their capacity to low training data sizes. This study systematically investigates parameter-efficient adapter-based methods for adapting mLMs to LRLs, evaluating three architectures: Sequential Bottleneck, Invertible Bottleneck, and Low-Rank Adaptation. Using unstructured text from GlotCC and structured knowledge from ConceptNet, we show that small adaptation datasets (e.g., up to 1 GB of free-text or a few MB of knowledge graph data) yield gains in intrinsic (masked language modeling) and extrinsic tasks (topic classification, sentiment analysis, and named entity recognition). We find that Sequential Bottleneck adapters excel in language modeling, while Invertible Bottleneck adapters slightly outperform other methods on downstream tasks due to better embedding alignment and larger parameter counts. Adapter-based methods match or outperform full fine-tuning while using far fewer parameters, and smaller mLMs prove more effective for LRLs than massive LLMs like LLaMA-3, GPT-4, and DeepSeek-R1-based distilled models. While adaptation improves performance, pre-training data size remains the dominant factor, especially for languages with extensive pre-training coverage.
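Of the three adapter architectures evaluated, Low-Rank Adaptation (LoRA) is the easiest to sketch: the frozen pre-trained weight matrix is augmented by a trainable low-rank update. The NumPy code below is a minimal illustration with made-up dimensions, not the authors' implementation.

```python
import numpy as np

def lora_forward(x, W_frozen, A, B, alpha=16):
    """LoRA: keep the pre-trained weight W frozen and add a trainable
    low-rank update B @ A, scaled by alpha / r (r = rank)."""
    r = A.shape[0]
    return x @ W_frozen.T + (x @ A.T @ B.T) * (alpha / r)

rng = np.random.default_rng(1)
d, r = 768, 8                      # hypothetical model dimension and rank

W = rng.normal(0, 0.02, (d, d))    # frozen pre-trained weight
A = rng.normal(0, 0.02, (r, d))    # trainable down-projection
B = np.zeros((d, r))               # trainable up-projection, initialized to zero

x = rng.normal(size=(2, d))
y = lora_forward(x, W, A, B)
print(y.shape)                     # (2, 768)
```

Initializing B to zero is the standard LoRA trick: the update B @ A starts at zero, so the adapted model is exactly the base model at the start of training, and the 2 × d × r low-rank parameters (about 12k here, versus 590k for the full d × d matrix) are all that needs to be learned.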