Marco-LLM: Bridging Languages via Massive Multilingual Training for Cross-Lingual Enhancement
Lingfeng Ming, Bo Zeng, Chenyang Lyu, Tianqi Shi, Yu Zhao, Xue Yang, Yefeng Liu, Yiyu Wang, Linlong Xu, Yangyang Liu, Xiaohu Zhao, Hao Wang, Heng Liu, Hao Zhou, Huifeng Yin, Zifu Shang, Haijun Li, Longyue Wang, Weihua Luo, Kaifu Zhang
2024-12-06

Summary
This paper introduces Marco-LLM, a large language model trained to work well across many different languages, with a particular focus on low-resource languages that are less commonly covered by existing models.
What's the problem?
Most large language models perform well in major languages such as English but struggle with low-resource languages, for which far less training data is available. This limits their usefulness for people who speak those languages and makes it hard to build accurate multilingual applications.
What's the solution?
To address this, the authors built Marco-LLM by collecting a large amount of multilingual data, with an emphasis on low-resource languages, and performing extensive continual pre-training starting from the Qwen2 models. The resulting model was evaluated on many multilingual benchmarks and showed significant improvements in translation and language-understanding tasks compared to existing models.
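To make the approach more concrete, here is a minimal sketch of what continual pre-training of a Qwen2 checkpoint on a multilingual text corpus could look like using the Hugging Face transformers library. The checkpoint name, corpus path, and hyperparameters below are illustrative assumptions and not the paper's actual training configuration.

```python
# Minimal sketch of continual pre-training (causal language modeling) on
# multilingual text, assuming a Qwen2 checkpoint and a plain-text corpus.
# This illustrates the general technique, not the authors' exact setup.
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

model_name = "Qwen/Qwen2-7B"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical corpus of low-resource-language text files.
raw = load_dataset("text", data_files={"train": "multilingual_corpus/*.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

# mlm=False gives the standard next-token-prediction objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="marco-llm-cpt",      # hypothetical output path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=1e-5,              # small LR, typical for continual pre-training
    num_train_epochs=1,
    bf16=True,
    logging_steps=100,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```

The key idea is simply to continue the original next-token-prediction objective on new data: keeping the learning rate small helps the model absorb the additional languages while preserving its existing English capabilities.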
Why it matters?
This research is important because it helps bridge the gap between high-resource and low-resource languages, making technology more accessible to speakers of all languages. By improving how language models handle multiple languages, Marco-LLM can enhance communication and understanding in our increasingly globalized world.
Abstract
Large Language Models (LLMs) have achieved remarkable progress in recent years; however, their excellent performance is still largely limited to major world languages, primarily English. Many LLMs continue to face challenges with multilingual tasks, especially when it comes to low-resource languages. To address this issue, we introduced Marco-LLM: Massive multilingual training for cross-lingual enhancement LLM. We have collected a substantial amount of multilingual data for several low-resource languages and conducted extensive continual pre-training using the Qwen2 models. This effort has resulted in a multilingual LLM named Marco-LLM. Through comprehensive evaluations on various multilingual benchmarks, including MMMLU, AGIEval, Belebele, Flores-200, XCOPA and many others, Marco-LLM has demonstrated substantial improvements over state-of-the-art LLMs. Furthermore, Marco-LLM achieved substantial enhancements in any-to-any machine translation tasks, showing the effectiveness of our multilingual LLM. Marco-LLM is a pioneering multilingual LLM designed to not only perform exceptionally well in multilingual tasks, including low-resource languages, but also maintain strong performance in English and other major languages, closing the performance gap between high- and low-resource language capabilities. By bridging languages, this effort demonstrates our dedication to ensuring LLMs work accurately across various languages.