FuxiTranyu: A Multilingual Large Language Model Trained with Balanced Data
Haoran Sun, Renren Jin, Shaoyang Xu, Leiyu Pan, Supryadi, Menglong Cui, Jiangcun Du, Yikun Lei, Lei Yang, Ling Shi, Juesi Xiao, Shaolin Zhu, Deyi Xiong
2024-08-14

Summary
This paper introduces FuxiTranyu, a multilingual large language model designed to perform well across many languages, especially low-resource languages that are underrepresented in most training data.
What's the problem?
Many existing language models perform well on high-resource languages like English but struggle with languages that have far less training data available. This creates a performance gap, making it hard for speakers of those languages to benefit from advanced AI tools.
What's the solution?
FuxiTranyu is trained from scratch on a carefully balanced dataset of 600 billion tokens covering 43 natural languages and 16 programming languages. It comes in three versions: the base model (FuxiTranyu-8B) and two instruction-tuned variants, one fine-tuned on multilingual instructions (FuxiTranyu-8B-SFT) and one further aligned with preference data (FuxiTranyu-8B-DPO). This balanced training helps the model perform better on a range of multilingual tasks than comparable multilingual models.
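To make the preference-alignment step concrete, here is a minimal sketch of the standard DPO objective that the abstract names as the method behind FuxiTranyu-8B-DPO. The function name, tensor layout, and beta value are illustrative assumptions, not the paper's actual training code.

```python
# Minimal sketch of the DPO loss (Rafailov et al., 2023) used for preference alignment.
# Names, shapes, and beta are assumptions for illustration, not FuxiTranyu's training code.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Each argument: 1-D tensor of summed log-probs log p(y|x), one entry per preference pair."""
    # Implicit rewards are the log-ratios of the tuned policy to the frozen reference (SFT) model.
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    # Push the reward margin between preferred and dispreferred responses through a sigmoid.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Dummy log-probabilities for a batch of two preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-15.0, -11.0]),
                torch.tensor([-13.0, -10.0]), torch.tensor([-14.5, -10.5]))
print(f"DPO loss: {loss.item():.4f}")
```

In practice the log-probabilities come from scoring each (prompt, preferred response, dispreferred response) triple with both the model being tuned and a frozen copy of the SFT model.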
Why it matters?
This research is important because it aims to improve access to AI technology for speakers of all languages, not just the most common ones. By providing a high-performing multilingual model, FuxiTranyu can help bridge the gap in AI capabilities, allowing more people to benefit from advancements in technology and research.
Abstract
Large language models (LLMs) have demonstrated prowess in a wide range of tasks. However, many LLMs exhibit significant performance discrepancies between high- and low-resource languages. To mitigate this challenge, we present FuxiTranyu, an open-source multilingual LLM designed to satisfy the research community's need for balanced and high-performing multilingual capabilities. FuxiTranyu-8B, the base model with 8 billion parameters, is trained from scratch on a meticulously balanced multilingual data repository that contains 600 billion tokens covering 43 natural languages and 16 programming languages. In addition to the base model, we also develop two instruction-tuned models: FuxiTranyu-8B-SFT, which is fine-tuned on a diverse multilingual instruction dataset, and FuxiTranyu-8B-DPO, which is further refined with DPO on a preference dataset for enhanced alignment ability. Extensive experiments on a wide range of multilingual benchmarks demonstrate the competitive performance of FuxiTranyu against existing multilingual LLMs, e.g., BLOOM-7B, PolyLM-13B, Llama-2-Chat-7B, and Mistral-7B-Instruct. Interpretability analyses at both the neuron and representation levels suggest that FuxiTranyu is able to learn consistent multilingual representations across different languages. To promote further research into multilingual LLMs and their working mechanisms, we release both the base and instruction-tuned FuxiTranyu models together with 58 pretraining checkpoints on Hugging Face and GitHub.
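Because the abstract notes that the base and instruction-tuned checkpoints are released on Hugging Face, a quick way to experiment with them is through the transformers library. This is only a usage sketch: the repository ID below is an assumption, so check the project's Hugging Face and GitHub pages for the actual model names.

```python
# Usage sketch with Hugging Face transformers; the model id is an assumed placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TJUNLP/FuxiTranyu-8B-SFT"  # assumed repository id, verify before use
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "Translate to Indonesian: The weather is nice today."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```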