SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages
Wenxuan Zhang, Hou Pong Chan, Yiran Zhao, Mahani Aljunied, Jianyu Wang, Chaoqun Liu, Yue Deng, Zhiqiang Hu, Weiwen Xu, Yew Ken Chia, Xin Li, Lidong Bing
2024-07-30

Summary
This paper introduces SeaLLMs 3, a new family of large language models designed specifically for Southeast Asian languages. The models aim to improve support for a region whose languages have long been underserved by existing language technology.
What's the problem?
Most advanced language models have been developed with high-resource languages like English and Chinese in mind, leaving many languages in Southeast Asia without adequate support. This creates a gap where speakers of these languages do not have access to the same level of technology and resources as speakers of more widely used languages.
What's the solution?
To address this issue, the authors developed SeaLLMs 3, which supports a wide range of Southeast Asian languages, including Indonesian, Vietnamese, Thai, and Tagalog. They used efficient language-enhancement techniques and a specially constructed instruction-tuning dataset to reduce training costs while maintaining high performance. The models excel at tasks such as translation, reasoning, and instruction following, surpassing previous similarly sized models in both performance and reliability. The authors also prioritized safety and cultural awareness so the models better serve local communities.
Why it matters?
This research matters because it helps democratize access to advanced language technology for speakers of Southeast Asian languages. By improving how well AI supports these languages, SeaLLMs 3 can enhance communication, education, and technology use across the region, ultimately benefiting millions of people.
Abstract
Large Language Models (LLMs) have shown remarkable abilities across various tasks, yet their development has predominantly centered on high-resource languages like English and Chinese, leaving low-resource languages underserved. To address this disparity, we present SeaLLMs 3, the latest iteration of the SeaLLMs model family, tailored for Southeast Asian languages. This region, characterized by its rich linguistic diversity, has lacked adequate language technology support. SeaLLMs 3 aims to bridge this gap by covering a comprehensive range of languages spoken in this region, including English, Chinese, Indonesian, Vietnamese, Thai, Tagalog, Malay, Burmese, Khmer, Lao, Tamil, and Javanese. Leveraging efficient language enhancement techniques and a specially constructed instruction tuning dataset, SeaLLMs 3 significantly reduces training costs while maintaining high performance and versatility. Our model excels in tasks such as world knowledge, mathematical reasoning, translation, and instruction following, achieving state-of-the-art performance among similarly sized models. Additionally, we prioritized safety and reliability by addressing both general and culture-specific considerations and incorporated mechanisms to reduce hallucinations. This work underscores the importance of inclusive AI, showing that advanced LLM capabilities can benefit underserved linguistic and cultural communities.