
OpenCSG Chinese Corpus: A Series of High-quality Chinese Datasets for LLM Training

Yijiong Yu, Ziyun Dai, Zekun Wang, Wei Wang, Ran Chen, Ji Pei

2025-01-15


Summary

This paper introduces the OpenCSG Chinese Corpus, a new collection of high-quality Chinese language datasets. It's designed to help train artificial intelligence models to understand and generate Chinese text better.

What's the problem?

AI language models are getting really good at understanding and writing in many languages, but they're not as good with Chinese. This is because there aren't enough high-quality Chinese datasets for these AI models to learn from. It's like trying to learn a language without good textbooks or practice materials.

What's the solution?

The researchers created the OpenCSG Chinese Corpus, which is like a super-library of Chinese text for AI to learn from. They made four different datasets: Fineweb-edu-chinese and Fineweb-edu-chinese-v2, which contain carefully filtered, high-quality content from Chinese websites; Cosmopedia-chinese, which is like a synthetic, AI-written textbook; and Smoltalk-chinese, which helps AI learn how to chat naturally in Chinese. They made sure these datasets cover lots of different topics and are easy for other researchers to use and build upon.
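The paper doesn't show code here, but Fineweb-edu-style corpora are typically built by scoring each web page with a quality classifier and keeping only the highest-scoring text. Below is a minimal Python sketch of that general idea, not the authors' actual pipeline; the model name, the threshold, and the truncation length are all placeholders.

```python
# Minimal sketch of classifier-based quality filtering, in the spirit of
# Fineweb-edu-style corpus construction. The model name and threshold are
# hypothetical placeholders, not the paper's actual pipeline.
from transformers import pipeline

# Hypothetical classifier that rates the educational value of Chinese text.
scorer = pipeline("text-classification", model="my-org/edu-quality-scorer-zh")

def keep_high_quality(docs, threshold=0.8):
    """Return only the documents the classifier scores above the threshold."""
    kept = []
    for doc in docs:
        # Truncate long pages so they fit in the classifier's context window.
        result = scorer(doc[:2048])[0]
        if result["score"] >= threshold:
            kept.append(doc)
    return kept

web_pages = ["...raw Chinese web text...", "...another page..."]
print(f"{len(keep_high_quality(web_pages))} of {len(web_pages)} pages kept")
```

Filtering like this trades corpus size for quality: the paper's finding that smaller models trained on the filtered data improve on benchmarks suggests the trade-off pays off.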

Why it matters?

This matters because it could help make AI much better at understanding and communicating in Chinese. Better Chinese-language AI could lead to improved translation tools, more accurate search engines for Chinese content, and smarter virtual assistants for Chinese speakers. It's also important for making sure AI technology works well for people all around the world, not just in English-speaking countries. The researchers tested their new datasets and found that AI models trained on them performed much better on Chinese benchmarks such as C-Eval, which shows that their approach really works.

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities, but their success heavily relies on the quality of pretraining corpora. For Chinese LLMs, the scarcity of high-quality Chinese datasets presents a significant challenge, often limiting their performance. To address this issue, we propose the OpenCSG Chinese Corpus, a series of high-quality datasets specifically designed for LLM pretraining, post-training, and fine-tuning. This corpus includes Fineweb-edu-chinese, Fineweb-edu-chinese-v2, Cosmopedia-chinese, and Smoltalk-chinese, each with distinct characteristics: Fineweb-edu datasets focus on filtered, high-quality content derived from diverse Chinese web sources; Cosmopedia-chinese provides synthetic, textbook-style data for knowledge-intensive training; and Smoltalk-chinese emphasizes stylistic and diverse chat-format data. The OpenCSG Chinese Corpus is characterized by its high-quality text, diverse coverage across domains, and scalable, reproducible data curation processes. Additionally, we conducted extensive experimental analyses, including evaluations on smaller parameter models, which demonstrated significant performance improvements in tasks such as C-Eval, showcasing the effectiveness of the corpus for training Chinese LLMs.
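For readers who want to experiment, datasets like these are usually distributed through the Hugging Face Hub. A sketch of how such a corpus would typically be loaded is below; the repository ID and the field name are illustrative assumptions, so check OpenCSG's hub page for the actual dataset names.

```python
# Sketch of loading one of the corpus datasets with the Hugging Face
# `datasets` library. The repository ID and the "text" field name below
# are assumptions, not confirmed identifiers.
from datasets import load_dataset

# Streaming avoids downloading the full corpus up front.
ds = load_dataset("opencsg/Fineweb-Edu-Chinese-V2", split="train", streaming=True)

# Peek at a few examples.
for i, example in enumerate(ds):
    print(example["text"][:80])
    if i >= 2:
        break
```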