CCI3.0-HQ: a large-scale Chinese dataset of high quality designed for pre-training large language models

Liangdong Wang, Bo-Wen Zhang, Chengwei Wu, Hanyu Zhao, Xiaofeng Shi, Shuhao Gu, Jijie Li, Quanyue Ma, TengFei Pan, Guang Liu

2024-10-25

Summary

This paper introduces CCI3.0-HQ, a large-scale, high-quality dataset designed for pre-training large language models (LLMs) specifically for the Chinese language.

What's the problem?

Creating effective language models requires high-quality training data, but many existing Chinese datasets are not well-curated or contain large amounts of noisy, low-quality web text. This leads to weaker performance in understanding and generating Chinese, which is a problem for developers looking to improve AI systems for Chinese speakers.

What's the solution?

The authors developed CCI3.0-HQ, a 500GB high-quality subset of the larger Chinese Corpora Internet 3.0 (CCI3.0), produced with a two-stage hybrid filtering pipeline. To measure the effect of this curation, they trained a 0.5-billion-parameter model from scratch on 100B tokens and found that training on CCI3.0-HQ gave better zero-shot results across 10 benchmarks than training on CCI3.0, SkyPile, or WanjuanV1. The quality filter itself is a compact 0.5B classifier that distills the judgments of the much larger Qwen2-72B-instruct model, trained on roughly 140,000 quality-labeled samples, and it achieves the best F1 score for classifying Chinese web data.
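
As a rough illustration of how a two-stage hybrid filter can be wired together, the sketch below runs a cheap rule-based pass before a learned quality classifier scores what remains. The function names, heuristics, and the 0.5 threshold are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of a two-stage hybrid filtering pipeline.
# Stage 1: cheap rule-based filters; Stage 2: a learned quality classifier.
# Names, rules, and the 0.5 threshold are illustrative, not from the paper.

import re
from typing import Callable, Iterable, Iterator


def rule_based_pass(docs: Iterable[str]) -> Iterator[str]:
    """Stage 1: drop obviously low-quality documents with cheap heuristics."""
    for doc in docs:
        text = doc.strip()
        if len(text) < 200:                                  # too short to be useful
            continue
        if len(set(text)) / max(len(text), 1) < 0.05:        # highly repetitive text
            continue
        if re.search(r"(点击|立即注册|免费下载){3,}", text):   # spammy boilerplate
            continue
        yield text


def classifier_pass(docs: Iterable[str],
                    score_fn: Callable[[str], float],
                    threshold: float = 0.5) -> Iterator[str]:
    """Stage 2: keep documents whose quality score clears the threshold.

    score_fn would be a small classifier (e.g. ~0.5B parameters) trained on
    labels produced by a stronger model such as Qwen2-72B-instruct.
    """
    for doc in docs:
        if score_fn(doc) >= threshold:
            yield doc


def build_hq_subset(raw_docs: Iterable[str],
                    score_fn: Callable[[str], float]) -> list[str]:
    """Chain both stages to produce the high-quality subset."""
    return list(classifier_pass(rule_based_pass(raw_docs), score_fn))
```

The key design point is that the expensive model-based scoring only runs on documents that survive the cheap heuristics, which keeps the pipeline tractable at web scale.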

Why it matters?

This research is important because it provides a high-quality resource for training language models that can better understand and generate Chinese text. By improving the quality of training data, CCI3.0-HQ can help advance the development of more effective AI systems for the large population of Chinese speakers, making technology more accessible and useful.

Abstract

We present CCI3.0-HQ (https://huggingface.co/datasets/BAAI/CCI3-HQ), a high-quality 500GB subset of the Chinese Corpora Internet 3.0 (CCI3.0) (https://huggingface.co/datasets/BAAI/CCI3-Data), developed using a novel two-stage hybrid filtering pipeline that significantly enhances data quality. To evaluate its effectiveness, we trained a 0.5B parameter model from scratch on 100B tokens across various datasets, achieving superior performance on 10 benchmarks in a zero-shot setting compared to CCI3.0, SkyPile, and WanjuanV1. The high-quality filtering process effectively distills the capabilities of the Qwen2-72B-instruct model into a compact 0.5B model, attaining optimal F1 scores for Chinese web data classification. We believe this open-access dataset will facilitate broader access to high-quality language models.
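
Because the dataset is published on the Hugging Face Hub at the URL above, it can be pulled with the standard `datasets` library. The snippet below is only a minimal usage sketch; the `"text"` field name is an assumption about the record schema, not something confirmed by the abstract.

```python
# Minimal usage sketch: stream CCI3.0-HQ from the Hugging Face Hub.
# Requires `pip install datasets`; the "text" field name is an assumption.
from datasets import load_dataset

dataset = load_dataset("BAAI/CCI3-HQ", split="train", streaming=True)

for i, record in enumerate(dataset):
    print(record.get("text", record))  # peek at a few documents
    if i >= 2:
        break
```

Streaming avoids downloading the full 500GB corpus up front, which is usually preferable when only inspecting or sampling the data.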