BaichuanSEED: Sharing the Potential of ExtensivE Data Collection and Deduplication by Introducing a Competitive Large Language Model Baseline
Guosheng Dong, Da Pan, Yiding Sun, Shusen Zhang, Zheng Liang, Xin Wu, Yanjun Shen, Fan Yang, Haoze Sun, Tianpeng Li, Mingan Lin, Jianhua Xu, Yufan Zhang, Xiaonan Nie, Lei Su, Bingning Wang, Wentao Zhang, Jiaxin Mao, Zenan Zhou, Weipeng Chen
2024-08-29

Summary
This paper introduces BaichuanSEED, a 7B-parameter large language model (LLM) built to show that an openly documented data collection and processing pipeline can produce a competitive model.
What's the problem?
The effectiveness of large language models heavily depends on the quality and variety of the data they are trained on. However, many companies keep their data collection methods secret, making it hard for others to replicate or improve upon their models. Additionally, existing methods for preparing data can be inefficient and may not produce the best results.
What's the solution?
The authors open-source a data processing pipeline that scales up data through broad collection and improves quality through deduplication and reweighting. Using this pipeline, they pretrain a 7B model, BaichuanSEED, on 3 trillion tokens of processed data without any optimization targeted at specific downstream tasks, and then apply a simple supervised fine-tuning stage. Their experiments show that BaichuanSEED trains consistently and predictably and performs comparably to advanced commercial models such as Qwen1.5 and Llama3, demonstrating the pipeline's effectiveness.
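The summary does not spell out how the pipeline is implemented, so the sketch below is only a toy illustration of the two ideas named above: removing duplicates during broad collection and reweighting the resulting source mixture. The corpus, source names, and weights are invented for the example and are not the authors' actual configuration.

```python
import hashlib
import random
from collections import defaultdict

# Hypothetical toy corpus of (source, text) pairs; the real pipeline's
# sources and formats are not described in this summary.
corpus = [
    ("web",   "The quick brown fox jumps over the lazy dog."),
    ("web",   "The quick  brown fox jumps over the lazy dog. "),  # near-duplicate
    ("books", "Call me Ishmael. Some years ago..."),
    ("code",  "def add(a, b):\n    return a + b"),
]

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies collide."""
    return " ".join(text.lower().split())

def deduplicate(docs):
    """Keep only the first document for each normalized-text hash."""
    seen, kept = set(), []
    for source, text in docs:
        digest = hashlib.md5(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append((source, text))
    return kept

def reweight(docs, weights):
    """Resample each source in proportion to an illustrative per-source weight."""
    by_source = defaultdict(list)
    for source, text in docs:
        by_source[source].append(text)
    mixture = []
    for source, texts in by_source.items():
        k = max(1, round(weights.get(source, 1.0) * len(texts)))
        mixture.extend((source, t) for t in random.choices(texts, k=k))
    return mixture

deduped = deduplicate(corpus)
# Illustrative weights only: e.g. downsample web text, upsample code.
mixed = reweight(deduped, {"web": 0.5, "books": 1.0, "code": 2.0})
print(len(corpus), "->", len(deduped), "unique docs;", len(mixed), "after reweighting")
```

In practice, production pipelines typically use fuzzy (e.g. MinHash-based) deduplication rather than exact hashing, but the overall shape, collect broadly, remove duplicates, then tune the sampling mixture, is what the summary above refers to.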
Why it matters?
This research is significant because it provides a transparent approach to building effective language models, allowing others in the field to learn from and build upon their work. By sharing their methods and findings, the authors contribute to the advancement of AI technology, making it more accessible and efficient for various applications.
Abstract
The general capabilities of Large Language Models (LLMs) rely heavily on the composition and selection of extensive pretraining datasets, which several institutions treat as commercial secrets. To mitigate this issue, we open-source the details of a universally applicable data processing pipeline and validate its effectiveness and potential by introducing a competitive LLM baseline. Specifically, the pipeline consists of broad collection to scale up the data and reweighting to improve its quality. We then pretrain a 7B model, BaichuanSEED, on 3T tokens processed by our pipeline without any deliberate optimization for downstream tasks, followed by a simple but effective supervised fine-tuning stage. BaichuanSEED demonstrates consistency and predictability throughout training and achieves performance comparable to several advanced commercial large language models, such as Qwen1.5 and Llama3, on comprehensive benchmarks. We also conduct several heuristic experiments to discuss the potential for further optimization on downstream tasks such as mathematics and coding.