YuLan-Mini: An Open Data-efficient Language Model
Yiwen Hu, Huatong Song, Jia Deng, Jiapeng Wang, Jie Chen, Kun Zhou, Yutao Zhu, Jinhao Jiang, Zican Dong, Wayne Xin Zhao, Ji-Rong Wen
2024-12-27

Summary
This paper introduces YuLan-Mini, a new data-efficient language model that achieves strong performance in understanding and generating text while using far fewer training resources than comparable models.
What's the problem?
Training large language models (LLMs) typically requires enormous amounts of data and computing power, which makes the process difficult and expensive. Many existing models have billions of parameters, and training them at scale often runs into inefficiencies and instability. There is a need for models that perform well without consuming as much data.
What's the solution?
The authors developed YuLan-Mini, a 2.42-billion-parameter language model designed to be data-efficient. They achieved this with a carefully engineered training process that combines data cleaning with data scheduling, stabilizes optimization during training, and uses an annealing stage with targeted data selection and long-context training. YuLan-Mini was trained on only 1.08 trillion tokens yet performs comparably to larger industry models trained on far more data. The authors also release the full details of their data composition and training methods so that others can reproduce their results.
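To make the "selecting the most relevant data" idea concrete, here is a minimal, hypothetical sketch of quality-based data selection for a training phase. It is not the authors' actual pipeline; the `Document` class, the `score_fn` quality scorer, and the toy scoring heuristic are all illustrative assumptions.

```python
# Minimal sketch of quality-based data selection (illustrative only, not the
# YuLan-Mini pipeline): rank documents with a quality score and keep the top
# fraction for a given training phase.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Document:
    text: str
    source: str  # e.g. "web", "code", "math" (hypothetical labels)


def select_top_fraction(
    docs: List[Document],
    score_fn: Callable[[Document], float],  # hypothetical quality scorer
    keep_fraction: float = 0.5,
) -> List[Document]:
    """Rank documents by a quality score and keep the best keep_fraction."""
    ranked = sorted(docs, key=score_fn, reverse=True)
    keep_n = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep_n]


if __name__ == "__main__":
    corpus = [
        Document("def add(a, b): return a + b", "code"),
        Document("buy cheap watches now!!! click here!!!", "web"),
        Document("The derivative of x^2 is 2x.", "math"),
    ]
    # Toy scorer standing in for a trained quality classifier:
    # a higher fraction of alphabetic characters scores higher.
    toy_score = lambda d: sum(c.isalpha() for c in d.text) / max(len(d.text), 1)
    for doc in select_top_fraction(corpus, toy_score, keep_fraction=0.67):
        print(doc.source, "->", doc.text)
```

In practice such a scorer would be a learned classifier, and the kept fraction would vary by data source and training phase; the sketch only shows the ranking-and-filtering shape of the idea.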
Why it matters?
This research is important because it demonstrates that it's possible to create powerful language models without needing excessive amounts of data or computational resources. By making high-performance models more accessible, YuLan-Mini can benefit various applications, such as natural language processing tasks in education, business, and technology, ultimately making AI tools easier to use and implement.
Abstract
Effective pre-training of large language models (LLMs) has been challenging due to the immense resource demands and the complexity of the technical processes involved. This paper presents a detailed technical report on YuLan-Mini, a highly capable base model with 2.42B parameters that achieves top-tier performance among models of similar parameter scale. Our pre-training approach focuses on enhancing training efficacy through three key technical contributions: an elaborate data pipeline that combines data cleaning with data schedule strategies, a robust optimization method to mitigate training instability, and an effective annealing approach that incorporates targeted data selection and long context training. Remarkably, YuLan-Mini, trained on 1.08T tokens, achieves performance comparable to industry-leading models that require significantly more data. To facilitate reproduction, we release the full details of the data composition for each training phase. Project details can be accessed at the following link: https://github.com/RUC-GSAI/YuLan-Mini.
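The abstract mentions an annealing approach near the end of pre-training. As a rough illustration of what learning-rate annealing looks like in general, the sketch below implements a generic warmup-stable-decay schedule; the function name, the fractions, and the final learning-rate ratio are assumptions for illustration, not the schedule actually used by YuLan-Mini (see the paper and repository for those details).

```python
# Generic warmup-stable-decay learning-rate schedule, shown only to illustrate
# the idea of "annealing" the learning rate late in pre-training.
# This is NOT the specific schedule used by YuLan-Mini.
def wsd_lr(step: int, total_steps: int, peak_lr: float = 1e-3,
           warmup_frac: float = 0.01, decay_frac: float = 0.1,
           final_lr_ratio: float = 0.1) -> float:
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    stable_end = total_steps - decay_steps

    if step < warmup_steps:              # linear warmup to the peak rate
        return peak_lr * (step + 1) / warmup_steps
    if step < stable_end:                # constant "stable" phase
        return peak_lr
    # annealing phase: linear decay down to final_lr_ratio * peak_lr
    progress = (step - stable_end) / max(decay_steps, 1)
    return peak_lr * (1.0 - (1.0 - final_lr_ratio) * progress)


if __name__ == "__main__":
    total = 1000
    for s in (0, 5, 500, 950, 999):
        print(s, round(wsd_lr(s, total), 6))
```

A schedule of this shape is convenient for data-efficient training because the annealing phase can be paired with a curated, higher-quality data mixture, which is the kind of combination the abstract refers to as targeted data selection during annealing.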