Steel-LLM: From Scratch to Open Source -- A Personal Journey in Building a Chinese-Centric LLM
Qingshui Gu, Shu Li, Tianyu Zheng, Zhaoxiang Zhang
2025-02-11
Summary
This paper introduces Steel-LLM, a new AI language model focused on understanding and generating Chinese text. It was built by researchers who wanted to create a powerful AI tool that anyone can use and learn from, even without access to expensive computers.
What's the problem?
Many existing AI language models either handle Chinese poorly or are kept secret by big companies. Most people also don't have the extremely powerful computers usually needed to create these models. This makes it hard for researchers and students to learn how to build their own AI tools, especially for languages other than English.
What's the solution?
The researchers built Steel-LLM from scratch, using clever techniques to make it work well even with limited computer power. They used mostly Chinese data to train the AI, with a bit of English mixed in. They also made sure to write down everything they did, including the problems they faced, so others could learn from their experience. They then made all of this information, including the AI model itself, freely available online for anyone to use or study.
Why does it matter?
This matters because it shows that you don't need to be a big tech company to create powerful AI tools. By making everything open and explaining their process, the Steel-LLM team is helping more people learn about and build AI, especially for languages like Chinese. This could lead to more diverse and accessible AI tools for education, research, and many other fields.
Abstract
Steel-LLM is a Chinese-centric language model developed from scratch with the goal of creating a high-quality, open-source model despite limited computational resources. Launched in March 2024, the project aimed to train a 1-billion-parameter model on a large-scale dataset, prioritizing transparency and the sharing of practical insights to assist others in the community. The training process primarily focused on Chinese data, with a small proportion of English data included, addressing gaps in existing open-source LLMs by providing a more detailed and practical account of the model-building journey. Steel-LLM has demonstrated competitive performance on benchmarks such as CEVAL and CMMLU, outperforming early models from larger institutions. This paper provides a comprehensive summary of the project's key contributions, including data collection, model design, training methodologies, and the challenges encountered along the way, offering a valuable resource for researchers and practitioners looking to develop their own LLMs. The model checkpoints and training script are available at https://github.com/zhanshijinwat/Steel-LLM.