Xmodel-2.5: 1.3B Data-Efficient Reasoning SLM
Yang Liu, Xiaolong Zhong, Ling Jiang
2025-12-01
Summary
This paper introduces Xmodel-2.5, a small but capable language model built for settings where large models are too computationally expensive or impractical, such as on phones or in cost-sensitive deployments.
What's the problem?
Large language models, while good at tasks like reasoning and using tools, require a lot of computing power, making them difficult to use on devices with limited resources or in situations where running them is expensive. Essentially, they're too big and slow for many real-world applications.
What's the solution?
The researchers created Xmodel-2.5, a smaller model with 1.3 billion parameters. They used a training technique called maximal-update parameterization (μP), which let them tune hyper-parameters on a tiny 20M-parameter proxy model and transfer them directly to the full-sized Xmodel-2.5. They also found that switching optimizers mid-training, using AdamW early on for stability and Muon during the final decay phase, improved reasoning performance. Finally, they used FP8 mixed-precision training to speed things up without losing accuracy.
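The μP idea above can be sketched in a few lines: hyper-parameters such as the learning rate are tuned once on a narrow proxy model, then rescaled by width when moving to the full model. This is a minimal illustration of the common 1/width rule for matrix-like (hidden) weights; the widths and base rate below are placeholders, not Xmodel-2.5's actual settings.

```python
def mup_hidden_lr(base_lr: float, base_width: int, target_width: int) -> float:
    """mu-P transfer rule of thumb for hidden (matrix-like) weights:
    the per-layer learning rate scales as 1/width, so a value tuned
    on a narrow proxy carries over to the wide target model.
    All numbers here are illustrative, not the paper's real values."""
    return base_lr * base_width / target_width

# Tune on a small proxy (hypothetical width 256), deploy at width 2048:
proxy_lr = 1.0e-2
full_lr = mup_hidden_lr(proxy_lr, base_width=256, target_width=2048)  # 8x smaller
```

Embedding and output layers follow different μP scaling rules; this sketch covers only the hidden-weight case.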
Why it matters?
This work is important because it demonstrates how to build smaller language models that still perform well. This opens the door to using these models in more places, like on your phone, in cars, or in other devices where large models simply won't fit or are too costly to run. It makes advanced AI more accessible and practical for a wider range of applications.
Abstract
Large language models deliver strong reasoning and tool-use skills, yet their computational demands make them impractical for edge or cost-sensitive deployments. We present Xmodel-2.5, a 1.3-billion-parameter small language model designed as a drop-in agent core. Training with maximal-update parameterization (μP) allows hyper-parameters tuned on a 20M-parameter proxy to transfer directly to the full model, even under the parameter-tied tie-word-embedding architecture. A 1.4T-token Warmup-Stable-Decay curriculum is used, and we further show that switching from AdamW to Muon during the decay phase improves the 13-task reasoning average by 4.58% while keeping every other hyper-parameter fixed, verifying that early AdamW stability can be paired with late Muon sharpening for better downstream performance. FP8 mixed-precision training balances accuracy and throughput. All checkpoints, recipes, and evaluation code are released under the Apache-2.0 license: https://huggingface.co/XiaoduoAILab/Xmodel-2.5 and https://huggingface.co/XiaoduoAILab/Xmodel-2.5-history (training checkpoints). Training code and evaluation harness: https://github.com/XiaoduoAILab/Xmodel-2.5.
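The Warmup-Stable-Decay curriculum and the decay-phase optimizer switch described in the abstract can be sketched as a single schedule function. The abstract says the switch happens during the decay phase; a simple choice, assumed here, is to switch at the start of decay. Step counts, fractions, and the peak rate are placeholders, not the paper's actual values.

```python
def wsd_schedule(step: int, total_steps: int, peak_lr: float = 4e-4,
                 warmup_frac: float = 0.01, decay_frac: float = 0.1):
    """Warmup-Stable-Decay: linear warmup to peak_lr, a long flat
    plateau, then linear decay to zero. Returns (lr, optimizer_name).
    The AdamW-to-Muon switch is placed at the start of decay, which
    is one plausible reading of "during the decay phase"; all
    fractions and rates are illustrative."""
    warmup_end = max(int(total_steps * warmup_frac), 1)
    decay_start = int(total_steps * (1.0 - decay_frac))
    if step < warmup_end:                       # warmup: ramp up under AdamW
        return peak_lr * step / warmup_end, "adamw"
    if step < decay_start:                      # stable plateau under AdamW
        return peak_lr, "adamw"
    frac_left = (total_steps - step) / max(total_steps - decay_start, 1)
    return peak_lr * frac_left, "muon"          # decay: anneal under Muon
```

In a real training loop, changing optimizers mid-run also means constructing the new optimizer over the same parameters; this sketch only shows the schedule logic.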