
Motif 2 12.7B technical report

Junghwan Lim, Sungmin Lee, Dongseok Kim, Taehyun Kim, Eunhwan Park, Jeesoo Lee, Jeongdoo Lee, Junhyeok Lee, Wai Ting Cheung, Dahye Choi, Jaeheui Her, Jaeyeon Huh, Hanbin Jung, Changjin Kang, Beomgyu Kim, Minjae Kim, Taewhan Kim, Youngrok Kim, Hyukjin Kweon, Haesol Lee, Kungyu Lee, Dongpin Oh

2025-11-13


Summary

This paper introduces Motif-2-12.7B, a large language model designed to be highly efficient and to perform competitively despite being much smaller than many frontier models.

What's the problem?

Large language models are getting incredibly big, which means they require a lot of computing power and money to train and run. The challenge is to create a model that can still understand and generate language effectively, but without needing massive resources. Existing models often struggle to balance performance with efficiency, especially when dealing with complex instructions.

What's the solution?

The researchers built Motif-2-12.7B by scaling up a smaller model (Motif-2.6B). They added a technique called Grouped Differential Attention (GDA), which separates attention into signal and noise-control pathways so the model can focus on the important parts of the text. They also used a curriculum-driven data schedule, gradually shifting the mix of training data across linguistic, mathematical, scientific, and programming domains. Finally, they combined the MuonClip optimizer with custom high-performance kernels to make training faster and more memory-efficient, and then fine-tuned the model in three stages to improve its instruction following, compositional understanding, and linguistic precision.
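The summary describes GDA as separating signal from noise in attention. As a rough illustration, differential attention computes two softmax attention maps and subtracts a scaled "noise" map from the "signal" map; the grouped variant assigns head groups to each role. The function below is a toy NumPy sketch under those assumptions, not the paper's actual implementation (the head-grouping and the learned scaling are simplified here).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def differential_attention(q1, k1, q2, k2, v, lam=0.5):
    """Toy differential attention for a single head pair.

    Two attention maps are computed from separate query/key projections;
    the second ("noise-control") map is scaled by lam and subtracted from
    the first ("signal") map, cancelling attention mass that both maps
    assign to uninformative positions. In GDA, unequal groups of heads
    play these two roles -- that grouping is omitted in this sketch.
    """
    d = q1.shape[-1]
    a1 = softmax(q1 @ k1.T / np.sqrt(d))   # signal attention map
    a2 = softmax(q2 @ k2.T / np.sqrt(d))   # noise-control attention map
    return (a1 - lam * a2) @ v             # differential combination
```

With identical signal and noise projections the two maps coincide, so the output is simply `(1 - lam)` times ordinary attention; the mechanism only changes behavior when the two pathways learn to attend differently.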

Why it matters?

This work is important because it shows that you don't necessarily need a gigantic model to achieve good performance. By focusing on clever design choices and efficient training methods, they were able to create a model that competes with much larger ones. This could make powerful language models more accessible to researchers and developers who don't have access to enormous computing resources, and it points the way towards building more sustainable and practical AI systems.

Abstract

We introduce Motif-2-12.7B, a new open-weight foundation model that pushes the efficiency frontier of large language models by combining architectural innovation with system-level optimization. Designed for scalable language understanding and robust instruction generalization under constrained compute budgets, Motif-2-12.7B builds upon Motif-2.6B with the integration of Grouped Differential Attention (GDA), which improves representational efficiency by disentangling signal and noise-control attention pathways. The model is pre-trained on 5.5 trillion tokens spanning diverse linguistic, mathematical, scientific, and programming domains using a curriculum-driven data scheduler that gradually changes the data composition ratio. The training system leverages the MuonClip optimizer alongside custom high-performance kernels, including fused PolyNorm activations and the Parallel Muon algorithm, yielding significant throughput and memory efficiency gains in large-scale distributed environments. Post-training employs a three-stage supervised fine-tuning pipeline that successively enhances general instruction adherence, compositional understanding, and linguistic precision. Motif-2-12.7B demonstrates competitive performance across diverse benchmarks, showing that thoughtful architectural scaling and optimized training design can rival the capabilities of much larger models.
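The abstract mentions a curriculum-driven data scheduler that gradually changes the data composition ratio during pre-training. One simple way to realize such a schedule is to interpolate between a starting and an ending mixture of domain sampling ratios. The sketch below, with hypothetical domain names and ratios, illustrates the idea only; the paper's actual schedule and mixtures are not specified here.

```python
def mixture_at(step, total_steps, start_mix, end_mix):
    """Linearly interpolate per-domain sampling ratios over training.

    start_mix / end_mix map domain name -> sampling ratio; the result is
    renormalized so the ratios always sum to 1. The linear schedule and
    the domain names used with it are illustrative assumptions.
    """
    t = step / total_steps
    mix = {k: (1 - t) * start_mix[k] + t * end_mix[k] for k in start_mix}
    z = sum(mix.values())
    return {k: v / z for k, v in mix.items()}
```

A data loader would call this each step (or each phase) to decide which domain to draw the next batch from, shifting weight toward, say, math and code as training progresses.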