KORMo: Korean Open Reasoning Model for Everyone

Minjun Kim, Hyeonseok Lim, Hangyeol Yoo, Inho Won, Seungwoo Song, Minkyung Cho, Junhun Yuk, Changsu Choi, Dongjae Shin, Huige Lee, Hoyun Song, Alice Oh, Kyungtae Lim

2025-10-13

Summary

This research introduces KORMo-10B, a new large language model designed specifically for Korean and, notably, trained mostly on artificial, or synthetic, data.

What's the problem?

Creating powerful language models usually requires massive amounts of text data, but gathering enough high-quality text for languages other than English, such as Korean, is difficult and expensive. Existing multilingual models often perform worse in Korean than in English, and no truly open-source Korean language model had been built from scratch.

What's the solution?

The researchers tackled this by training KORMo-10B largely on synthetic data, that is, text generated by machines rather than collected from real-world sources. They carefully designed this data to cover a broad range of Korean linguistic features and diverse instruction styles. They then showed that the synthetic data did not destabilize training or degrade the model, and that the resulting model performed on par with other available open-weight models across benchmarks of reasoning, knowledge, and instruction following.

Why it matters?

This work is significant because it demonstrates that a strong language model can be built for a language with limited resources, like Korean, without relying on huge amounts of real-world text data. By releasing all the data, code, and training details, the authors have created a blueprint for building similar models for other underrepresented languages, promoting more inclusive AI development.

Abstract

This work presents the first large-scale investigation into constructing a fully open bilingual large language model (LLM) for a non-English language, specifically Korean, trained predominantly on synthetic data. We introduce KORMo-10B, a 10.8B-parameter model trained from scratch on a Korean-English corpus in which 68.74% of the Korean portion is synthetic. Through systematic experimentation, we demonstrate that synthetic data, when carefully curated with balanced linguistic coverage and diverse instruction styles, does not cause instability or degradation during large-scale pretraining. Furthermore, the model achieves performance comparable to that of contemporary open-weight multilingual baselines across a wide range of reasoning, knowledge, and instruction-following benchmarks. Our experiments reveal two key findings: (1) synthetic data can reliably sustain long-horizon pretraining without model collapse, and (2) bilingual instruction tuning enables near-native reasoning and discourse coherence in Korean. By fully releasing all components including data, code, training recipes, and logs, this work establishes a transparent framework for developing synthetic data-driven fully open models (FOMs) in low-resource settings and sets a reproducible precedent for future multilingual LLM research.