
Pushing on Multilingual Reasoning Models with Language-Mixed Chain-of-Thought

Guijin Son, Donghun Yang, Hitesh Laxmichand Patel, Amit Agarwal, Hyunwoo Ko, Chanuk Lim, Srikant Panda, Minhyuk Kim, Nikunj Drolia, Dasol Choi, Kyong-Ha Lee, Youngjae Yu

2025-10-09


Summary

This paper focuses on improving the reasoning abilities of AI models in languages other than English, specifically Korean. It addresses the fact that most research on making AI 'think' step-by-step (chain-of-thought reasoning) has been done with English-language models, and it is unclear how well those techniques transfer to other languages.

What's the problem?

Large language models are getting better at complex reasoning by thinking through problems step-by-step, but this ability is mostly developed for English. Applying these methods to other languages is difficult because direct translation can introduce errors and doesn't always capture the nuances of how people reason in that language. There's a lack of good datasets and models specifically designed to test and improve reasoning in non-English languages.

What's the solution?

The researchers created a new method called 'Language-Mixed CoT' where the AI switches between English and the target language (Korean in this case) during its reasoning process. English acts as a stable base, while the target language handles the specific problem. They also built a large, high-quality dataset of Korean questions and reasoning traces called 'Yi-Sang' and used it to train several new AI models, with their best model being 'KO-REAson-35B'.
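To make the idea concrete, here is a minimal sketch of what a language-mixed reasoning trace might look like. The exact formatting used in the paper's Yi-Sang data is not specified here, so the function name, the `<think>` delimiters, and the example content are all illustrative assumptions; the point is only to show English acting as the reasoning scaffold while Korean carries the problem-specific content.

```python
# Illustrative sketch of a "language-mixed" chain-of-thought trace.
# The trace format (delimiters, field names) is an assumption, not the
# paper's actual schema: it demonstrates anchoring generic reasoning
# moves in English while keeping problem-specific terms in Korean.

def build_language_mixed_trace(
    question_ko: str,
    steps: list[tuple[str, str]],
    answer_ko: str,
) -> str:
    """Assemble a reasoning trace that interleaves English scaffolding
    with Korean problem-specific content."""
    lines = [f"Question (Korean): {question_ko}", "<think>"]
    for english_scaffold, korean_detail in steps:
        # English carries the generic reasoning move; Korean carries
        # the language-specific entities, terms, and quotations.
        lines.append(f"{english_scaffold} {korean_detail}")
    lines.append("</think>")
    lines.append(f"Answer: {answer_ko}")
    return "\n".join(lines)

trace = build_language_mixed_trace(
    question_ko="조선의 4대 임금은 누구인가?",
    steps=[
        ("First, recall the order of Joseon kings:", "태조, 정종, 태종, 세종..."),
        ("The fourth ruler is therefore", "세종대왕."),
    ],
    answer_ko="세종대왕",
)
print(trace)
```

A dataset of such traces could then be used for supervised fine-tuning, which is the broad strategy the paper takes with its Qwen3-32B-generated reasoning traces.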

Why it matters?

This work is important because it shows that we can significantly improve AI reasoning in languages other than English by carefully considering the language itself. The new dataset and models are publicly available, which will help other researchers build even better AI systems for a wider range of languages and applications, moving beyond the English-centric focus of current AI development.

Abstract

Recent frontier models employ long chain-of-thought reasoning to explore solution spaces in context and achieve stronger performance. While many works study distillation to build smaller yet capable models, most focus on English and little is known about language-specific reasoning. To bridge this gap, we first introduce **Language-Mixed CoT**, a reasoning schema that switches between English and a target language, using English as an anchor to excel in reasoning while minimizing translation artifacts. As a Korean case study, we curate **Yi-Sang**: 5.79M native-Korean prompts from web Q&A, exams, STEM, and code; 3.7M long reasoning traces generated from Qwen3-32B; and a targeted 260k high-yield subset. We train nine models (4B-35B) across six families (Qwen2.5, Llama-3.1, Gemma-3, etc.). Our best model, **KO-REAson-35B**, achieves state-of-the-art performance, with the highest overall average score (64.0 ± 25), ranking first on 5/9 benchmarks and second on the remainder. Smaller and mid-sized models also benefit substantially, with an average improvement of +18.6 points across the nine evaluated benchmarks. Ablations show **Language-Mixed CoT** is more effective than monolingual CoT, also resulting in cross-lingual and multimodal performance gains. We release our data-curation pipeline, evaluation system, datasets, and models to advance research on language-specific reasoning. Data and model collection: https://huggingface.co/KOREAson.