Solar Open Technical Report
Sungrae Park, Sanghoon Kim, Jungho Cho, Gyoungjin Gim, Dawoon Jung, Mikyoung Cha, Eunhae Choo, Taekgyu Hong, Minbyul Jeong, SeHwan Joo, Minsoo Khang, Eunwon Kim, Minjeong Kim, Sujeong Kim, Yunsu Kim, Hyeonju Lee, Seunghyun Lee, Sukyung Lee, Siyoung Park, Gyungin Shin, Inseo Song, Wonho Song
2026-01-14
Summary
This paper introduces Solar Open, a large language model with 102 billion parameters designed to work well for languages that get little attention in AI research, focusing on both English and Korean.
What's the problem?
Building powerful language models usually requires massive amounts of text data, but many languages don't *have* that much data available. This makes it hard to create AI that can understand and generate text effectively in those languages. The challenge is to build a competitive model when starting with limited resources, while ensuring it can actually *reason* rather than just repeat things it has seen.
What's the solution?
The researchers tackled this problem in three main ways. First, they created a huge dataset – 4.5 trillion tokens of synthetic text – designed to be high-quality, relevant to specific domains, and useful for a type of AI training called reinforcement learning. Second, they organized this data into a progressive curriculum spanning 20 trillion training tokens, gradually raising quality thresholds while balancing the mix of domains. Finally, they used a new reinforcement learning framework called SnapPO to efficiently improve the model's reasoning abilities.
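To make the curriculum idea concrete, here is a minimal sketch of a phased data-mixing schedule. The phase names, token splits, quality thresholds, and domain labels below are illustrative assumptions, not the actual Solar Open recipe; only the 20-trillion-token budget and the idea of jointly scheduling composition, quality thresholds, and domain coverage come from the report.

```python
# Hypothetical sketch of a progressive data curriculum. The concrete numbers
# (phase boundaries, domain names, quality thresholds) are illustrative only.

from dataclasses import dataclass

TOTAL_TOKENS = 20_000_000_000_000  # 20T-token training budget from the report


@dataclass
class CurriculumPhase:
    """One stage of the curriculum: a token budget, a minimum quality score
    for admitted documents, and a target mixture over data domains."""
    name: str
    token_budget: int
    min_quality: float                # assumed 0-1 quality score per document
    domain_mix: dict[str, float]      # fractions must sum to 1.0


# Illustrative three-phase schedule: broad lower-threshold data first, then
# higher-quality and more domain-specific data, then RL-oriented data.
PHASES = [
    CurriculumPhase(
        name="foundation",
        token_budget=12_000_000_000_000,
        min_quality=0.3,
        domain_mix={"web_en": 0.55, "web_ko": 0.25, "code": 0.10, "math": 0.10},
    ),
    CurriculumPhase(
        name="domain_enrichment",
        token_budget=6_000_000_000_000,
        min_quality=0.6,
        domain_mix={"web_en": 0.30, "web_ko": 0.25, "code": 0.20, "math": 0.15,
                    "synthetic_domain": 0.10},
    ),
    CurriculumPhase(
        name="reasoning_and_rl_prep",
        token_budget=2_000_000_000_000,
        min_quality=0.8,
        domain_mix={"web_en": 0.15, "web_ko": 0.15, "code": 0.20, "math": 0.20,
                    "synthetic_domain": 0.15, "rl_oriented": 0.15},
    ),
]


def phase_at(tokens_seen: int) -> CurriculumPhase:
    """Return the curriculum phase that applies after `tokens_seen` tokens."""
    consumed = 0
    for phase in PHASES:
        consumed += phase.token_budget
        if tokens_seen < consumed:
            return phase
    return PHASES[-1]


if __name__ == "__main__":
    assert sum(p.token_budget for p in PHASES) == TOTAL_TOKENS
    for checkpoint in (1e12, 13e12, 19e12):
        p = phase_at(int(checkpoint))
        print(f"{checkpoint / 1e12:>4.0f}T tokens -> phase={p.name}, "
              f"min_quality={p.min_quality}, mix={p.domain_mix}")
```

In practice a schedule like this would be tuned against measured data quality and held-out evaluations rather than fixed fractions; the sketch only shows how composition, quality thresholds, and domain coverage can be scheduled jointly over a fixed token budget.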
Why does it matter?
This work is important because it shows that it's possible to build strong language models for languages that are often overlooked. It provides a practical method for developing AI that can benefit a wider range of people and cultures, and it pushes the boundaries of what's achievable even with limited data. It demonstrates a pathway to more inclusive AI development.
Abstract
We introduce Solar Open, a 102B-parameter bilingual Mixture-of-Experts language model for underserved languages. Solar Open demonstrates a systematic methodology for building competitive LLMs by addressing three interconnected challenges. First, to train effectively despite data scarcity for underserved languages, we synthesize 4.5T tokens of high-quality, domain-specific, and RL-oriented data. Second, we coordinate this data through a progressive curriculum jointly optimizing composition, quality thresholds, and domain coverage across 20 trillion tokens. Third, to enable reasoning capabilities through scalable RL, we apply our proposed framework SnapPO for efficient optimization. Across benchmarks in English and Korean, Solar Open achieves competitive performance, demonstrating the effectiveness of this methodology for underserved language AI development.