Kanana: Compute-efficient Bilingual Language Models
Kanana LLM Team, Yunju Bak, Hojin Lee, Minho Ryu, Jiyeon Ham, Seungjae Jung, Daniel Wontae Nam, Taegyeong Eo, Donghun Lee, Doohae Jung, Boseop Kim, Nayeon Kim, Jaesun Park, Hyunho Kim, Hyunwoong Ko, Changmin Lee, Kyoung-Woon On, Seulye Baeg, Junrae Cho, Sunghee Jung, Jieun Kang, EungGyun Kim
2025-02-27
Summary
This paper introduces Kanana, a family of AI language models that can understand and generate text well in both Korean and English while using less computing power than other similar AI models.
What's the problem?
Current AI language models that work with multiple languages often need a lot of computing power, which makes them expensive and hard to use. Also, there aren't many good AI models specifically designed for the Korean language.
What's the solution?
The researchers created Kanana, which uses clever techniques like carefully choosing training data, training the model in stages, and trimming unnecessary parts of the model. They also fine-tuned the model to make it better at talking with people. The team made different versions of Kanana, ranging from smaller ones with 2.1 billion parameters to bigger ones with 32.5 billion parameters.
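One of the techniques mentioned, depth up-scaling, grows a trained model by stacking two partially overlapping copies of its layer stack. Here is a minimal sketch of that idea, with integers standing in for transformer blocks; the layer counts and overlap below are illustrative, not the values Kanana actually uses.

```python
def depth_up_scale(layers, overlap):
    """Depth up-scaling sketch: concatenate two copies of the layer
    stack, dropping `overlap` layers at the seam so the result has
    2 * len(layers) - 2 * overlap layers.
    """
    top = layers[: len(layers) - overlap]      # first copy, last `overlap` layers removed
    bottom = layers[overlap:]                  # second copy, first `overlap` layers removed
    return top + bottom


# Illustrative example: a 32-layer model up-scaled with an overlap of 8
# yields a 48-layer model.
base_model = list(range(32))
bigger_model = depth_up_scale(base_model, overlap=8)
print(len(bigger_model))  # 48
```

The up-scaled model is then further pre-trained so the duplicated layers specialize, which is cheaper than training the larger model from scratch.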
Why it matters?
This matters because it makes powerful AI language tools more accessible, especially for Korean language research. It could lead to better translation services, more natural chatbots, and other language technologies that work well in both Korean and English without needing super expensive computers. By making some versions of Kanana public, the researchers are helping other scientists improve language AI, particularly for Korean.
Abstract
We introduce Kanana, a series of bilingual language models that demonstrate exceeding performance in Korean and competitive performance in English. The computational cost of Kanana is significantly lower than that of state-of-the-art models of similar size. The report details the techniques employed during pre-training to achieve compute-efficient yet competitive models, including high quality data filtering, staged pre-training, depth up-scaling, and pruning and distillation. Furthermore, the report outlines the methodologies utilized during the post-training of the Kanana models, encompassing supervised fine-tuning and preference optimization, aimed at enhancing their capability for seamless interaction with users. Lastly, the report elaborates on plausible approaches used for language model adaptation to specific scenarios, such as embedding, retrieval augmented generation, and function calling. The Kanana model series spans from 2.1B to 32.5B parameters with 2.1B models (base, instruct, embedding) publicly released to promote research on Korean language models.
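The abstract also names pruning and distillation, where a smaller student model is trained to match a larger teacher's output distribution. A common formulation is the temperature-softened KL divergence between teacher and student logits; the sketch below is a generic illustration of that loss, not Kanana's specific recipe (the temperature and logits are made-up values).

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to a probability distribution, softened by temperature."""
    exps = [math.exp(x / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as is conventional so gradients stay comparable
    across temperatures.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return kl * temperature ** 2


# The loss is zero when the student exactly matches the teacher,
# and positive otherwise.
print(distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0]))
```

In practice this term is minimized over training batches, often combined with the standard next-token cross-entropy loss, so the pruned student recovers most of the teacher's quality at a fraction of the size.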