Evaluating Multimodal Generative AI with Korean Educational Standards

Sanghee Park, Geewook Kim

2025-02-24

Summary

This paper introduces KoNET, a new benchmark that tests how well AI systems can understand and answer questions in Korean, using real exams from different school levels in Korea.

What's the problem?

Most tests for AI are in English, which doesn't show how well these systems work in other languages. Also, it's hard to know if AI can handle the same kind of tough questions that students face in real exams, especially in languages like Korean.

What's the solution?

The researchers created KoNET, which uses actual Korean school tests from elementary to college level. They tested different types of AI models on these exams, looking at how well they do with various subjects and comparing their mistakes to those made by human students. They're also making all their testing tools available for free online so other researchers can use them.
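The evaluation idea described above — scoring a model's answers against an exam's answer key and comparing its mistakes to human error rates — can be sketched roughly as follows. This is a minimal illustration, not the actual KoNET code; the question IDs, answers, and human error rate below are made-up placeholders.

```python
# Hypothetical sketch of a KoNET-style evaluation step.
# All exam data and the human error rate are illustrative placeholders.

def score_exam(model_answers: dict, answer_key: dict) -> float:
    """Return the model's accuracy against an exam's answer key."""
    correct = sum(1 for qid, ans in model_answers.items()
                  if answer_key.get(qid) == ans)
    return correct / len(answer_key)

def error_gap_vs_humans(model_accuracy: float, human_error_rate: float) -> float:
    """Difference between the model's error rate and the average
    human test-taker's error rate (positive = model worse)."""
    return (1.0 - model_accuracy) - human_error_rate

if __name__ == "__main__":
    answer_key = {"q1": "A", "q2": "C", "q3": "B", "q4": "D"}
    model_answers = {"q1": "A", "q2": "C", "q3": "D", "q4": "D"}
    acc = score_exam(model_answers, answer_key)      # 3 of 4 correct
    print(f"accuracy: {acc:.2f}")
    print(f"error gap vs humans: {error_gap_vs_humans(acc, 0.20):+.2f}")
```

In the real benchmark this kind of comparison is run per subject and per exam level (elementary through college), which is what lets the authors see where models diverge from human students.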

Why it matters?

This matters because it helps us see how good AI really is at understanding and working with languages other than English. It could lead to better AI systems that can help students and teachers in Korea and other countries where English isn't the main language. By using real school tests, it also shows if AI is ready to be used in education and where it needs to improve.

Abstract

This paper presents the Korean National Educational Test Benchmark (KoNET), a new benchmark designed to evaluate Multimodal Generative AI Systems using Korean national educational tests. KoNET comprises four exams: the Korean Elementary General Educational Development Test (KoEGED), Middle (KoMGED), High (KoHGED), and College Scholastic Ability Test (KoCSAT). These exams are renowned for their rigorous standards and diverse questions, facilitating a comprehensive analysis of AI performance across different educational levels. By focusing on Korean, KoNET provides insights into model performance in less-explored languages. We assess a range of models - open-source, open-access, and closed APIs - by examining difficulties, subject diversity, and human error rates. The code and dataset builder will be made fully open-sourced at https://github.com/naver-ai/KoNET.