FIN-bench-v2: A Unified and Robust Benchmark Suite for Evaluating Finnish Large Language Models

Joona Kytöniemi, Jousia Piha, Akseli Reunamo, Fedor Vitiugin, Farrokh Mehryary, Sampo Pyysalo

2025-12-16

Summary

This paper introduces FIN-bench-v2, a comprehensive suite of tests designed to measure how well large language models (the AI systems behind tools like chatbots) understand and work with the Finnish language.

What's the problem?

Evaluating AI models is tricky, especially for languages other than English. Existing Finnish language tests were scattered, used inconsistent formats, and weren't always reliable for comparing models. It was hard to get a clear picture of which models were *actually* good at Finnish, and some tests may have been too easy, or too noisy, to tell models apart meaningfully.

What's the solution?

The researchers created FIN-bench-v2 by combining and standardizing existing Finnish language tests. They converted everything into a common format that's easy for developers to use with popular AI evaluation tools. They then trained a set of smaller AI models and studied how each test's scores evolved during training, keeping only the tests that truly challenge the models and produce stable, meaningful rankings. Finally, they evaluated several larger, more powerful AI models on these refined tests.
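To give a flavor of what "a common format" means in practice, here is a hedged sketch of what a unified multiple-choice record might look like after conversion. The actual FIN-bench-v2 schema and field names are defined in the project's repositories; the field names and the Finnish sentiment example below are illustrative assumptions, not the paper's real data.

```python
# Hypothetical unified record for a multiple-choice task.
# Field names ("question", "choices", "answer") are assumed for
# illustration; the real schema lives in the FIN-bench-v2 repos.
record = {
    # "What is the tone of the following sentence:
    #  'This movie was absolutely great!'"
    "question": (
        "Mikä on seuraavan lauseen sävy: "
        "'Tämä elokuva oli aivan loistava!'"
    ),
    # positive / negative / neutral
    "choices": ["positiivinen", "negatiivinen", "neutraali"],
    # Index of the correct option in "choices".
    "answer": 0,
}

# Every task sharing one shape like this lets a single evaluation
# loop score all of them without per-dataset parsing code.
print(record["choices"][record["answer"]])
```

Once every dataset follows one shape, a single evaluation loop can score all of them, which is what makes the suite easy to plug into standard tooling.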

Why it matters?

This work is important because it provides a reliable way to assess AI performance in Finnish. This will help developers build better AI tools for Finnish speakers, and it allows researchers to track progress in AI’s ability to understand and generate Finnish text. Essentially, it’s a crucial step towards making AI more accessible and useful in Finland.

Abstract

We introduce FIN-bench-v2, a unified benchmark suite for evaluating large language models in Finnish. FIN-bench-v2 consolidates Finnish versions of widely used benchmarks together with an updated and expanded version of the original FIN-bench into a single, consistently formatted collection, covering multiple-choice and generative tasks across reading comprehension, commonsense reasoning, sentiment analysis, world knowledge, and alignment. All datasets are converted to HuggingFace Datasets, which include both cloze and multiple-choice prompt formulations with five variants per task, and we incorporate human annotation or review for machine-translated resources such as GoldenSwag and XED. To select robust tasks, we pretrain a set of 2.15B-parameter decoder-only models and use their learning curves to compute monotonicity, signal-to-noise, non-random performance, and model ordering consistency, retaining only tasks that satisfy all criteria. We further evaluate a set of larger instruction-tuned models to characterize performance across tasks and prompt formulations. All datasets, prompts, and evaluation configurations are publicly available via our fork of the Language Model Evaluation Harness at https://github.com/LumiOpen/lm-evaluation-harness. Supplementary resources are released in a separate repository at https://github.com/TurkuNLP/FIN-bench-v2.
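The task-selection criteria in the abstract (monotonicity, signal-to-noise, and so on) can be sketched concretely. The snippet below is an illustrative, assumed formulation, not the paper's actual definitions: given a learning curve of accuracies measured at successive pretraining checkpoints, it computes a simple monotonicity score and a signal-to-noise ratio. A task whose curve climbs steadily above chance passes; one that hovers around chance does not.

```python
# Hedged sketch of two learning-curve robustness criteria.
# The exact metrics used in FIN-bench-v2 may differ; these are
# common, simple formulations chosen for illustration.
from statistics import stdev


def monotonicity(curve):
    """Fraction of consecutive checkpoints where accuracy improves."""
    pairs = list(zip(curve, curve[1:]))
    return sum(b > a for a, b in pairs) / len(pairs)


def signal_to_noise(curve):
    """Total improvement divided by step-to-step variation."""
    deltas = [b - a for a, b in zip(curve, curve[1:])]
    noise = stdev(deltas)
    signal = curve[-1] - curve[0]
    return signal / noise if noise > 0 else float("inf")


# A well-behaved task: accuracy climbs steadily above chance (0.25
# for four options).
good = [0.25, 0.31, 0.38, 0.45, 0.52, 0.58]
# A noisy task: accuracy hovers around chance and never improves.
noisy = [0.25, 0.28, 0.24, 0.27, 0.23, 0.26]

print(monotonicity(good), signal_to_noise(good))
print(monotonicity(noisy), signal_to_noise(noisy))
```

Under a selection rule like the paper describes, only tasks passing all such criteria (including non-random performance and consistent model ordering, not shown here) would be retained in the final suite.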