FiNERweb: Datasets and Artifacts for Scalable Multilingual Named Entity Recognition
Jonas Golde, Patrick Haller, Alan Akbik
2025-12-18
Summary
This paper introduces FiNERweb, a new dataset designed to help computers identify named entities, such as people, organizations, and locations, in text across many different languages.
What's the problem?
Currently, teaching computers to recognize named entities in multiple languages is difficult because good training data is scarce. Existing datasets are often created as side effects of other research rather than being designed for this task, which makes them hard to reuse and difficult to scale to many languages.
What's the solution?
The researchers built a pipeline that uses a 'teacher-student' approach. First, they trained a regression model to find passages of web text (from FineWeb-Edu) that are likely to contain named entities. Then, they used a powerful multilingual language model to automatically label those passages across 91 languages. This produced a dataset of about 225,000 text passages with roughly 235,000 distinct entity labels. They also checked label quality by using another language model as a judge, and found the labels to be highly faithful and complete.
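The two-stage filter-then-annotate idea can be sketched in a few lines of Python. This is a toy illustration, not the paper's implementation: `ner_relevance_score` is a capitalization heuristic standing in for the trained regression model, and `annotate_with_teacher` is a stub where the multilingual LLM teacher call would go; all function names and the threshold are hypothetical.

```python
# Minimal sketch of the teacher-student pipeline, under stated assumptions:
# the real system uses a trained regression model and an LLM teacher.

def ner_relevance_score(passage: str) -> float:
    """Toy stand-in for the regression model: fraction of non-initial
    tokens that look like proper nouns (capitalized)."""
    tokens = passage.split()
    if len(tokens) < 2:
        return 0.0
    caps = sum(1 for tok in tokens[1:] if tok[:1].isupper())
    return caps / (len(tokens) - 1)

def annotate_with_teacher(passage: str) -> list[dict]:
    """Placeholder for the LLM teacher; here it simply tags capitalized
    non-initial tokens as generic candidate entities."""
    spans = []
    for tok in passage.split()[1:]:
        word = tok.strip(".,;:!?")
        if word[:1].isupper():
            spans.append({"text": word, "label": "ENTITY"})
    return spans

def build_dataset(passages: list[str], threshold: float = 0.15) -> list[dict]:
    """Stage 1: keep only passages scored as NER-relevant.
    Stage 2: label the survivors with the teacher."""
    dataset = []
    for p in passages:
        if ner_relevance_score(p) >= threshold:
            dataset.append({"text": p, "entities": annotate_with_teacher(p)})
    return dataset

passages = [
    "Angela Merkel met Emmanuel Macron in Berlin last week.",
    "the quick brown fox jumps over the lazy dog repeatedly.",
]
data = build_dataset(passages)
print(len(data))  # → 1: only the entity-rich passage survives the filter
```

The point of the filtering stage is cost: the expensive teacher model is only run on passages the cheap scorer judges worthwhile, which is what makes the approach scale to 91 languages.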
Why it matters?
FiNERweb is important because it provides a reusable and scalable resource for training computers to recognize named entities in many languages. Models trained on this dataset perform well even on languages they haven't specifically been trained on, and the release highlights a pitfall: evaluating with labels translated into the target language can actually decrease performance, so FiNERweb ships both English and target-language label sets. By releasing the dataset and tools, the researchers hope to spur further advances in multilingual named entity recognition.
Abstract
Recent multilingual named entity recognition (NER) work has shown that large language models (LLMs) can provide effective synthetic supervision, yet such datasets have mostly appeared as by-products of broader experiments rather than as systematic, reusable resources. We introduce FiNERweb, a dataset-creation pipeline that scales the teacher-student paradigm to 91 languages and 25 scripts. Building on FineWeb-Edu, our approach trains regression models to identify NER-relevant passages and annotates them with multilingual LLMs, resulting in about 225k passages with 235k distinct entity labels. Our experiments show that the regression model achieves more than 84 F1, and that models trained on FiNERweb obtain comparable or improved performance in zero-shot transfer settings on English, Thai, and Swahili, despite being trained on 19x less data than strong baselines. In addition, we assess annotation quality using LLM-as-a-judge and observe consistently high scores for both faithfulness (3.99 out of 5) and completeness (4.05 out of 5), indicating reliable and informative annotations. Further, we release the dataset with both English labels and translated label sets in the respective target languages, because we observe that the performance of current state-of-the-art models drops by 0.02 to 0.09 F1 when evaluated using target-language labels instead of English ones. We release FiNERweb together with all accompanying artifacts to the research community in order to facilitate more effective teacher-student training for multilingual named entity recognition.