Dynaword: From One-shot to Continuously Developed Datasets

Kenneth Enevoldsen, Kristian Nørgaard Jensen, Jan Kostkan, Balázs Szabó, Márton Kardos, Kirten Vad, Andrea Blasi Núñez, Gianluca Barmina, Jacob Nielsen, Rasmus Larsen, Peter Vahlstrup, Per Møldrup Dalum, Desmond Elliott, Lukas Galke, Peter Schneider-Kamp, Kristoffer Nielbo

2025-08-05

Summary

This paper introduces Dynaword, a framework for building large language datasets that are updated continuously through community contributions, so that the datasets improve over time.

What's the problem?

Most language datasets are created once and then left unchanged, so they become outdated or limited as language changes and new topics emerge.

What's the solution?

Dynaword lets people add, fix, and improve dataset content on a regular basis, so the dataset grows and stays relevant through community contributions and automatic updates.

Why does it matter?

Large, up-to-date, and diverse language datasets help improve language models and AI systems, making them more accurate and better at understanding current language use.

Abstract

A framework called Dynaword and its implementation Danish Dynaword enable community-driven, open, and continuously updated large-scale natural language datasets.
