The Impact of Copyrighted Material on Large Language Models: A Norwegian Perspective

Javier de la Rosa, Vladislav Mikhailov, Lemei Zhang, Freddy Wetjen, David Samuel, Peng Liu, Rolv-Arild Braaten, Petter Mæhlum, Magnus Breder Birkenes, Andrey Kutuzov, Tita Enstad, Svein Arne Brygfjeld, Jon Atle Gulla, Stephan Oepen, Erik Velldal, Wilfred Østgulen, Liljia Øvrelid, Aslak Sira Myhre

2024-12-13

Summary

This paper examines how training on copyrighted material, such as books and newspapers, affects the performance of large language models (LLMs) on Norwegian-language tasks.

What's the problem?

Training AI models to understand and generate language on copyrighted material raises important legal and ethical issues. It is also hard to measure how much this material actually contributes to a model's performance, which makes it difficult to decide whether and how authors should be compensated when their work is used for AI training.

What's the solution?

The authors ran experiments to measure the effect of different types of copyrighted content on LLMs. They built a set of Norwegian datasets containing both copyrighted and non-copyrighted material, trained models on them, and compared how the models performed across a range of tasks. This let them identify which types of content helped and which hurt. They found that books and newspapers improved performance, while works of fiction may reduce it.
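The comparison described above can be pictured as a per-benchmark score difference between a model trained without a given content type and one trained with it. The following is only a minimal sketch of that idea, not the authors' evaluation code: the function name `average_gain`, the task names, and every number are hypothetical placeholders invented for illustration.

```python
from statistics import mean


def average_gain(baseline, variant):
    """Mean per-benchmark score difference (variant minus baseline).

    Both arguments map a benchmark name to a score. Only benchmarks
    present in both mappings are compared.
    """
    shared = sorted(set(baseline) & set(variant))
    if not shared:
        raise ValueError("no benchmarks in common")
    return mean(variant[name] - baseline[name] for name in shared)


# Placeholder scores, invented for illustration only.
# They are NOT results reported in the paper.
base_scores = {"task_a": 0.50, "task_b": 0.40}
with_books = {"task_a": 0.56, "task_b": 0.44}

gain = average_gain(base_scores, with_books)
print(f"average gain from adding the extra data: {gain:+.3f}")
```

A positive value would suggest that the added content helped on average, and a negative value that it hurt. A real study would also need to control for dataset size and training budget, which this sketch ignores.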

Why does it matter?

This research is significant because it provides insights into how copyrighted materials influence AI development. Understanding these effects can help inform policies about compensating authors and creators whose works contribute to training AI systems, ensuring that the rights of content creators are respected while advancing technology.

Abstract

The use of copyrighted materials in training generative language models raises critical legal and ethical questions. This paper presents a framework for and the results of empirically assessing the impact of copyrighted materials on the performance of large language models (LLMs) for Norwegian. We found that both books and newspapers contribute positively when the models are evaluated on a diverse set of Norwegian benchmarks, while fiction works possibly lead to decreased performance. Our experiments could inform the creation of a compensation scheme for authors whose works contribute to AI development.
