
Project Alexandria: Towards Freeing Scientific Knowledge from Copyright Burdens via LLMs

Christoph Schuhmann, Gollam Rabby, Ameya Prabhu, Tawsif Ahmed, Andreas Hochlehnert, Huu Nguyen, Nick Akinci Heidrich, Ludwig Schmidt, Robert Kaczmarczyk, Sören Auer, Jenia Jitsev, Matthias Bethge

2025-02-27


Summary

This paper proposes a new way to share scientific knowledge without breaking copyright laws. The researchers propose using AI to turn scientific papers into something called Knowledge Units, which contain just the facts without any of the original writing style.

What's the problem?

Scientific knowledge is often locked behind paywalls or restricted by copyright laws. This makes it hard for many people, especially those in developing countries or at smaller institutions, to access important research. Current methods of sharing this information, like summarizing or paraphrasing, aren't always legally safe and often fail to preserve all the important facts.

What's the solution?

The researchers came up with a new idea called Project Alexandria. They use advanced AI (large language models) to read scientific papers and extract just the factual information. This information is then organized into Knowledge Units: structured records that capture the key facts, relationships, and ideas from the original text without copying its actual words or style. They tested this method and found it preserves about 95% of the important information while avoiding copyright issues.
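To make the idea concrete, here is a minimal sketch of what a Knowledge Unit could look like as structured data. The three fields (entities, attributes, relationships) follow the paper's description; the exact schema, field types, and the example sentence are assumptions for illustration, not the paper's actual format.

```python
# A hypothetical Knowledge Unit: structured facts with no stylistic content.
from dataclasses import dataclass, field


@dataclass
class KnowledgeUnit:
    """Facts extracted from a passage, stripped of the original wording."""
    entities: list[str] = field(default_factory=list)          # things the passage discusses
    attributes: dict[str, str] = field(default_factory=dict)   # entity -> property
    relationships: list[tuple[str, str, str]] = field(default_factory=list)  # (subject, relation, object)


# Facts an LLM might extract from a sentence like:
# "AlexNet, introduced in 2012, outperformed prior models on ImageNet."
unit = KnowledgeUnit(
    entities=["AlexNet", "ImageNet"],
    attributes={"AlexNet": "introduced in 2012"},
    relationships=[("AlexNet", "outperformed prior models on", "ImageNet")],
)
```

The point of a structure like this is that the facts survive while the copyrightable expression (word choice, sentence structure) is discarded, which is what the paper's legal argument rests on.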

Why it matters?

This research could change how scientific knowledge is shared and reused. By making it easier to access and reuse important scientific facts without worrying about copyright, it could speed up research and education worldwide. It's especially important for people who currently can't afford access to many scientific journals. The researchers also released open-source tools that anyone can use to convert scientific papers into Knowledge Units, which could help democratize access to scientific information while still respecting the original authors' rights.

Abstract

Paywalls, licenses and copyright rules often restrict the broad dissemination and reuse of scientific knowledge. We take the position that it is both legally and technically feasible to extract the scientific knowledge in scholarly texts. Current methods, like text embeddings, fail to reliably preserve factual content, and simple paraphrasing may not be legally sound. We urge the community to adopt a new idea: convert scholarly documents into Knowledge Units using LLMs. These units use structured data capturing entities, attributes and relationships without stylistic content. We provide evidence that Knowledge Units: (1) form a legally defensible framework for sharing knowledge from copyrighted research texts, based on legal analyses of German copyright law and U.S. Fair Use doctrine, and (2) preserve most (~95%) factual knowledge from original text, measured by MCQ performance on facts from the original copyrighted text across four research domains. Freeing scientific knowledge from copyright promises transformative benefits for scientific research and education by allowing language models to reuse important facts from copyrighted text. To support this, we share open-source tools for converting research documents into Knowledge Units. Overall, our work posits the feasibility of democratizing access to scientific knowledge while respecting copyright.