Consent in Crisis: The Rapid Decline of the AI Data Commons

Shayne Longpre, Robert Mahari, Ariel Lee, Campbell Lund, Hamidah Oderinwale, William Brannon, Nayan Saxena, Naana Obeng-Marnu, Tobin South, Cole Hunter, Kevin Klyman, Christopher Klamm, Hailey Schoelkopf, Nikhil Singh, Manuel Cherep, Ahmad Anis, An Dinh, Caroline Chitongo, Da Yin, Damien Sileo, Deividas Mataciunas, Diganta Misra

2024-07-23

Summary

This paper examines the rapid decline of consent for using public web data to train artificial intelligence (AI) systems. It shows how many websites are now restricting crawling and use of their content, which could affect the development of AI technologies.

What's the problem?

AI systems rely heavily on large amounts of data from the internet to learn and improve. However, many websites are starting to limit or deny permission for their content to be used in AI training. This is a problem because without access to enough diverse and fresh data, the performance and reliability of AI models suffer. The study found that in a single year (2023-2024), significant portions of previously available data were restricted: roughly 5% of all tokens in the C4 corpus, and over a quarter of its most actively maintained sources, are now fully off-limits.

What's the solution?

The authors conducted a comprehensive audit of 14,000 web domains to analyze how consent preferences for using data have changed over time. They found that many websites have added AI-specific clauses to their terms of service and directives to their robots.txt files that limit how AI developers can use their content, and that these two signals are often inconsistent with each other. The paper documents these changes and argues for better protocols that accommodate the growing use of web data for AI training, since current web standards have not kept up with technological change.
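To make the robots.txt side of this concrete, here is a minimal sketch (not the paper's actual audit pipeline) of checking whether a domain's robots.txt disallows a few well-known AI crawler user agents. The agent names and example domain are illustrative assumptions, and the paper's methodology also covers Terms of Service analysis, which a robots.txt check alone cannot capture.

```python
# Minimal sketch, assuming Python 3.9+ and network access: query a domain's
# live robots.txt and report whether a few well-known AI crawler user agents
# may fetch the site root. Agent list and domain are illustrative placeholders.
import urllib.robotparser

# Commonly cited AI crawler user agents (illustrative, not exhaustive).
AI_AGENTS = ["GPTBot", "CCBot", "Google-Extended", "anthropic-ai"]

def ai_crawl_permissions(domain: str) -> dict[str, bool]:
    """Return {agent: may_fetch_root} for each AI crawler user agent."""
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(f"https://{domain}/robots.txt")
    parser.read()  # fetch and parse the live robots.txt
    return {agent: parser.can_fetch(agent, f"https://{domain}/") for agent in AI_AGENTS}

if __name__ == "__main__":
    # Substitute domains of interest; example.com is a placeholder.
    for domain in ["example.com"]:
        print(domain, ai_crawl_permissions(domain))
```

An audit along the paper's lines would run checks like this across many thousands of domains, and over archived snapshots, to measure how restrictions change over time.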

Why it matters?

This research is important because it highlights an emerging crisis in data consent that could significantly affect both commercial and academic AI research. If these restrictions continue, they could close off much of the open web, not only for companies but also for non-profit and academic researchers who depend on it. Understanding these trends is essential for ensuring that future AI systems remain diverse, fresh, and effective.

Abstract

General-purpose artificial intelligence (AI) systems are built on massive swathes of public web data, assembled into corpora such as C4, RefinedWeb, and Dolma. To our knowledge, we conduct the first, large-scale, longitudinal audit of the consent protocols for the web domains underlying AI training corpora. Our audit of 14,000 web domains provides an expansive view of crawlable web data and how consent preferences to use it are changing over time. We observe a proliferation of AI-specific clauses to limit use, acute differences in restrictions on AI developers, as well as general inconsistencies between websites' expressed intentions in their Terms of Service and their robots.txt. We diagnose these as symptoms of ineffective web protocols, not designed to cope with the widespread re-purposing of the internet for AI. Our longitudinal analyses show that in a single year (2023-2024) there has been a rapid crescendo of data restrictions from web sources, rendering ~5%+ of all tokens in C4, or 28%+ of the most actively maintained, critical sources in C4, fully restricted from use. For Terms of Service crawling restrictions, a full 45% of C4 is now restricted. If respected or enforced, these restrictions are rapidly biasing the diversity, freshness, and scaling laws for general-purpose AI systems. We hope to illustrate the emerging crisis in data consent, foreclosing much of the open web, not only for commercial AI, but non-commercial AI and academic purposes.