
Crawl4LLM: Efficient Web Crawling for LLM Pretraining

Shi Yu, Zhiyuan Liu, Chenyan Xiong

2025-02-20


Summary

This paper introduces Crawl4LLM, a new method for collecting high-quality data from the web to train large language models (LLMs) more efficiently. It focuses on finding and crawling only the web pages most useful for pretraining, instead of wasting time and resources on low-quality data.

What's the problem?

Traditional web crawlers collect enormous numbers of web pages, but most of them turn out to be of too low quality for LLM pretraining and are discarded. This wastes compute and bandwidth and puts unnecessary strain on websites, making the process inefficient and unsustainable.

What's the solution?

The researchers developed Crawl4LLM, which uses a smarter rule for deciding which pages to crawl next. Instead of prioritizing a page by its graph connectivity (roughly, how many other pages link to it), Crawl4LLM scores each discovered page by how much it is expected to help LLM pretraining, and the crawler's scheduler always fetches the highest-scoring pages first. Experiments showed that Crawl4LLM matches the downstream training results of older crawling methods while crawling only 21% of the URLs, significantly reducing waste.
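The core scheduling idea can be illustrated with a small best-first crawler. This is a minimal sketch, not the authors' implementation: the toy web graph and the `SCORES` table (a stand-in for a learned pretraining-influence scorer, e.g. a document quality classifier) are assumptions for illustration.

```python
import heapq

# Hypothetical toy web graph: URL -> outlinked URLs (assumption for illustration).
WEB_GRAPH = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com", "d.com"],
    "c.com": ["d.com"],
    "d.com": [],
}

# Stand-in for a pretraining-influence scorer; higher = more useful for
# LLM pretraining. A real system would compute this from page content.
SCORES = {"a.com": 0.9, "b.com": 0.2, "c.com": 0.8, "d.com": 0.5}


def crawl(seeds, budget):
    """Best-first crawl: always fetch the highest-scoring discovered URL,
    rather than ordering the frontier by graph connectivity."""
    # heapq is a min-heap, so negate scores to pop the best page first.
    frontier = [(-SCORES[url], url) for url in seeds]
    heapq.heapify(frontier)
    visited = set(seeds)
    crawled = []
    while frontier and len(crawled) < budget:
        _, url = heapq.heappop(frontier)
        crawled.append(url)
        for link in WEB_GRAPH[url]:
            if link not in visited:
                visited.add(link)
                heapq.heappush(frontier, (-SCORES[link], link))
    return crawled


print(crawl(["a.com"], budget=3))  # → ['a.com', 'c.com', 'd.com']
```

With a crawl budget of 3, the low-scoring `b.com` (0.2) is skipped in favor of `c.com` (0.8) and `d.com` (0.5), which is exactly the effect described above: the budget is spent on the pages expected to matter most for pretraining.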

Why it matters?

This matters because it makes pretraining data collection more efficient and less wasteful. By focusing on high-quality pages, Crawl4LLM reduces the burden on websites and saves computing resources. This approach could help build better AI models faster while being more respectful of internet infrastructure.

Abstract

Web crawl is a main source of large language models' (LLMs) pretraining data, but the majority of crawled web pages are discarded in pretraining due to low data quality. This paper presents Crawl4LLM, an efficient web crawling method that explores the web graph based on the preference of LLM pretraining. Specifically, it leverages the influence of a webpage in LLM pretraining as the priority score of the web crawler's scheduler, replacing the standard graph connectivity based priority. Our experiments on a web graph containing 900 million webpages from a commercial search engine's index demonstrate the efficiency of Crawl4LLM in obtaining high-quality pretraining data. With just 21% URLs crawled, LLMs pretrained on Crawl4LLM data reach the same downstream performances of previous crawls, significantly reducing the crawling waste and alleviating the burdens on websites. Our code is publicly available at https://github.com/cxcscmu/Crawl4LLM.