AICC: Parse HTML Finer, Make Models Better -- A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser

Ren Ma, Jiantao Qiu, Chao Xu, Pei Chu, Kaiwen Liu, Pengli Ren, Yuan Qu, Jiahui Peng, Linfeng Hou, Mengjie Liu, Lindong Lu, Wenchang Ning, Jia Yu, Rui Min, Jin Shi, Haojiong Chen, Peng Zhang, Wenjian Zhang, Qian Jiang, Zengjie Hu, Guoqiang Yang, Zhenxiang Li

2025-11-25

Summary

This paper focuses on improving the quality of text extracted from websites for use in training large language models, arguing that better extraction is just as important as simply filtering out bad data.

What's the problem?

Currently, when information is pulled from the internet to train AI, the process of converting webpages (HTML) into plain text often messes up important parts of the page like code, formulas, and tables. Existing tools rely on simple rules to guess what's important, and they frequently make mistakes, leading to corrupted data that can hurt the performance of AI models.

What's the solution?

The researchers developed a new system called MinerU-HTML that uses a small but powerful language model to understand the *meaning* of different parts of a webpage. Instead of just looking at how dense the text is, it identifies things like code blocks and formulas and preserves them accurately when converting to a readable format like Markdown. This approach is also designed to be easily scaled up to handle massive amounts of web data.
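The pipeline described above can be sketched in miniature. The snippet below is an illustrative toy, not the authors' code: it flattens an HTML page into a sequence of blocks, labels each block with a semantic category (here a stand-in rule plays the role of the paper's 0.6B-parameter labeling model), and then renders the kept blocks to Markdown in a second stage, preserving code blocks intact. All function and class names are hypothetical.

```python
from html.parser import HTMLParser

# Tags treated as candidate content blocks in this toy example.
BLOCK_TAGS = {"p", "pre", "h1", "h2", "li"}

class BlockCollector(HTMLParser):
    """Flatten an HTML page into a sequence of (tag, text) blocks."""
    def __init__(self):
        super().__init__()
        self.blocks = []
        self._stack = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._stack.append((tag, []))

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1][0] == tag:
            t, parts = self._stack.pop()
            text = "".join(parts).strip()
            if text:
                self.blocks.append((t, text))

    def handle_data(self, data):
        if self._stack:
            self._stack[-1][1].append(data)

def label_block(tag, text):
    """Stand-in labeler: assign each block a semantic category.
    The real system predicts labels like these with a small LM
    rather than rules."""
    if tag == "pre":
        return "code"
    if tag in {"h1", "h2"}:
        return "heading"
    if "subscribe" in text.lower():
        return "boilerplate"  # toy heuristic for illustration only
    return "main"

def to_markdown(blocks):
    """Second stage: render labeled blocks, dropping boilerplate
    and keeping structured elements (code) verbatim."""
    out = []
    for tag, text in blocks:
        label = label_block(tag, text)
        if label == "boilerplate":
            continue
        if label == "code":
            out.append(f"```\n{text}\n```")
        elif label == "heading":
            out.append(f"# {text}")
        else:
            out.append(text)
    return "\n\n".join(out)

html = "<h1>Title</h1><p>Body text.</p><pre>x = 1</pre><p>Subscribe now!</p>"
collector = BlockCollector()
collector.feed(html)
print(to_markdown(collector.blocks))
```

The key design point the paper stresses is the separation of concerns: labeling decides *what* each block is before any Markdown is emitted, so code and formulas are never mangled by text-density rules during conversion.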

Why it matters?

This work demonstrates that the way we extract data from the web has a significant impact on how well AI models learn. By creating a higher-quality dataset (AICC) using MinerU-HTML, they showed that models trained on it perform better than those trained on datasets created with older methods. This highlights the need to focus on improving extraction techniques, not just filtering, when building datasets for AI.

Abstract

While web data quality is crucial for large language models, most curation efforts focus on filtering and deduplication, treating HTML-to-text extraction as a fixed pre-processing step. Existing web corpora rely on heuristic-based extractors like Trafilatura, which struggle to preserve document structure and frequently corrupt structured elements such as formulas, code, and tables. We hypothesize that improving extraction quality can be as impactful as aggressive filtering strategies for downstream performance. We introduce MinerU-HTML, a novel extraction pipeline that reformulates content extraction as a sequence labeling problem solved by a 0.6B-parameter language model. Unlike text-density heuristics, MinerU-HTML leverages semantic understanding and employs a two-stage formatting pipeline that explicitly categorizes semantic elements before converting to Markdown. Crucially, its model-based approach is inherently scalable, whereas heuristic methods offer limited improvement pathways. On MainWebBench, our benchmark of 7,887 annotated web pages, MinerU-HTML achieves 81.8% ROUGE-N F1 compared to Trafilatura's 63.6%, with exceptional structured element preservation (90.9% for code blocks, 94.0% for formulas). Using MinerU-HTML, we construct AICC (AI-ready Common Crawl), a 7.3-trillion-token multilingual corpus from two Common Crawl snapshots. In controlled pretraining experiments where AICC and Trafilatura-extracted TfCC undergo identical filtering, models trained on AICC (62B tokens) achieve 50.8% average accuracy across 13 benchmarks, outperforming TfCC by 1.08 pp, providing direct evidence that extraction quality significantly impacts model capabilities. AICC also surpasses RefinedWeb and FineWeb on key benchmarks. We publicly release MainWebBench, MinerU-HTML, and AICC, demonstrating that HTML extraction is a critical, often underestimated component of web corpus construction.