Common Crawl

Common Crawl is a US 501(c)(3) nonprofit, founded in 2007 by Gil Elbaz, that publishes monthly open web-crawl datasets. As of 2026 the archive exceeds 10 petabytes, and it is the single largest source of training data for major large language models, including GPT-3, LLaMA, and the precursors to Claude.

Common Crawl is a US-registered nonprofit organization that operates one of the largest open repositories of web crawl data in existence. Founded in 2007 by entrepreneur Gil Elbaz (previously of Applied Semantics, acquired by Google) and chaired in its early years by web pioneer Carl Malamud, it was designed to provide academic and commercial researchers with the kind of web-scale corpus previously available only to large search engines.

Operationally, Common Crawl publishes a new crawl roughly once per month. Each release contains on the order of 2 to 3 billion web pages and is stored on Amazon S3 under the AWS Open Data Sponsorship Program, where it can be downloaded or queried in place at no cost. The cumulative archive exceeds 10 petabytes spanning 2008 to the present. Data is distributed in WARC, WAT, and WET formats (raw HTTP responses, structured metadata, and plain text, respectively).

Common Crawl was originally used mainly by academic researchers in web mining, computational linguistics, and search engine research, with thousands of papers citing it. Its role shifted dramatically after 2020, when OpenAI's GPT-3 paper revealed that a filtered version of Common Crawl accounted for roughly 60% of its weighted training tokens. Subsequent LLMs from Google, Meta, Anthropic, and others rely on Common Crawl directly or via derived datasets such as C4 (Colossal Clean Crawled Corpus), RefinedWeb, and The Pile.

Funding shifted accordingly. In 2023 Anthropic and OpenAI each donated $250,000, and several other AI labs followed. This has prompted concerns about regulatory capture and editorial independence: the same nonprofit that scrapes publishers' content is now substantially funded by the companies that profit from training on it. A November 2025 *Atlantic* investigation reported that Common Crawl had misled publishers about respecting paywalls and honoring takedown requests, intensifying ongoing legal and ethical debate about web-scale scraping for AI training.
Despite these controversies, Common Crawl remains foundational infrastructure: without it, the cost of bootstrapping a competitive frontier LLM would rise substantially, since no comparable open dataset of equivalent scale exists.
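
To make the distribution formats mentioned above more concrete, the following is a minimal sketch of how a WARC-style file (the WET plain-text variant uses the same record layout) can be split into records: each record starts with a `WARC/` version line, carries named headers, and gives its payload size in `Content-Length`. The helper names `read_warc_records` and `make_record`, as well as the sample URLs and text, are illustrative assumptions and not part of any Common Crawl tooling. The sketch reads uncompressed bytes from memory and is not a full implementation of the WARC standard.

```python
import io


def read_warc_records(stream):
    """Yield (headers, payload) pairs from an uncompressed WARC byte stream.

    Hypothetical helper for illustration; it handles only the basic layout:
    version line, header lines, blank line, payload of Content-Length bytes.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected WARC version line, got {line!r}")

        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()

        # The payload is read by length, so its contents are never
        # mistaken for header lines or record boundaries.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


def make_record(record_type, uri, body):
    """Build one synthetic record (made-up sample data, not real crawl output)."""
    payload = body.encode("utf-8")
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {record_type}\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        "Content-Type: text/plain\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + payload + b"\r\n\r\n"


# Two WET-style records; "conversion" is the record type used for extracted text.
SAMPLE = (
    make_record("conversion", "https://example.com/", "Hello, web.")
    + make_record("conversion", "https://example.org/a", "Second page text.")
)

for headers, payload in read_warc_records(io.BytesIO(SAMPLE)):
    print(headers["WARC-Target-URI"], len(payload), "bytes")
```

The published crawl files are gzip-compressed, so a real reader would need to decompress the stream first. For anything beyond experimentation, a dedicated WARC library is likely the more robust choice, since it also covers the parts of the format this sketch ignores.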
