Massive Web Archive
Billions of web pages captured since 2008, with monthly updates and metadata-rich records for AI and analytics.
Free, large-scale web data repository for AI training, research, and analytics
Founded in 2007, Common Crawl is a free and open repository of global web data.
It provides billions of archived web pages in raw formats like WARC and metadata, making it a powerful resource for AI training, Retrieval-Augmented Generation (RAG), and historical research.
While unstructured and requiring preprocessing, its scale and openness make it invaluable for developers, researchers, and data scientists.
Access raw HTML, WARC files, and metadata including timestamps, MIME types, status codes, and digests for reproducibility.
Retrieve time-based captures of websites for AI benchmarks, reproducibility, and research into the evolution of the web.
Easily query and process data with Python using libraries like warcio for extracting and parsing WARC files.
Leverage crawl overviews, webgraph statistics, AI agent support, and open-source tools built on Common Crawl.
Common Crawl is not a plug-and-play dataset, but a raw, large-scale archive ideal for AI training, reproducibility, and historical research. For teams ready to parse and clean the data, it unlocks nearly two decades of open web history — for free.
Founded in 2007, Common Crawl is one of the largest collections of global internet history. The goal of this project is to create a free and open repository of web data — for anyone to use.
Especially in AI and machine learning, Common Crawl is a goldmine of information. You won’t get a tight spreadsheet of Amazon product data. Common Crawl gives you a seemingly endless list of snapshots — each tagged and labeled using raw HTML, Web ARChive (WARC) files and metadata.
Common Crawl’s data isn’t clean or tightly structured, but it still contains a wealth of information for both training and reference data. You need to know how to use it, but it could power both your training with vast datasets and Retrieval-Augmented Generation (RAG) pipelines with libraries of archive data accessible through simple, standardized queries.

The best way to understand Common Crawl is by using it. In this step-by-step tutorial, you’ll learn how to query Common Crawl’s index for metadata. Then, we’ll slightly tweak our script to download a webpage for viewing — when combined with simple web scraping, this is how you can extract your data.
This is a basic tutorial. We’re not running a production integration. With a basic understanding of Python, the code we use could easily be adapted for any unique data pipeline.
First, we need to install warcio for dealing with WARC files used by Common Crawl. Once you’ve got warcio, you’re ready to work with Common Crawl.
```shell
pip install warcio
```
Now, we’ll make a basic query for Common Crawl metadata. The following snippet comes directly from Common Crawl’s documentation. search_cc_index() is used to query the Common Crawl index for metadata. fetch_page_from_cc() uses that metadata to pull raw content from the index.
```python
import requests
import json

# For parsing URLs:
from urllib.parse import quote_plus

# For parsing WARC records:
from warcio.archiveiterator import ArchiveIterator

# The URL of the Common Crawl Index server
SERVER = 'http://index.commoncrawl.org/'

# The Common Crawl index you want to query
INDEX_NAME = 'CC-MAIN-2024-33'      # Replace with the latest index name

# The URL you want to look up in the Common Crawl index
target_url = 'commoncrawl.org/faq'  # Replace with your target URL

# It's advisable to use a descriptive User-Agent string when developing your own applications.
# This practice aligns with the conventions outlined in RFC 7231. Let's use this simple one:
myagent = 'cc-get-started/1.0 (Example data retrieval script; yourname@example.com)'

# Function to search the Common Crawl Index
def search_cc_index(url):
    encoded_url = quote_plus(url)
    index_url = f'{SERVER}{INDEX_NAME}-index?url={encoded_url}&output=json'
    response = requests.get(index_url, headers={'user-agent': myagent})
    print("Response from server:\r\n", response.text)
    if response.status_code == 200:
        records = response.text.strip().split('\n')
        return [json.loads(record) for record in records]
    else:
        return None

# Function to fetch content from Common Crawl
def fetch_page_from_cc(records):
    for record in records:
        offset, length = int(record['offset']), int(record['length'])
        s3_url = f'https://data.commoncrawl.org/{record["filename"]}'

        # Define the byte range for the request
        byte_range = f'bytes={offset}-{offset+length-1}'

        # Send the HTTP GET request to the S3 URL with the specified byte range
        response = requests.get(
            s3_url,
            headers={'user-agent': myagent, 'Range': byte_range},
            stream=True
        )

        if response.status_code == 206:
            # Use `stream=True` in the call to `requests.get()` to get a raw
            # byte stream, because it's gzip-compressed data.
            # Create an `ArchiveIterator` object directly from `response.raw`,
            # which handles the gzipped WARC content.
            stream = ArchiveIterator(response.raw)
            for warc_record in stream:
                if warc_record.rec_type == 'response':
                    return warc_record.content_stream().read()
        else:
            print(f"Failed to fetch data: {response.status_code}")
            return None

    print("No valid WARC record found in the given records")
    return None

# Search the index for the target URL
records = search_cc_index(target_url)
if records:
    print(f"Found {len(records)} records for {target_url}")

    # Fetch the page content from the first record
    content = fetch_page_from_cc(records)
    if content:
        print(f"Successfully fetched content for {target_url}")
        # You can now process the 'content' variable as needed,
        # using something like Beautiful Soup, etc.
else:
    print(f"No records found for {target_url}")
```
After running the file, you should see output similar to the snippet below. The query returns two JSON objects, each containing fields useful for human or AI review: timestamp, url, mime, status, and more. When you pair this metadata with the raw HTML data, you have a powerful foundation for AI training.

In this case, we received two records. The first has status 200: the crawl succeeded and behaved as expected. The second has status 301, meaning the crawler was redirected.
```
Response from server:
{"urlkey": "org,commoncrawl)/faq", "timestamp": "20240806095848", "url": "https://commoncrawl.org/faq", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "WUECFJDI5IUY53PKTGUFN67AAJPKL32W", "length": "7390", "offset": "143072222", "filename": "crawl-data/CC-MAIN-2024-33/segments/1722640484318.27/warc/CC-MAIN-20240806095414-20240806125414-00893.warc.gz", "languages": "eng", "encoding": "UTF-8"}
{"urlkey": "org,commoncrawl)/faq", "timestamp": "20240806095848", "url": "https://commoncrawl.org/faq/", "mime": "text/html", "mime-detected": "text/html", "status": "301", "digest": "LGGT76I5HYSS6HQT36GPANELE4H7FWR7", "length": "590", "offset": "5510670", "filename": "crawl-data/CC-MAIN-2024-33/segments/1722640484318.27/crawldiagnostics/CC-MAIN-20240806095414-20240806125414-00142.warc.gz", "redirect": "https://commoncrawl.org/faq"}
Found 2 records for commoncrawl.org/faq
Successfully fetched content for commoncrawl.org/faq
```
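Since the index can return redirects and repeated captures alongside successful ones, it often helps to filter records before fetching. Here is a minimal sketch in plain Python; `filter_records` is our own hypothetical helper, and the sample records only mimic the shape of the index response shown above:

```python
def filter_records(records, status='200'):
    """Keep only records with the desired HTTP status, dropping
    duplicate captures that share the same content digest."""
    seen_digests = set()
    filtered = []
    for record in records:
        if record.get('status') != status:
            continue  # skip redirects, errors, etc.
        digest = record.get('digest')
        if digest in seen_digests:
            continue  # an identical payload was already kept
        seen_digests.add(digest)
        filtered.append(record)
    return filtered

# Sample records shaped like the index output above
sample = [
    {'url': 'https://commoncrawl.org/faq', 'status': '200', 'digest': 'WUECF...'},
    {'url': 'https://commoncrawl.org/faq/', 'status': '301', 'digest': 'LGGT7...'},
    {'url': 'https://commoncrawl.org/faq', 'status': '200', 'digest': 'WUECF...'},
]
print(filter_records(sample))  # only the first record survives
```

Feeding the filtered list into `fetch_page_from_cc()` keeps you from downloading redirect stubs or the same payload twice.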
Now, we’ll adjust our code to fetch and save a historical snapshot from Google. In the snippet below, we use a lambda function to sort our results by timestamp. This way, we can get the earliest archived copy.
```python
# Sort results to get the earliest capture (based on timestamp)
records.sort(key=lambda r: r.get('timestamp', '99999999999999'))
```
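This plain string sort works because Common Crawl timestamps use a fixed-width YYYYMMDDHHMMSS format, so lexicographic order matches chronological order, and the `'99999999999999'` default pushes any record missing a timestamp to the end. A quick standalone check with made-up records:

```python
demo_records = [
    {'timestamp': '20240806095848'},
    {'timestamp': '20130518065056'},
    {},  # a record missing its timestamp sorts last
]
demo_records.sort(key=lambda r: r.get('timestamp', '99999999999999'))
print([r.get('timestamp') for r in demo_records])
# → ['20130518065056', '20240806095848', None]
```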
Next, we fetch the file and its metadata. We convert the timestamp into a readable date string, open a file in write-binary (wb) mode to save the content as HTML, and then print the metadata to the console.
```python
from datetime import datetime

# After sorting, the first record is the earliest capture
record = records[0]

timestamp = record.get('timestamp', 'unknown')
date_str = datetime.strptime(timestamp, "%Y%m%d%H%M%S").strftime('%Y-%m-%d_%H-%M-%S')
filename = f"snapshot_{date_str}.html"

# Save HTML to a local file
with open(filename, 'wb') as f:
    f.write(content)

# Print metadata for reproducibility or ML training logs
print(f"\nSaved HTML snapshot as: {filename}")
print(f"- URL: {record['url']}")
print(f"- Timestamp: {timestamp}")
print(f"- Status: {record.get('status')}")
print(f"- MIME Type: {record.get('mime')}")
print(f"- Digest: {record.get('digest')}")
```
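The digest printed above can double as an integrity check. To our understanding, the CDX `digest` field is the Base32-encoded SHA-1 hash of the response payload; if that holds, you can verify a saved snapshot with the standard library alone. A hedged sketch (`payload_digest` and `verify_digest` are our own helpers, not part of any Common Crawl tooling):

```python
import base64
import hashlib

def payload_digest(content: bytes) -> str:
    """Base32-encoded SHA-1 of the payload, matching the CDX 'digest' style."""
    return base64.b32encode(hashlib.sha1(content).digest()).decode('ascii')

def verify_digest(content: bytes, record: dict) -> bool:
    """Compare a fetched payload against the digest from its index record."""
    return payload_digest(content) == record.get('digest')

# A 20-byte SHA-1 hash always encodes to exactly 32 Base32 characters
print(len(payload_digest(b'<html>example</html>')))  # → 32
```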
Here’s our fully updated code. Feel free to run it yourself.
```python
import requests
import json
from urllib.parse import quote_plus
from warcio.archiveiterator import ArchiveIterator
from datetime import datetime

SERVER = 'http://index.commoncrawl.org/'
INDEX_NAME = 'CC-MAIN-2013-20'  # One of the oldest usable indexes
TARGET_URL = 'http://www.google.com'
MY_AGENT = 'cc-snapshot-demo/1.0 (contact@example.com)'

def search_cc_index(url):
    encoded_url = quote_plus(url)
    index_url = f'{SERVER}{INDEX_NAME}-index?url={encoded_url}&output=json'
    response = requests.get(index_url, headers={'user-agent': MY_AGENT})
    if response.status_code == 200:
        records = response.text.strip().split('\n')
        return [json.loads(record) for record in records]
    else:
        print("Failed index query.")
        return None

def fetch_html_from_record(record):
    offset, length = int(record['offset']), int(record['length'])
    s3_url = f'https://data.commoncrawl.org/{record["filename"]}'
    byte_range = f'bytes={offset}-{offset+length-1}'
    response = requests.get(
        s3_url,
        headers={'user-agent': MY_AGENT, 'Range': byte_range},
        stream=True
    )
    if response.status_code == 206:
        for warc_record in ArchiveIterator(response.raw):
            if warc_record.rec_type == 'response':
                return warc_record.content_stream().read()
    return None

def save_html_and_metadata(content, record):
    timestamp = record.get('timestamp', 'unknown')
    date_str = datetime.strptime(timestamp, "%Y%m%d%H%M%S").strftime('%Y-%m-%d_%H-%M-%S')
    filename = f"google_snapshot_{date_str}.html"
    with open(filename, 'wb') as f:
        f.write(content)
    print("\nSaved HTML snapshot as:", filename)
    print("\nMetadata:")
    print(f"- URL: {record['url']}")
    print(f"- Timestamp: {timestamp}")
    print(f"- HTTP Status: {record.get('status')}")
    print(f"- MIME Type: {record.get('mime')}")
    print(f"- Digest: {record.get('digest')}")
    print(f"- File: {record['filename']}")
    print(f"- Offset: {record['offset']}")
    print(f"- Length: {record['length']}")

if __name__ == "__main__":
    records = search_cc_index(TARGET_URL)
    if not records:
        print("No records found.")
        exit()

    print(f"Found {len(records)} records")

    # Sort to ensure we get the earliest one
    records.sort(key=lambda r: r.get('timestamp', '99999999999999'))

    for record in records:
        html = fetch_html_from_record(record)
        if html:
            save_html_and_metadata(html, record)
            break
    else:
        print("No usable WARC response records found.")
```
When running the code, you’ll see output similar to the snippet below.
```
Found 265 records

Saved HTML snapshot as: google_snapshot_2013-05-18_06-50-56.html

Metadata:
- URL: http://www.google.com/
- Timestamp: 20130518065056
- HTTP Status: 200
- MIME Type: text/html
- Digest: L3JTXTPHJMSLVWRXQQ5BOBEN7HSXDP4L
- File: crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00075-ip-10-60-113-184.ec2.internal.warc.gz
- Offset: 492275563
- Length: 5123
```
Open the raw HTML file in your browser and you'll see a 2013 Google page. The styling will be broken: the page links to a Cascading Style Sheets (CSS) file that it can no longer find.
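One way to get an old snapshot rendering closer to its original form is to rewrite relative asset links into absolute URLs pointing at the original site. Below is a minimal standard-library sketch; `absolutize_links` is our own hypothetical helper, and the naive regex is only for illustration (real pages deserve a proper HTML parser such as Beautiful Soup):

```python
import re
from urllib.parse import urljoin

def absolutize_links(html: str, base_url: str) -> str:
    """Rewrite href/src attributes so relative paths point at the live site."""
    def fix(match):
        attr, url = match.group(1), match.group(2)
        return f'{attr}="{urljoin(base_url, url)}"'
    return re.sub(r'(href|src)="([^"]+)"', fix, html)

snippet = '<link rel="stylesheet" href="/styles/main.css">'
print(absolutize_links(snippet, 'http://www.google.com/'))
# → <link rel="stylesheet" href="http://www.google.com/styles/main.css">
```

Note that the live site may no longer host those assets either; for true fidelity you would fetch the CSS from the same crawl.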

Common Crawl doesn't offer Service Level Agreements (SLAs), and official support is minimal. Instead, it is sustained by a community of volunteers who expand the Common Crawl ecosystem for free, along with a variety of collaborating projects. You can view their collaborators on the Common Crawl website.
Their “Getting Started” documentation contains detailed examples showing how to interact with the Common Crawl index. Our example began by copying and pasting their Python directly; we then made a few small changes so we could download a snapshot for viewing.
Crawls use the CC-MAIN-YYYY-WW naming format. Think back to our Google snapshot, CC-MAIN-2013-20: the crawl took place during week 20 of 2013. Each crawl is segmented and split into countless .warc.gz files. The warc part denotes the WARC format we discussed earlier, and gz indicates gzip compression, one of the most common formats in the open-source community.
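Because the index names follow this fixed pattern, you can turn one into an approximate capture date with the standard library. A small sketch (`parse_index_name` is our own hypothetical helper, and it assumes WW is an ISO week number):

```python
from datetime import date

def parse_index_name(name: str) -> date:
    """Return the Monday of the ISO week named by a CC-MAIN-YYYY-WW index."""
    _, _, year, week = name.split('-')
    return date.fromisocalendar(int(year), int(week), 1)

print(parse_index_name('CC-MAIN-2013-20'))  # → 2013-05-13
```

That Monday, 2013-05-13, lines up with our snapshot's timestamp of 2013-05-18, which falls later in the same week.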
Alongside its documentation, file structures, and access methods, Common Crawl offers a variety of other helpful resources.
It's no secret: Common Crawl serves raw data. As you've already seen, it's a little different from structured APIs with JSON feeds. A few tips for working with it:

- Use the `digest` field to filter out duplicate content.
- Use `stream=True` when downloading segments. These files are huge, often gigabytes, so stream them rather than loading a whole response into memory at once.

Here's how Common Crawl compares to other web archive options:

| Feature / Capability | Common Crawl | Bright Data Web Archive API | Internet Archive (Wayback Machine) | Archive-It |
|---|---|---|---|---|
| Access Method | Public S3 + HTTP | REST API | Web interface + limited API | Web-based portal (subscription required) |
| Cost | Free | Commercial / usage-based | Free for basic use | Paid tier for institutions |
| Update Frequency | Monthly | Near real-time or on-demand | Irregular | Custom crawl schedules |
| Targeting Capabilities | Broad domain + URL filtering via index | Precise, on-demand URL targeting | Time-based snapshots only | Domain- or institution-based scoping |
| Structured Output | Raw WARC, metadata and HTML | JSON, CSV, HTML, screenshots | HTML only | Depends on institutional configuration |
| Historical Depth | 2008–present | Depends on user setup | 1996–present | Depends on organization |
| Duplication Control | Manual filtering via digest | Built-in deduplication options | No deduplication | Manual or guided |
| Compliance / Annotation | None (user-responsible) | GDPR/CCPA tools, annotation, screenshot options | None | Curation possible |
| Best For | Training data, massive NLP, long-tail web archives | High-precision snapshots for RAG, AI agent flows | Historical browsing, link tracking | Institutional digital preservation |
Each of these tools solves a different problem. Common Crawl is ideal when you need scale and flexibility. Bright Data offers accuracy and control. The Wayback Machine gives the longest historical range — but not the format control and stability modern pipelines require.
Common Crawl isn’t the right tool for everyone. If you’re working in AI, large-scale analysis or historical web data, Common Crawl just might have what you need.
If you’re looking for curated datasets, Common Crawl isn’t for you. Common Crawl serves web snapshots by the petabyte.
If you’re building AI tools, benchmarking NLP models or extracting long-term patterns from the open web, there’s no better free source. Common Crawl is the raw internet — archived, accessible and waiting to be mined.
Common Crawl isn't a shiny new buzz product; it's a battle-tested icon of internet history. Don't expect neat, plug-and-play datasets: Common Crawl brings quantity over quality. Its metadata gives you a foundation to train on, but if you want an efficient, expert model, you'll likely need to further extract and transform the data.
Common Crawl gives you access to the world’s largest stockpile of web data. It’s not a plug-and-play solution for your AI needs, but it does provide real value. If you can parse and clean the data, you’ve got almost 20 years of internet data — for free.