Founded in 2007, Common Crawl is one of the largest open collections of web data in internet history. The project’s goal is to build a free and open repository of web data — for anyone to use.
For AI and machine learning work especially, Common Crawl is a goldmine of information. You won’t get a tidy spreadsheet of Amazon product data. Instead, Common Crawl gives you a seemingly endless series of web snapshots, each stored as raw HTML inside Web ARChive (WARC) files and tagged with metadata.
Common Crawl’s data isn’t clean or tightly structured, but it still contains a wealth of information for both training and reference data. You need to know how to work with it, but it can feed model training with vast datasets and power Retrieval-Augmented Generation (RAG) pipelines with libraries of archive data accessible through simple, standardized queries.

Data types, scale and integration workflow basics
The best way to understand Common Crawl is by using it. In this step-by-step tutorial, you’ll learn how to query Common Crawl’s index for metadata. Then, we’ll slightly tweak our script to download a webpage for viewing — when combined with simple web scraping, this is how you can extract your data.
This is a basic tutorial. We’re not running a production integration. With a basic understanding of Python, the code we use could easily be adapted for any unique data pipeline.
Getting started
First, we need to install warcio, the library we’ll use to read Common Crawl’s WARC files. Once it’s installed, you’re ready to work with Common Crawl.
```shell
pip install warcio
```
Now, we’ll make a basic query for Common Crawl metadata. The following snippet comes directly from Common Crawl’s documentation. search_cc_index() queries the Common Crawl index for metadata. fetch_page_from_cc() uses that metadata to pull the raw content from Common Crawl’s archive storage.
```python
import requests
import json

# For parsing URLs:
from urllib.parse import quote_plus

# For parsing WARC records:
from warcio.archiveiterator import ArchiveIterator

# The URL of the Common Crawl Index server
SERVER = 'http://index.commoncrawl.org/'

# The Common Crawl index you want to query
INDEX_NAME = 'CC-MAIN-2024-33'      # Replace with the latest index name

# The URL you want to look up in the Common Crawl index
target_url = 'commoncrawl.org/faq'  # Replace with your target URL

# It's advisable to use a descriptive User-Agent string when developing your own applications.
# This practice aligns with the conventions outlined in RFC 7231. Let's use this simple one:
myagent = 'cc-get-started/1.0 (Example data retrieval script; yourname@example.com)'

# Function to search the Common Crawl Index
def search_cc_index(url):
    encoded_url = quote_plus(url)
    index_url = f'{SERVER}{INDEX_NAME}-index?url={encoded_url}&output=json'
    response = requests.get(index_url, headers={'user-agent': myagent})
    print("Response from server:\r\n", response.text)
    if response.status_code == 200:
        records = response.text.strip().split('\n')
        return [json.loads(record) for record in records]
    else:
        return None

# Function to fetch content from Common Crawl
def fetch_page_from_cc(records):
    for record in records:
        offset, length = int(record['offset']), int(record['length'])
        s3_url = f'https://data.commoncrawl.org/{record["filename"]}'

        # Define the byte range for the request
        byte_range = f'bytes={offset}-{offset+length-1}'

        # Send the HTTP GET request to the S3 URL with the specified byte range
        response = requests.get(
            s3_url,
            headers={'user-agent': myagent, 'Range': byte_range},
            stream=True
        )
        if response.status_code == 206:
            # Use `stream=True` in the call to `requests.get()` to get a raw
            # byte stream, because it's gzip compressed data.
            # Create an `ArchiveIterator` object directly from `response.raw`,
            # which handles the gzipped WARC content.
            stream = ArchiveIterator(response.raw)
            for warc_record in stream:
                if warc_record.rec_type == 'response':
                    return warc_record.content_stream().read()
        else:
            print(f"Failed to fetch data: {response.status_code}")
            return None
    print("No valid WARC record found in the given records")
    return None

# Search the index for the target URL
records = search_cc_index(target_url)
if records:
    print(f"Found {len(records)} records for {target_url}")

    # Fetch the page content from the first record
    content = fetch_page_from_cc(records)
    if content:
        print(f"Successfully fetched content for {target_url}")
        # You can now process the 'content' variable as needed,
        # using something like Beautiful Soup, etc.
else:
    print(f"No records found for {target_url}")
```
After running the file, you should get output similar to the snippet below. We get two JSON objects, each containing fields useful for human or AI review: timestamp, url, mime, status and so on. When you pair this metadata with the raw HTML data, you’ve got a powerful foundation for AI training.
In this case, we received two objects. The first has status 200: the crawl succeeded and the page behaved as expected. The second has status 301: the crawler was redirected.
- Status 200: The crawler successfully retrieved the page content. The page is available just as it was at the time of the snapshot.
- Status 301: The crawler received a redirect, in this case from the trailing-slash URL https://commoncrawl.org/faq/ to the canonical https://commoncrawl.org/faq. This record contains no page data, but it does capture how the site handled crawler traffic at the moment of the snapshot.
```
Response from server:
{"urlkey": "org,commoncrawl)/faq", "timestamp": "20240806095848", "url": "https://commoncrawl.org/faq", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "WUECFJDI5IUY53PKTGUFN67AAJPKL32W", "length": "7390", "offset": "143072222", "filename": "crawl-data/CC-MAIN-2024-33/segments/1722640484318.27/warc/CC-MAIN-20240806095414-20240806125414-00893.warc.gz", "languages": "eng", "encoding": "UTF-8"}
{"urlkey": "org,commoncrawl)/faq", "timestamp": "20240806095848", "url": "https://commoncrawl.org/faq/", "mime": "text/html", "mime-detected": "text/html", "status": "301", "digest": "LGGT76I5HYSS6HQT36GPANELE4H7FWR7", "length": "590", "offset": "5510670", "filename": "crawl-data/CC-MAIN-2024-33/segments/1722640484318.27/crawldiagnostics/CC-MAIN-20240806095414-20240806125414-00142.warc.gz", "redirect": "https://commoncrawl.org/faq"}
Found 2 records for commoncrawl.org/faq
Successfully fetched content for commoncrawl.org/faq
```
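Because the index returns plain dictionaries, you can filter records in ordinary Python before fetching anything. Here’s a minimal sketch that keeps only successful (status 200) captures; the hard-coded records are trimmed, illustrative versions of the output above, not live index data.

```python
# Trimmed example records, modeled on real Common Crawl index output
records = [
    {"url": "https://commoncrawl.org/faq", "status": "200", "timestamp": "20240806095848"},
    {"url": "https://commoncrawl.org/faq/", "status": "301", "timestamp": "20240806095848"},
]

# Index fields arrive as strings, so compare against the string "200"
successful = [r for r in records if r.get("status") == "200"]

print(f"{len(successful)} of {len(records)} records are fetchable")
for r in successful:
    print(r["url"], r["timestamp"])
```

In a real pipeline you would run this filter on the list returned by search_cc_index() before passing records to fetch_page_from_cc(), skipping redirects and error captures entirely.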
Fetching and saving a historical snapshot
Now, we’ll adjust our code to fetch and save a historical snapshot from Google. In the snippet below, we use a lambda function to sort our results by timestamp. This way, we can get the earliest archived copy.
```python
# Sort results to get the earliest capture (based on timestamp)
records.sort(key=lambda r: r.get('timestamp', '99999999999999'))
```
Now, we’ll fetch the file and metadata. We convert the timestamp into a readable date string, write the content to an HTML file in write-binary ('wb') mode and then print the metadata to the console.
```python
from datetime import datetime

timestamp = record.get('timestamp', 'unknown')
date_str = datetime.strptime(timestamp, "%Y%m%d%H%M%S").strftime('%Y-%m-%d_%H-%M-%S')
filename = f"snapshot_{date_str}.html"

# Save HTML to a local file
with open(filename, 'wb') as f:
    f.write(content)

# Print metadata for reproducibility or ML training logs
print(f"\nSaved HTML snapshot as: {filename}")
print(f"- URL: {record['url']}")
print(f"- Timestamp: {timestamp}")
print(f"- Status: {record.get('status')}")
print(f"- MIME Type: {record.get('mime')}")
print(f"- Digest: {record.get('digest')}")
```
Here’s our fully updated code. Feel free to run it yourself.
```python
import requests
import json
from urllib.parse import quote_plus
from warcio.archiveiterator import ArchiveIterator
from datetime import datetime

SERVER = 'http://index.commoncrawl.org/'
INDEX_NAME = 'CC-MAIN-2013-20'  # One of the oldest usable indexes
TARGET_URL = 'http://www.google.com'
MY_AGENT = 'cc-snapshot-demo/1.0 (contact@example.com)'

def search_cc_index(url):
    encoded_url = quote_plus(url)
    index_url = f'{SERVER}{INDEX_NAME}-index?url={encoded_url}&output=json'
    response = requests.get(index_url, headers={'user-agent': MY_AGENT})
    if response.status_code == 200:
        records = response.text.strip().split('\n')
        return [json.loads(record) for record in records]
    else:
        print("Failed index query.")
        return None

def fetch_html_from_record(record):
    offset, length = int(record['offset']), int(record['length'])
    s3_url = f'https://data.commoncrawl.org/{record["filename"]}'
    byte_range = f'bytes={offset}-{offset+length-1}'
    response = requests.get(
        s3_url,
        headers={'user-agent': MY_AGENT, 'Range': byte_range},
        stream=True
    )
    if response.status_code == 206:
        for warc_record in ArchiveIterator(response.raw):
            if warc_record.rec_type == 'response':
                return warc_record.content_stream().read()
    return None

def save_html_and_metadata(content, record):
    timestamp = record.get('timestamp', 'unknown')
    date_str = datetime.strptime(timestamp, "%Y%m%d%H%M%S").strftime('%Y-%m-%d_%H-%M-%S')
    filename = f"google_snapshot_{date_str}.html"
    with open(filename, 'wb') as f:
        f.write(content)
    print("\nSaved HTML snapshot as:", filename)
    print("\nMetadata:")
    print(f"- URL: {record['url']}")
    print(f"- Timestamp: {timestamp}")
    print(f"- HTTP Status: {record.get('status')}")
    print(f"- MIME Type: {record.get('mime')}")
    print(f"- Digest: {record.get('digest')}")
    print(f"- File: {record['filename']}")
    print(f"- Offset: {record['offset']}")
    print(f"- Length: {record['length']}")

if __name__ == "__main__":
    records = search_cc_index(TARGET_URL)
    if not records:
        print("No records found.")
        exit()
    print(f"Found {len(records)} records")

    # Sort to ensure we get the earliest one
    records.sort(key=lambda r: r.get('timestamp', '99999999999999'))

    for record in records:
        html = fetch_html_from_record(record)
        if html:
            save_html_and_metadata(html, record)
            break
    else:
        print("No usable WARC response records found.")
```
When running the code, you’ll see output similar to the snippet below.
```
Found 265 records

Saved HTML snapshot as: google_snapshot_2013-05-18_06-50-56.html

Metadata:
- URL: http://www.google.com/
- Timestamp: 20130518065056
- HTTP Status: 200
- MIME Type: text/html
- Digest: L3JTXTPHJMSLVWRXQQ5BOBEN7HSXDP4L
- File: crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00075-ip-10-60-113-184.ec2.internal.warc.gz
- Offset: 492275563
- Length: 5123
```
Open the raw HTML file in your browser and you’ll see a 2013 Google page. The styling will be broken: the page links to a Cascading Style Sheets (CSS) file that it can no longer find.
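Even with broken styling, the saved snapshot is perfectly usable for extraction. As a minimal sketch using only the standard library (no Beautiful Soup dependency), here’s how you might pull the title and link targets out of a snapshot; the hard-coded HTML string is a stand-in for the contents of your saved file.

```python
from html.parser import HTMLParser

# Stand-in for the contents of a saved snapshot_*.html file
html_doc = ('<html><head><title>Google</title></head>'
            '<body><a href="/intl/en/about.html">About</a></body></html>')

class SnapshotParser(HTMLParser):
    """Collects the page title and all href targets from an HTML document."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True
        elif tag == 'a':
            self.links.extend(v for k, v in attrs if k == 'href')

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title = data

parser = SnapshotParser()
parser.feed(html_doc)
print(parser.title)   # page title
print(parser.links)   # all href targets found
```

For production-scale extraction you would likely swap this for Beautiful Soup or lxml, which tolerate the malformed HTML common in older crawl data far better than a hand-rolled parser.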

Community support, documentation and processing tips
Common Crawl doesn’t offer Service Level Agreements (SLAs), and official help is minimal. The project is supported by a community of volunteers who work to expand the Common Crawl ecosystem for free, along with a variety of partner projects. You can view its collaborators on the Common Crawl website.
Documentation and dataset structure
Their “Getting Started” documentation contains detailed examples showing how to interact with the Common Crawl index. Our example here began by copying and pasting their Python directly; we changed only a few details so that we could download a snapshot for viewing.
Crawls use the CC-MAIN-YYYY-WW format, where YYYY is the year and WW is the week of the crawl. Think back to our Google snapshot, CC-MAIN-2013-20: that crawl took place during week 20 of 2013. These snapshots get segmented and split into countless .warc.gz files. warc denotes the WARC format we talked about earlier; gz indicates a gzip-compressed file, one of the most common formats used in the open source community.
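As a quick illustration of the naming scheme, here’s a small sketch that decodes an index name into its year and week, and computes the Monday that week begins on (assuming the WW component lines up with the ISO week number, which matches our 2013 snapshot):

```python
from datetime import date

index_name = "CC-MAIN-2013-20"

# Split "CC-MAIN-YYYY-WW" into its components
_, _, year, week = index_name.split("-")
year, week = int(year), int(week)

# date.fromisocalendar gives the first day (Monday) of that ISO week
week_start = date.fromisocalendar(year, week, 1)
print(f"{index_name}: week {week} of {year}, starting {week_start}")
# → CC-MAIN-2013-20: week 20 of 2013, starting 2013-05-13
```

Note how this lines up with the snapshot we fetched earlier: its timestamp, 20130518065056, falls on the Saturday of that same week.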
Alongside its documentation, file structures and access methods, Common Crawl offers a variety of other helpful resources.
- Crawl Overview: View crawl summaries and overviews dating all the way back to 2008.
- Webgraph Statistics: View webgraph statistics to learn about Common Crawl’s operational history.
- Crawl Statistics: View individual stats from each crawl — number of pages, domain distribution and more.
- Errata: A running list of known errors and issues in Common Crawl’s datasets.
- AI Agent: Chat in real-time with Common Crawl’s AI agent for assistance with data retrieval. Designed to help you use Common Crawl.
- Blog: Read the latest stats, news and analysis.
- Examples: Use open source tools built on top of Common Crawl. Someone else already did the heavy lifting for integration — to make Common Crawl easier for you.
Processing tips
It’s no secret: Common Crawl serves raw data. As you’ve already seen, it’s a little different from structured APIs with JSON feeds.
- Use the `digest` field to filter out duplicate content.
- Always verify the content. Don’t just assume that `mime` fields are accurate.
- Expect broken pages and malformed HTML. Almost every webpage links to external files that no longer exist.
- Use `stream=True` when downloading segments. These files are huge, often gigabytes, and streaming lets you process them without holding an entire file in memory.
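The first tip above can be sketched in a few lines: records with identical content share a `digest`, so keeping the first record per digest drops exact duplicates. The records here are trimmed, illustrative examples rather than live index output.

```python
# Trimmed example records: the first two captures share a digest,
# meaning their content is byte-identical
records = [
    {"url": "https://example.com/a", "digest": "AAA111", "timestamp": "20240101000000"},
    {"url": "https://example.com/a", "digest": "AAA111", "timestamp": "20240201000000"},
    {"url": "https://example.com/b", "digest": "BBB222", "timestamp": "20240101000000"},
]

seen = set()
unique_records = []
for record in records:
    digest = record.get("digest")
    if digest not in seen:
        seen.add(digest)
        unique_records.append(record)

print(f"Kept {len(unique_records)} of {len(records)} records")
```

Sorting by timestamp before deduplicating (as in the snapshot script earlier) lets you control whether the earliest or latest capture of each page survives.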
Comparative review with major archive APIs/providers (workflow, data access and fit)
| Feature / Capability | Common Crawl | Bright Data Web Archive API | Internet Archive (Wayback Machine) | Archive-It |
|---|---|---|---|---|
| Access Method | Public S3 + HTTP | REST API | Web interface + limited API | Web-based portal (subscription required) |
| Cost | Free | Commercial / usage-based | Free for basic use | Paid tier for institutions |
| Update Frequency | Monthly | Near real-time or on-demand | Irregular | Custom crawl schedules |
| Targeting Capabilities | Broad domain + URL filtering via index | Precise, on-demand URL targeting | Time-based snapshots only | Domain- or institution-based scoping |
| Structured Output | Raw WARC, metadata and HTML | JSON, CSV, HTML, screenshots | HTML only | Depends on institutional configuration |
| Historical Depth | 2008–present | Depends on user setup | 1996–present | Depends on organization |
| Duplication Control | Manual filtering via digest | Built-in deduplication options | No deduplication | Manual or guided |
| Compliance / Annotation | None (user-responsible) | GDPR/CCPA tools, annotation, screenshot options | None | Curation possible |
| Best For | Training data, massive NLP, long-tail web archives | High-precision snapshots for RAG, AI agent flows | Historical browsing, link tracking | Institutional digital preservation |
Each of these tools solves a different problem. Common Crawl is ideal when you need scale and flexibility. Bright Data offers accuracy and control. The Wayback Machine gives the longest historical range — but not the format control and stability modern pipelines require.
Pros, cons and best-fit use cases
Pros and cons
Pros
- Free and open: No paywalls or licensing fees.
- Massive scale: Billions of web pages across decades of internet history.
- Metadata: Digest, timestamp, MIME type, status code and more.
- Great for training AI: Unstructured data sources with tagged metadata provide you with a stockpile of semi-structured datasets to train on.
Cons
- Official support: No SLAs, no direct help unless you engage the community.
- Learning curve: Requires custom logic to parse and prepare data. Not unlike traditional web scraping.
- Malformed pages: Many pages rely on external resources that are now broken or missing.
Use cases
Common Crawl isn’t the right tool for everyone. If you’re working in AI, large-scale analysis or historical web data, Common Crawl just might have what you need.
- AI Training: AI pretraining usually relies on stockpiles of data. That’s exactly what you get with Common Crawl.
- Reproducibility: Live websites change all the time. Common Crawl’s pages are frozen in time. These pages are fixed variables for testing and benchmarks.
- Research: Common Crawl lets you view a time-lapse of how the internet evolved. This is good for data teams and historians alike.
Bottom-line evaluation and recommendations
If you’re looking for curated datasets, Common Crawl isn’t for you. Common Crawl serves web snapshots by the petabyte.
If you’re building AI tools, benchmarking NLP models or extracting long-term patterns from the open web, there’s no better free source. Common Crawl is the raw internet — archived, accessible and waiting to be mined.
- Use Common Crawl for training AI models, building custom RAG pipelines or researching web history.
- Don’t use it if you need real-time updates, clean data or guaranteed uptime.
- Pair it with a scraping engine, parsing pipeline or archive API for best results.
Conclusion
Common Crawl isn’t a shiny new buzz product; it’s a battle-tested icon of internet history. Don’t reach for it expecting neat, plug-and-play datasets: Common Crawl brings quantity over quality. Its metadata gives you a foundation to train on, but if you want an efficient, expert model, you’ll likely need to further extract and transform the data.
Common Crawl gives you access to the world’s largest stockpile of web data. It’s not a plug-and-play solution for your AI needs, but it does provide real value. If you can parse and clean the data, you’ve got almost 20 years of internet data — for free.