Top open-source web scraping frameworks for AI and machine learning

A guide to the best open-source web scraping frameworks for AI. Learn which tools support dynamic sites, LLM prompts and ML-friendly data formats.

AI and machine learning models are only as good as the data they’re trained on. Web scraping is still one of the most practical ways to collect that data at scale. However, with so many tools out there, choosing the right one isn’t always straightforward.

This article will explore some of the best open-source web scraping tools. We’ll cover long-standing scraping frameworks, headless browser-based tools and newer AI-powered frameworks that combine scraping with large language models (LLMs). We’ll also cover each tool’s background, supported languages/frameworks and features.

What are open-source web scraping tools?

Open-source web scraping tools are freely available software libraries or frameworks that help you collect data from websites in an automated way. Since they’re open source, anyone can contribute to the codebase. These tools handle tasks like sending requests to web pages, navigating through site structures, parsing HTML or JavaScript-rendered content and storing the extracted data in formats like JSON, CSV or databases.

Scraping tools make it possible to gather raw data and turn it into structured datasets that can be used to train or fine-tune machine learning systems. In many cases, this is the only way to get domain-specific or up-to-date data that matches the real-world conditions a model is meant to operate in.
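As a toy illustration of that last step, here's how scraped records could be serialized to JSON Lines, a format commonly used for fine-tuning datasets. The records and helper name are made up for the example:

```python
import json

# Hypothetical scraped records, e.g. the output of a quotes scraper.
records = [
    {"text": "Simplicity is the ultimate sophistication.", "author": "Leonardo da Vinci"},
    {"text": "Stay hungry, stay foolish.", "author": "Steve Jobs"},
]

def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

jsonl = to_jsonl(records)
```

Each line of the resulting string is an independent JSON object, so downstream tools can stream the dataset without loading it all at once.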

Key criteria for choosing a scraping framework

The web has changed a lot over the years, and web scraping tools have changed with it. Sites are more dynamic, protections are tighter and scraping today involves more than just pulling HTML. Here's what you should consider when picking a scraping framework:

  • Programming Language Support: Use a framework that fits your existing stack. It should work naturally with the language you’re already using, so you’re not stuck hacking things together.
  • Target Page Rendering (Static vs. Dynamic): Some pages give you everything in the initial HTML. Others rely on JavaScript to build the content after the page loads. If you’re dealing with dynamic pages, you’ll need a tool that can run a real browser or handle JavaScript rendering.
  • AI Integration: LLMs have made it much easier to extract structured data. Instead of writing complicated selectors, you can use prompts to identify and extract exactly what you need, saving you a lot of time.
  • Proxy Rotation and CAPTCHA Handling: Rate limits, geo-blocking or CAPTCHAs often protect modern sites. Look for frameworks that support automatic IP rotation, residential and mobile proxies and CAPTCHA solving.

Now that we know what to look for, let’s get into the tools that stand out.

Scrapy

Scrapy is one of the oldest and most reliable open-source web scraping frameworks in the Python ecosystem. Originally developed by Insophia and first released in 2008, it has been maintained by Zyte (formerly Scrapinghub) since 2011.

Scrapy has over 57,000 stars on GitHub and a large contributor base. It’s a solid choice for large web scraping projects and when you need to scrape mostly static web pages such as product catalogs, blog archives and documentation. It’s especially useful when your output is going into machine learning pipelines, retrieval-augmented generation (RAG) systems or search indexes.

Some of its features include:

  • Asynchronous, non-blocking architecture for high performance
  • Built-in support for request retries, throttling and user-agent spoofing
  • First-class data pipeline support (JSON, CSV, databases, etc.)
  • Integrates well with workflow tools like Airflow and Prefect
  • Easy to extend with middleware; integrates with Playwright to scrape JavaScript-heavy websites

How to use Scrapy

Scrapy is designed around the concept of spiders, which are Python classes that define how to navigate a site, what data to extract and how to follow links. It gives you full control over crawling logic without being overly verbose. To get started, first install Scrapy via pip:

pip install scrapy

To create your first spider, you can generate a new project:

scrapy startproject myproject
cd myproject

Then, inside the spiders/ folder, create a spider file like this:

# example.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        for q in response.css("div.quote"):
            yield {
                "text": q.css("span.text::text").get(),
                "author": q.css("small.author::text").get(),
            }
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)

Then run it with:

scrapy runspider example.py -o quotes.json

This script crawls through all quote pages on the site, extracts quote text and author and saves everything in a JSON file. In more advanced setups, you can also enable proxy rotation, retries and structured output with just a few settings.
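Many of those settings live in the project's settings.py. The sketch below uses real Scrapy setting names, but the values and the quotes.json feed name are illustrative choices, not recommendations:

```python
# settings.py -- illustrative Scrapy configuration.
RETRY_ENABLED = True
RETRY_TIMES = 3                    # retry failed requests up to 3 times
AUTOTHROTTLE_ENABLED = True        # adapt request rate to server responsiveness
DOWNLOAD_DELAY = 0.5               # base delay between requests, in seconds
USER_AGENT = "my-dataset-bot/1.0"  # identify your crawler

# Structured output feeds: Scrapy writes items here automatically.
FEEDS = {
    "quotes.json": {"format": "json", "overwrite": True},
}
```

With a feed configured this way, `scrapy crawl quotes` exports items without the `-o` flag.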

Playwright

Playwright is a modern browser automation library maintained by Microsoft. It was released in 2020 as a more capable alternative to Selenium and Puppeteer. Playwright supports Chromium, Firefox and WebKit and works across Python, Node.js, Java and C#. As of 2025, it has over 74,000 GitHub stars and is widely used by scraping teams that need to extract data from dynamic and JavaScript-heavy sites.

Unlike traditional web scrapers, Playwright is built to behave like a real user. It can click through modals, handle redirects, render SPAs and work across different browsers. This makes it especially valuable when scraping lazy-loading feeds, interactive dashboards or other dynamic content.

Playwright is equipped with several key features, such as:

  • Unified support for Chromium, Firefox and WebKit
  • Automatic handling of AJAX, network timing and navigation delays
  • Built-in request interception (block resources, inject headers/tokens)
  • Full control over proxies, user-agent rotation and mobile emulation
  • Works headless or headful, runs locally or in containers
  • Native support for multi-tab and parallel browsing contexts

How to use Playwright

Playwright supports multiple languages and frameworks. To use its Python library, first install Playwright:

pip install playwright
playwright install

You can now create a simple example.py scraping script like this:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")
    titles = [h.inner_text() for h in page.query_selector_all("h1")]
    print(titles)
    browser.close()

Then run it with:

python example.py

This script launches a headless browser, loads the page, extracts all <h1> text, prints the results and closes the browser. In real-world use, you can also extend Playwright to handle things like request interception, retries, proxy rotation and complete page interaction flows.
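As an illustration of request interception, the sketch below blocks heavy resource types so pages load faster. The helper names (should_block, scrape_titles) are hypothetical, while page.route(), route.abort() and route.continue_() are Playwright's actual Python API:

```python
# Resource types worth skipping when you only need text content.
BLOCKED_RESOURCE_TYPES = {"image", "media", "font", "stylesheet"}

def should_block(resource_type: str) -> bool:
    """Decide whether a request should be aborted before it hits the network."""
    return resource_type in BLOCKED_RESOURCE_TYPES

def scrape_titles(url: str) -> list[str]:
    """Load a page with heavy resources blocked and return its <h1> texts."""
    # Imported here so should_block stays usable without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # route() intercepts every request; abort heavy ones, pass the rest through.
        page.route("**/*", lambda route: route.abort()
                   if should_block(route.request.resource_type)
                   else route.continue_())
        page.goto(url)
        titles = [h.inner_text() for h in page.query_selector_all("h1")]
        browser.close()
        return titles
```

Blocking images and fonts this way can cut page load time significantly on media-heavy sites without affecting the extracted text.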

Stagehand (by Browserbase)

Stagehand is an AI-augmented browser automation framework built on top of Playwright. It lets you mix traditional code with natural language commands for scraping and web automation.

Compared to standard Playwright or other web scraping tools that require hardcoded selectors and sequences, Stagehand can delegate parts of a scraping task to an LLM. For example, it can click buttons or extract content based on a written prompt. This hybrid model reduces script breakage and lets you build smarter scrapers.

Its capabilities include:

  • Built-in support for LLM-based interaction (OpenAI, Anthropic, Gemini, etc.)
  • Natural language commands (e.g., page.act(“click the login button”))
  • Schema-based data extraction via prompt + JSON spec
  • Caching and preview tools to reduce LLM calls and improve reliability
  • Full compatibility with Playwright’s API — you can use both side-by-side
  • Works with Browserbase’s hosted infrastructure or self-hosted setups

How to use Stagehand

Scaffold a new project with the quickstart command:

npx create-browser-app

During setup, you can choose an LLM provider and other configuration options. Here's a basic TypeScript example that visits a page, clicks a repo and extracts structured data:

import { Stagehand } from "@browserbasehq/stagehand";
import { z } from "zod"; // Stagehand uses Zod schemas to type extraction output

const stagehand = new Stagehand(/* config */);
await stagehand.init();
await stagehand.page.goto("https://github.com/browserbase");
await stagehand.page.act("click on the Stagehand repo");
const { author, title } = await stagehand.page.extract({
  instruction: "extract the author and title of the repo",
  schema: z.object({
    author: z.string(),
    title: z.string(),
  }),
});

As shown in this example, Stagehand lets you interact with pages and extract data using natural language prompts. You describe the task, for example, “click on the Stagehand repo” or “extract the author and title” and define a schema to shape the output. This way, the data is immediately usable in downstream systems like search, RAG or analytics.

Crawl4AI

Crawl4AI is an open-source and LLM-friendly web scraping solution that launched in 2024 and has quickly gained traction with over 46,000 stars on GitHub. It is designed to generate clean Markdown and structured JSON outputs specifically for AI pipelines like RAG and fine-tuning. It combines fast browser crawling, optional Playwright rendering, heuristic extraction and AI-guided structured parsing in one tool.

Crawl4AI is ideal for tasks like converting blog archives into training-ready text, scraping product data from dynamic sites or building datasets for AI agents, all while handling dynamic content, proxies and session reuse.

Some of its notable features are:

  • Clean Markdown output optimized for AI ingestion
  • Web data extraction via CSS/XPath or LLM-based prompts
  • High-speed asynchronous crawling with parallelism
  • Full browser automation support (Playwright backend) for JS-heavy pages
  • Session management, proxy support and stealth mode
  • Content chunking/filtering strategies (e.g., BM25, pruning)
  • CLI and Python library with Docker deployment options

How to use Crawl4AI

First, install Crawl4AI via pip:

pip install -U crawl4ai
crawl4ai-setup

Below is an example that scrapes data from a dynamic page using Playwright and extracts content using an LLM prompt:

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

async def main():
    llm_strategy = LLMExtractionStrategy(
        provider="openai/gpt-4o-mini",  # provider/model string; API key via env
        instruction="Extract product name, price, and rating",
        schema={
            "name": "string",
            "price": "string",
            "rating": "string"
        }
    )
    run_config = CrawlerRunConfig(extraction_strategy=llm_strategy)
    # Browser options are passed to the crawler itself, not the run config.
    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        result = await crawler.arun("https://example.com/products", config=run_config)
        print(result.extracted_content)

asyncio.run(main())

This example runs a full browser session, interprets the page using an LLM and returns clean, structured data without manually writing selectors.
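Because LLM output is not guaranteed to be well-formed, it's worth validating the extracted content before it enters a pipeline. Below is a minimal, library-agnostic sketch; parse_extracted and REQUIRED_FIELDS are illustrative names, and the fields mirror the schema above:

```python
import json

REQUIRED_FIELDS = {"name", "price", "rating"}

def parse_extracted(raw: str) -> list[dict]:
    """Parse LLM-extracted JSON and keep only complete product records."""
    try:
        items = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return []  # malformed output: return nothing rather than bad rows
    return [i for i in items if isinstance(i, dict) and REQUIRED_FIELDS <= i.keys()]

# A record missing required fields is silently dropped.
sample = '[{"name": "Widget", "price": "$9.99", "rating": "4.5"}, {"name": "Broken"}]'
products = parse_extracted(sample)
```

Filtering at this boundary keeps partial or hallucinated records out of training sets and indexes, where they are much harder to track down later.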

ScrapeGraphAI

ScrapeGraphAI is an open-source and AI-native web scraping library for Python and Node.js. It combines graph-based pipelines with LLMs to extract structured data using plain English prompts. Launched in 2024, it quickly gained over 20,000 GitHub stars, establishing itself as a leader in the AI scraper space.

ScrapeGraphAI is built for users who want a no-selector-needed experience. You declare what you want and the AI figures out the rest. Behind the scenes, it integrates Playwright for dynamic page rendering and outputs clean JSON or Markdown, which is perfect for RAG systems and ML pipelines without manual HTML parsing or extraction logic.

Among its core features are:

  • Scraping data and web pages using natural language prompts instead of writing CSS or XPath selectors
  • Graph-based workflows for handling multi-step or multi-page scraping tasks
  • Integration with LLMs like OpenAI, Ollama and Gemini for intelligent automation
  • JavaScript-heavy pages rendered using Playwright under the hood
  • Outputs returned as structured JSON or Markdown-friendly text
  • Offers Python and Node.js SDKs, along with API access and a browser extension for quick setup

How to use ScrapeGraphAI

To use the ScrapeGraphAI Python library, install it with pip and set up Playwright for browser rendering:

pip install scrapegraphai
playwright install

Once installed, here’s a simple pipeline to extract product details using natural prompts:

from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {"api_key": "YOUR_OPENAI_KEY", "model": "gpt-4o-mini"},
    "headless": True,
}
smart = SmartScraperGraph(
    prompt="Extract product name, price, and rating from this page",
    source="https://example.com/products",
    config=graph_config
)
result = smart.run()
print(result)  # JSON dict with name/price/rating fields

Behind the scenes, it launches a Playwright browser, navigates the page, prompts the LLM to find data and outputs structured results, all with zero selector coding.

Comparing the best web scraping tools for data extraction

Based on everything we’ve covered so far, here’s a side-by-side comparison of the top open-source web scraping tools. This should help you quickly see where each one shines and what trade-offs to expect.

| Tool | Language(s) | Dynamic Content Support | AI Integration | Ideal Use Case | Output Formats | Headless Browser Support |
| --- | --- | --- | --- | --- | --- | --- |
| Scrapy | Python | Limited (via integration) | None (manual only) | Static sites, pipelines, search indexing | JSON, CSV, DBs | Via integration (e.g., Playwright) |
| Playwright | Python, Node.js, Java, C# | Full | None (manual only) | JavaScript-heavy, interactive sites | Any (custom logic) | Native |
| Stagehand | Node.js (TypeScript) | Full | Built-in LLM prompts | Smart scraping with minimal code | JSON (via schema) | Native (via Playwright) |
| Crawl4AI | Python | Full | Built-in LLM extraction | AI dataset generation, clean output | Markdown, JSON | Native (Playwright backend) |
| ScrapeGraphAI | Python, Node.js | Full | Built-in LLM prompts | Zero-selector scraping, RAG-ready output | Markdown, JSON | Native (Playwright backend) |

Conclusion

In this article, you’ve seen some of the best open-source tools for web scraping. We explored traditional frameworks like Scrapy and Playwright and modern AI-powered solutions like Stagehand, Crawl4AI and ScrapeGraphAI. Each tool has its strengths, depending on your stack and what kind of data you’re after.

That said, scraping at scale often takes more than just writing code. You’ll need reliable proxies, CAPTCHA handling and infrastructure that can handle heavy loads without breaking. Many teams choose to skip the operational overhead entirely and use fully managed solutions to handle everything from extraction to delivery.