
From raw HTML to AI-ready insights: The intelligent data transformation capabilities of modern web infrastructure

In this guide, we'll go over the differences between traditional hardcoded pipelines and the emerging intelligent pipelines of the future.

Data pipelines are the backbone of modern web data infrastructure. For years, businesses relied on hardcoded, rule-based pipelines to scrape and process information from websites. While effective at the time, these traditional methods are brittle, hard to scale, and often break when websites change.
Today, intelligent data pipelines powered by AI are transforming how raw HTML gets converted into structured, AI-ready insights. In this guide, we’ll explore the key differences between traditional and intelligent pipelines, why HTML data needs transformation, and how the future of web data extraction is evolving.

By the end of this guide, you’ll be able to answer the following questions.

  • What is HTML?
  • Why isn’t HTML data suitable for analysis by people or AI models?
  • How does HTML data get transformed into actionable insights?
  • What are the differences between traditional and intelligent data pipelines?
  • Where are web development and data extraction heading in the future?

What is raw HTML?

Websites are rendered using HyperText Markup Language (HTML). HTML is meant to be read by a browser and a web developer, not regular people or AI models. When represented in HTML, data objects are structured loosely with almost no context. The layout is meant for the browser to read and render a product on the page — most of what you see is instructions for a machine, not readable data.

Take a look at the HTML object below.

<article class="product_pod">
    <div class="image_container">
    <a href="a-light-in-the-attic_1000/index.html">
        <img src="../media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg" alt="A Light in the Attic" class="thumbnail">
    </a>
    </div>
    <p class="star-rating Three">
        <i class="icon-star"></i>
        <i class="icon-star"></i>
        <i class="icon-star"></i>
        <i class="icon-star"></i>
        <i class="icon-star"></i>
    </p>
    <h3>
        <a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...
        </a>
    </h3>
    <div class="product_price">
        <p class="price_color">£51.77</p>
        <p class="instock availability">
            <i class="icon-ok"></i>In stock
        </p>
        <form>
            <button type="submit" class="btn btn-primary btn-block" data-loading-text="Adding...">Add to basket</button>
        </form>
</div>
</article>

This HTML object is from https://books.toscrape.com. You can see how it’s actually rendered in the image below. Of all the text you see in the object, only a few pieces of relevant data are inside this massive block of HTML.

  • Book title
  • Image
  • Rating
  • Price
  • Availability
Actual browser rendering of "A Light in the Attic" from Books to Scrape, the raw HTML data you just saw.

HTML data contains tons of text used for markup and rendering by the browser. Digging through HTML is inefficient and confusing, even to the trained eye. When AI models look at this information, they can pick things out, but data often gets missed and relationships are almost never inferred.
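As a taste of the transformation the rest of this guide builds toward, here is a minimal sketch that pulls those five fields out of a trimmed-down version of the snippet above using BeautifulSoup. The selectors are our own choices for this page, not a general recipe:

```python
from bs4 import BeautifulSoup

# a trimmed-down version of the product snippet shown above
html = """
<article class="product_pod">
    <img src="../media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"
         alt="A Light in the Attic" class="thumbnail">
    <p class="star-rating Three"></p>
    <h3><a title="A Light in the Attic">A Light in the ...</a></h3>
    <p class="price_color">£51.77</p>
    <p class="instock availability">In stock</p>
</article>
"""

soup = BeautifulSoup(html, "html.parser")
book = {
    "title": soup.select_one("h3 > a")["title"],
    "image": soup.select_one("img")["src"],
    # the rating hides in the second CSS class: "star-rating Three"
    "rating": soup.select_one("p.star-rating")["class"][1],
    "price": soup.select_one("p.price_color").text,
    "availability": soup.select_one("p.availability").text.strip(),
}
print(book["title"], book["rating"], book["price"])
```

Five lines of dictionary-building turn the markup into something both a human and a model can reason about at a glance.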

Why isn’t it suitable for AI?

After viewing this book as raw HTML, it’s easy to see why it’s not suitable for humans. To understand why it doesn’t work for AI, we need a better understanding of how AI models process data. At the heart of this is a process called vectorization. A vector is essentially a list. Take a look at the vector below.

["i", "am", "a", "vector"]

Vectors can even hold other vectors. Take a look at the example below: we now have a two-dimensional vector.

[
    ["i", "am", "a", "vector"],
    ["i", "am", "another", "vector"]
]

Even tabular data (CSV files, SQL tables, and dataframes) is essentially just vectors of vectors. In this next example, you can likely see the beginnings of a table.

[
    ["word1", "word2", "word3", "word4"],
    ["i", "am", "a", "vector"],
    ["i", "am", "another", "vector"]
]

When we add some rendering sorcery, this same vector of vectors becomes a table familiar to us humans. Conceptually, we’ve still got a vector holding multiple vectors here. The first vector on the list represents our column names. Each subsequent one represents a row in the table.

"word1"    "word2"    "word3"    "word4"
"i"        "am"       "a"        "vector"
"i"        "am"       "another"  "vector"

AI models like vectorized data. Vectors let models index data quickly and surface patterns, which is inefficient and often impossible with raw HTML. Where we see a table, a machine sees a vector of vectors it can analyze very quickly.
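To make the idea concrete, here is the same vector of vectors in Python, reshaped into the list-of-records form that most data tools expect:

```python
# the "vector of vectors" from the text: the first inner list holds the
# column names, each following list is a row
table = [
    ["word1", "word2", "word3", "word4"],
    ["i", "am", "a", "vector"],
    ["i", "am", "another", "vector"],
]

header, *rows = table
# zip each row with the header to get one dict per row -- the shape that
# JSON files, dataframes, and SQL inserts all map onto naturally
records = [dict(zip(header, row)) for row in rows]
print(records[0])
```

Once data reaches this shape, loading it into a dataframe, a database, or a model's context window is a one-liner.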

Intelligent capabilities of modern web data infrastructure

Modern web data infrastructure provides numerous benefits we couldn’t even imagine in web scraping just a few years ago. Newer AI agents function with near-autonomy. What was once 50 lines of code is now executed after a short conversational prompt. Brittle selectors are now replaced by semantic logic. Retries can now be executed with dynamic logic changes — first try: “some-selector”, next try: “some-other-selector.” Model Context Protocol (MCP) servers allow AI agents to control full toolsets and execute instructions using them.

Take a look at the main benefits that intelligent extraction brings to the industry.

  • Self-healing scripts: When something goes wrong during extraction, AI agents can try other methods and use semantic understanding to target the proper data.
  • Tool plug-ins: AI agents can plug into external tools and control them with a degree of autonomy. Tell the agent what to do, and it figures everything out from there.
  • Flexible schema: Generative models can follow almost any schema you give them. Feed any data into the model and get ready-to-use custom data structures in the output.
  • Natural language processing (NLP): Agents are now programmed using natural language. Words are all you need to build an effective data pipeline.
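The self-healing idea can be approximated even without an LLM: try a chain of candidate selectors until one matches. The selector names below are made up for illustration; a real agent would generate the fallbacks semantically on the fly:

```python
from bs4 import BeautifulSoup

def select_with_fallback(soup, candidate_selectors):
    """Try each candidate selector in order; return the first match, else None."""
    for selector in candidate_selectors:
        element = soup.select_one(selector)
        if element is not None:
            return element
    return None

# imagine the site renamed its price class from "price_color" to "new-price"
soup = BeautifulSoup('<p class="new-price">£51.77</p>', "html.parser")
# hypothetical retry chain: first the old selector, then an alternative
price = select_with_fallback(soup, ["p.price_color", "p.new-price"])
print(price.text)
```

A hardcoded pipeline fails the moment `p.price_color` disappears; the fallback chain keeps extracting, which is the property intelligent agents generalize with semantic understanding.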

Common use cases of AI-ready web data

AI-ready pipelines are already powering innovation across global industries. Once raw HTML gets transformed into machine-readable structured data, the possibilities compound.

  • E-commerce: Products, reviews and price history can now be extracted at scale with robust error handling and self-healing logic.
  • Financial analysis: Market news and asset data can now be extracted dynamically, then fed into models with semantic understanding to generate uniformly structured outputs.
  • Healthcare: Clinical research and academic papers are now being mined to accelerate drug discovery and monitor trends in the public health sector.
  • Intelligence: Competitive-intelligence teams can now analyze data from many sources to understand market trends, and governments can combine multiple sources to inform international intelligence decisions.

Architecture of intelligent data transformation pipelines

Data transformation workflow

Now, let’s take a look at the architecture behind data transformation. We’ll crawl pages for extraction, then transform and save our AI-ready data for use. We’ll outline this process using traditional extraction methods in Python and also next-gen methods using Claude.

Manual pipeline

Here’s the code for our manual pipeline. While we’re below our target number of pages, we get a page and extract its content. Then, we clean and format the data before adding it to our scraped_books. Once we’ve cycled through all the pages, we save the data to a JSON file.

import requests
from bs4 import BeautifulSoup
import json

PAGES_TO_SCRAPE = 10

page_number = 1

scraped_books = []

while page_number <= PAGES_TO_SCRAPE:

    # get the page
    response = requests.get(f"https://books.toscrape.com/catalogue/page-{page_number}.html")
    # initiate the parser
    soup = BeautifulSoup(response.text, "html.parser")

    # find the books on the page
    books = soup.select("article[class='product_pod']")

    for book in books:
        # find the image and extract the title from its alt text
        img = book.find("img")
        title = img.get("alt")
        # find and extract the price text
        price = book.select_one("p[class='price_color']").text

        extracted_data = {
            "title": title,
            # drop the leading currency characters from the price text
            "price": price[2:]
        }
        scraped_books.append(extracted_data)
    # on to the next page
    page_number += 1

with open("manually-scraped-books.json", "w", encoding="utf-8") as file:
    json.dump(scraped_books, file, indent=4, ensure_ascii=False)

Autonomous crawl and extraction with Claude

Our prompt to Claude is far smaller than the manual code. Please note that we used the MCP server from Bright Data to give Claude access to external tooling. Currently, Bright Data offers a free plan allowing you to make up to 5,000 requests using their MCP server — this makes it really easy to test it out. You can find a comprehensive list of MCP servers for all sorts of different purposes on GitHub.

Use your tools to crawl the first 10 pages of https://books.toscrape.com. I need the title (string) and price (float) of each book. Then output them all in a json file that looks like the one here.
    [
        {"title": "A Light in the Attic", "price": "51.77"},
        {"title": "Tipping the Velvet", "price": "53.74"}
    ]

Below, you can see what the prompt looks like when used with Claude. Claude accepts the task and begins to plan out the job.

Our initial prompt and Claude getting started.

Crawling

Here’s our basic crawling script. We set a page limit using PAGES_TO_SCRAPE. Then, page_number keeps track of which page we’re on. We use a list, scraped_books, to hold our extracted data. Once we’ve finished processing a page, we increment page_number and move on to the next one. When our limit has been hit, the loop exits.

import requests
from bs4 import BeautifulSoup
import json

PAGES_TO_SCRAPE = 10

page_number = 1

scraped_books = []

while page_number <= PAGES_TO_SCRAPE:

    # get the page
    response = requests.get(f"https://books.toscrape.com/catalogue/page-{page_number}.html")

    """
    Extraction logic goes here...
    """

    # on to the next page
    page_number += 1

As you can see in the figure below, Claude already has the crawl planned out. It starts by scraping pages one and two. Then, it continues to call scrape_as_markdown until all the pages have been fetched.

Claude continues calling scrape_as_markdown until it has gotten all the pages.

Extraction

Now, take a look at our extraction logic in Python. First, we need to create a BeautifulSoup object out of our response. We can then use soup.select() to find all the books on the page and parse the book object further. img.get() returns the alt text of the image — this yields the book title. We extract the price of the book using its text.

# initiate the parser
soup = BeautifulSoup(response.text, "html.parser")
# find the books on the page
books = soup.select("article[class='product_pod']")

for book in books:
    # find the image and extract the title from its alt text
    img = book.find("img")
    title = img.get("alt")
    # find and extract the price text
    price = book.select_one("p[class='price_color']").text

Claude writes an extraction function and tests it on the first page. Then, it decides to continue on with the crawl.

Claude testing its extraction.

Transformation

Our transformation step in the Python example is extremely simple. We did this for conceptual purposes. In production, this step often requires complex cleaning and restructuring of the data. price[2:] allows us to ignore the currency characters at the start of the price text. Dropping them leaves a clean numeric string that can be cast to a float and used in calculations.

# drop the leading currency characters from the price text
"price": price[2:]
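The slice works for this site, but a production transformation would usually normalize prices more defensively. Here is a small sketch using a regular expression to pull out the numeric part regardless of currency symbol or encoding quirks; parse_price is our own helper, not part of the pipeline above:

```python
import re

def parse_price(raw: str) -> float:
    """Extract the numeric value from a price string such as '£51.77',
    '$1,299.00', or mojibake like 'Â£51.77'."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", raw)
    if match is None:
        raise ValueError(f"no numeric price found in {raw!r}")
    return float(match.group().replace(",", ""))

print(parse_price("£51.77"))     # 51.77
print(parse_price("$1,299.00"))  # 1299.0
```

Unlike a fixed slice, this keeps working when the currency symbol changes or the response encoding shifts the character count.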

Claude also runs into a parsing issue of its own that impacts the data. This is where the “self-healing” logic actually kicks in. Without any developer input whatsoever, Claude recognizes the issue and implements a fix on the fly, before the crawl has finished.

Claude removing the "£" from the price.

Saving the data

We’ve already added all of our scraped books to a list. We then use json.dump() to save our extracted data to a file. Once the file’s been saved, our data’s ready to be used.

with open("manually-scraped-books.json", "w", encoding="utf-8") as file:
    json.dump(scraped_books, file, indent=4, ensure_ascii=False)
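Once saved, the JSON is plain structured data that any downstream tool can load and use. A quick sketch, with stand-in records in the same shape the pipeline produces:

```python
import json

# stand-in records in the same shape the pipeline saves
scraped_books = [
    {"title": "A Light in the Attic", "price": "51.77"},
    {"title": "Tipping the Velvet", "price": "53.74"},
]
with open("manually-scraped-books.json", "w", encoding="utf-8") as file:
    json.dump(scraped_books, file, indent=4, ensure_ascii=False)

# load the file back and put the data to work
with open("manually-scraped-books.json", encoding="utf-8") as file:
    books = json.load(file)

average_price = sum(float(book["price"]) for book in books) / len(books)
print(f"{len(books)} books, average price {average_price:.2f}")
```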

Similarly, Claude formats all of our data into the same type of list with the same schema. All of this was generated autonomously. The only developer input occurs when Claude takes too long and stalls; the developer simply hits a button to “nudge” Claude to continue working. Once the job is finished, click the “Copy” dropdown and select “Download as json.”

Saving Claude's output.

Current challenges and emerging solutions

Site changes and schematic diversity are two of the biggest pain points in traditional web data infrastructure. With intelligent infrastructure, these pain points get handled gracefully with little to no human work at all — as you saw when Claude fixed extraction mid-job.

Site changes

  • Problem: Broken selectors often lead to failed extraction. Before AI-powered extraction, best practice was retry logic and eventual human review.
  • Solution: Most major industry players are moving away from hardcoded selectors. When the page is parsed by an AI model, the model searches semantically rather than matching a literal selector — much as a human reads page data.

Diversity of format and schema

  • Problem: Traditional scrapers return structured data, but these structures are not uniform: schemas vary, with missing and extra fields. This requires an additional transformation layer, often held together with brittle code.
  • Solution: LLMs are generative. They take input and generate output. When using AI models to parse a page, you can give them any format or schema you want: JSON, CSV or XML — the list goes on and on. If you can think of a format, they’ve likely been trained on its structure. Give your model an example data structure and it will generate extraction outputs based on that structure.
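One cheap safety net when a model generates your structures is to validate each record against the schema you asked for. A minimal sketch, where the expected schema and sample outputs are illustrative:

```python
# the schema we asked the model for: field name -> expected type
EXPECTED_SCHEMA = {"title": str, "price": float}

def matches_schema(record: dict) -> bool:
    """True if the record has exactly the expected fields and types."""
    return (set(record) == set(EXPECTED_SCHEMA)
            and all(isinstance(record[key], expected)
                    for key, expected in EXPECTED_SCHEMA.items()))

# illustrative model outputs: one good record, one with a wrong type
good = {"title": "A Light in the Attic", "price": 51.77}
bad = {"title": "Tipping the Velvet", "price": "53.74"}
print(matches_schema(good), matches_schema(bad))  # True False
```

Records that fail the check can be routed back to the model for another pass, which is exactly the kind of loop an intelligent pipeline automates.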

The future of intelligent extraction

As time goes on, extraction software is only getting more intelligent. With increased token limits, LLMs are getting better at parsing efficiently. External tooling allows them to execute tasks with autonomy. Intelligent pipelines allow the system to take in any data format and generate output with uniform schema — tailored to fit directly inside your AI system.

  • AI-powered parsing: AI-powered extraction systems are self-healing. If a selector isn’t found, the model looks elsewhere on the page.
  • MCP servers: MCP gives AI agents access to tools they can control. Right now, AI agents are navigating the web autonomously using real browsers and rendering pages.
  • Intelligent data pipelines: Generative models can parse almost any input with semantic understanding of the data. They can then generate outputs that restructure this data into any schema or format they’ve been trained on.

Conclusion: The future of extraction lies in natural language

As time goes on, data extraction will continue to shift more and more toward natural language. As you saw from Claude, end-to-end pipelines are now entirely possible using no-code development.

During our experiment, the only pain point was the amount of time Claude took to perform the extraction. That said, building an extraction pipeline with natural language is still faster, more robust, and more efficient than writing the software by hand in Python.

Data4AI vendors such as Firecrawl and Bright Data are already offering end-to-end extraction with zero coding involved. The border between idea and invention continues to narrow. Within a few years, hardcoded extraction is very likely to be a relic of the past.