Search and extract: Why it matters

See how automated search and extraction turn the chaotic web into structured, machine-ready data. Learn the key steps, essential APIs, and tools, and how to build this workflow.

The internet is the largest repository of human knowledge we’ve ever created. Every second, billions of pages update and react to real-world events. When data scales to this point, identifying a data source is perhaps one of the hardest challenges of all. You (or your systems) need to search for sources.

Before the internet, people went to a library (or some other knowledge repository) and searched physically, flipping through books by hand. Results were confined to whatever existed within the building. Nowadays, a search engine sorts and filters through all of the “buildings” simultaneously. A search that once yielded ten or fifteen results now yields hundreds, thousands or even millions. At that scale, the results become impossible to manage without proper tooling.

In this endless ocean of information, how do collection systems choose a data source?

The short answer is rather simple and boring. Advanced systems use search engines, just like people do. First, they perform a search to find the relevant information. Then they collect information from each relevant URL. AI agents make this process even more streamlined. A mature system performs a search and identifies only the results that matter. Relevant results are then targeted for further collection and extraction. In the past, people would’ve extracted the knowledge by opening the books. Similarly, software systems open the sites and read their content.

By the time you’re finished with this article, you’ll be able to answer the following questions.

  • What does a search and extract workflow actually look like?
  • What types of tools are needed?
  • How can your team build a search and extract pipeline today?

Why search and extract matters

Software systems, even AI-powered ones, don’t actually know anything beyond their training data. To perform complex tasks, intelligent systems need context. This context flows downstream from a data pipeline into training pipelines, Retrieval-Augmented Generation (RAG) pipelines or some other external data store for the system to reference.

Now, we can break this into two core concepts.

  • Search: Sort and filter to identify the most relevant pages that match a query.
  • Extract: Fetch the important results and extract their data into a structured format for downstream usage.

The search phase determines which sources get funneled into your data pipeline. The extraction phase pulls and shapes the data so the pipeline can work with it. Your pipeline doesn’t consist of a single scraper. Web scraping is just a piece of the puzzle.

Without large-scale (usually automated) discovery and extraction, models rely on stale, incomplete and often biased datasets. The best pipelines use an automated loop where the search and extract process continually discovers new results and outputs structured data.

The anatomy of a search-and-extract workflow

Search and extract workflow loop

Functional systems use repeatable workflows. Each step is designed to build on top of the last to ensure both relevance and usability. Here’s what a basic search and extract workflow looks like.

  1. Query: Everything begins with a query. The goal here is to find the best potential data sources. We don’t want to fetch data blindly. We want to discover the best pages possible.
  2. Filtering: As mentioned earlier, a single query can yield millions of results. These results need to be sorted and filtered based on relevance. Ads and duplicates need to be dropped. Irrelevant sources should be ignored.
  3. Extraction: Once our search has finished, we need to target the results. Each page needs to be fetched using either HyperText Transfer Protocol (HTTP) requests or an automated web browser.
  4. Balancing: Once we’ve got our data, we need to balance it and mitigate bias. If a dataset shows bias, we can either remove some datapoints or add some synthetic data to negate skewed patterns.
  5. Enrichment: Here, we add columns to datasets. Notes and metadata are used to help models highlight relationships within existing data.
  6. Delivery: Once the data’s been prepared, it gets transported to its destination. The spectrum here is quite broad. Some systems might deliver to a SQL database directly, others might drop a CSV in your inbox. Many teams prefer delivery straight to their cloud system like AWS S3 or Google Drive.
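The six steps above can be sketched as a single pipeline function. The sketch below is purely illustrative: every helper is a stand-in for real search, extraction, and delivery tooling, not part of any actual SDK.

```python
import json

# Minimal sketch of the six-step workflow. All helper functions
# are hypothetical placeholders for real tooling.

def search(query):
    # 1. Query: stand-in for a SERP API call.
    return [
        {"url": "https://example.com/pizza-guide", "rank": 1},
        {"url": "https://ads.example.com/promo", "rank": 2},
    ]

def is_relevant(result):
    # 2. Filtering: drop ads and irrelevant sources.
    return "ads." not in result["url"]

def extract(url):
    # 3. Extraction: stand-in for an HTTP fetch plus parsing.
    return {"url": url, "text": "page content"}

def balance(records):
    # 4. Balancing: here, trivially deduplicate by URL.
    seen, out = set(), []
    for rec in records:
        if rec["url"] not in seen:
            seen.add(rec["url"])
            out.append(rec)
    return out

def enrich(record):
    # 5. Enrichment: add metadata columns.
    record["source_domain"] = record["url"].split("/")[2]
    return record

def deliver(records, path="dataset.json"):
    # 6. Delivery: write structured output for downstream use.
    with open(path, "w") as f:
        json.dump(records, f, indent=4)

def run_pipeline(query):
    results = search(query)
    relevant = [r for r in results if is_relevant(r)]
    records = [extract(r["url"]) for r in relevant]
    records = [enrich(rec) for rec in balance(records)]
    deliver(records)
    return records
```

In a real system, each placeholder would be backed by an API call or a dedicated service, but the control flow stays the same.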

APIs and tools for search and extraction

We need to choose the correct tools for each step of the workflow. At each step of the data pipeline, Application Programming Interfaces (APIs) reduce boilerplate and get your team closer to shipping their work.

  • Search Engine Results Page (SERP) APIs: Retrieve search engine results in structured data formats. A messy HTML page gets converted to a uniform list of JSON objects including things like page titles, search rankings and the URL of the content.
  • Unlocker and Browser APIs: Under the hood, most web data moves using HTTP. If your data source renders content dynamically, a headless browser API is usually required. For static pages, unlocker APIs are often sufficient.
  • Collection APIs: These tools trigger scrapers or crawlers on-demand. Unlocking and parsing are often baked into these tools so teams don’t need to worry about them directly.
  • Post-processing APIs: Trigger automatic transformation and Quality Assurance (QA) processes. Here, teams might upload a dataset for enrichment and then use a separate API to run sanity checks on datasets before they move to production.

These tools are the backbone of most modern web data infrastructure. By combining tools mentioned above, your team can skip the boilerplate and focus on building things using enterprise-grade data.
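To make the post-processing idea concrete, a team might run simple sanity checks over a dataset before delivery. This is a generic sketch of such a QA pass, not any particular vendor’s API, and the rules shown (valid scheme, non-empty content, no duplicates) are just examples.

```python
from urllib.parse import urlparse

def sanity_check(records):
    """Split records into (passed, failed) using simple QA rules."""
    passed, failed = [], []
    seen_urls = set()
    for rec in records:
        url = rec.get("url", "")
        ok = (
            urlparse(url).scheme in ("http", "https")  # fetchable URL
            and bool(rec.get("text", "").strip())      # non-empty content
            and url not in seen_urls                   # no duplicates
        )
        (passed if ok else failed).append(rec)
        seen_urls.add(url)
    return passed, failed

records = [
    {"url": "https://example.com/a", "text": "useful content"},
    {"url": "https://example.com/a", "text": "duplicate row"},
    {"url": "ftp://example.com/b", "text": "wrong scheme"},
    {"url": "https://example.com/c", "text": "   "},
]
passed, failed = sanity_check(records)
print(len(passed), len(failed))  # 1 3
```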

Tutorial: Building a simple search-and-extract pipeline

To start, we’ll need to install Bright Data’s Python SDK. Most providers offer similar alternatives. Software SDKs can reduce error rates and drastically shorten coding time. Rather than writing raw HTTP requests, we simply import a tool class and use it to call different methods along the way.

You can install the SDK using pip.

pip install brightdata-sdk

Modern Software Development Kits (SDKs) abstract away most of the boilerplate code. Traditionally, data pipelines are built using raw HTTP requests and custom pipeline components. Tools such as this can drastically reduce the time it takes to go from idea to production.

Using bdclient, we authenticate with our API token and gain access to a fully functional SERP API. The search() method runs a search and returns our results. We can then use the parse_content() method to parse the results into structured data.

from brightdata import bdclient
import json

client = bdclient(
    api_token="<your-bright-data-api-key>",
)

search_results = client.search(
    query="detroit's best pizza",
    data_format="json"
)

clean_data = client.parse_content(
    search_results,
    extract_links=True,
)

with open("search-results.json", "w") as file:
    json.dump(clean_data, file, indent=4)

The search results below have been shortened. Each result includes a link to the content as well as the text shown in the result. These are the same basic results you would see in your browser; they’ve just been formatted into structured JSON so AI systems and data pipelines can use them more easily.

[
    {
        "url": "https://www.waywardblog.com/best-detroit-pizza-restaurants/",
        "text": "Three Essential Detroit Pizza Places You Need to Trywaywardblog.comhttps://www.waywardblog.com \u203a best-detroit-pizza-rest..."
    },
    {
        "url": "https://www.tripadvisor.com/Restaurants-g42139-c31-Detroit_Michigan.html",
        "text": "THE 10 BEST Pizza Places in Detroit (Updated 2025)Tripadvisorhttps://www.tripadvisor.com \u203a ... \u203a Detroit Restaurants"
    },
    {
        "url": "https://www.reddit.com/r/Detroit/comments/xuq5qi/whats_the_best_spot_for_detroit_style_pizza/",
        "text": "What's the best spot for Detroit Style Pizza?Reddit\u00a0\u00b7\u00a0r/Detroit270+ comments  \u00b7  3 years ago"
    },
    {
        "url": "https://www.reddit.com/r/Detroit/comments/xuq5qi/whats_the_best_spot_for_detroit_style_pizza/iqx91yy/",
        "text": "273 answers"
    }
]
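With results in this shape, the filtering step from the workflow becomes a few lines of ordinary Python. The snippet below assumes the JSON structure shown above; the relevance rule is a deliberately simple keyword check, and keeping one result per domain is just one possible deduplication policy.

```python
from urllib.parse import urlparse

# Sample results in the shape returned above (texts shortened).
search_results = [
    {"url": "https://www.waywardblog.com/best-detroit-pizza-restaurants/",
     "text": "Three Essential Detroit Pizza Places You Need to Try"},
    {"url": "https://www.tripadvisor.com/Restaurants-g42139-c31-Detroit_Michigan.html",
     "text": "THE 10 BEST Pizza Places in Detroit (Updated 2025)"},
    {"url": "https://www.reddit.com/r/Detroit/comments/xuq5qi/whats_the_best_spot_for_detroit_style_pizza/",
     "text": "What's the best spot for Detroit Style Pizza?"},
    {"url": "https://www.reddit.com/r/Detroit/comments/xuq5qi/whats_the_best_spot_for_detroit_style_pizza/iqx91yy/",
     "text": "273 answers"},
]

def filter_results(results, keyword):
    """Keep one result per domain whose text mentions the keyword."""
    seen_domains, kept = set(), []
    for r in results:
        domain = urlparse(r["url"]).netloc
        if keyword.lower() in r["text"].lower() and domain not in seen_domains:
            seen_domains.add(domain)
            kept.append(r)
    return kept

kept = filter_results(search_results, "pizza")
print([urlparse(r["url"]).netloc for r in kept])
```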

Extracting content

Once we’ve got search results, we need to extract data from the individual results. In the example below, we use scrape() to access an individual site. Once again, we use parse_content() to extract the links from the page.

from brightdata import bdclient
import json

client = bdclient(
    api_token="<your-bright-data-api-token>",
)

resp = client.scrape("https://www.waywardblog.com/best-detroit-pizza-restaurants/")

pizza_places = client.parse_content(resp, extract_links=True)

with open("pizza.json", "w") as file:
    json.dump(pizza_places, file, indent=4)

You can view our results in the image below. Like our search results above, parse_content() provides us with data structures that include the text of each link as well as its URL.

Links extracted from an individual site
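To connect this back to the enrichment step, the extracted link records can be annotated with metadata columns before delivery. The record shape below is an assumption based on the output shown, not a guaranteed schema, and the specific columns (domain, path depth, source query, timestamp) are just examples of useful metadata.

```python
from urllib.parse import urlparse
from datetime import datetime, timezone

# Example record in the shape produced by the extraction step.
extracted_links = [
    {"url": "https://www.waywardblog.com/best-detroit-pizza-restaurants/",
     "text": "Three Essential Detroit Pizza Places"},
]

def enrich(record, source_query):
    # Add metadata columns that help downstream models
    # relate records to each other.
    parsed = urlparse(record["url"])
    record["domain"] = parsed.netloc
    record["path_depth"] = len([p for p in parsed.path.split("/") if p])
    record["source_query"] = source_query
    record["collected_at"] = datetime.now(timezone.utc).isoformat()
    return record

enriched = [enrich(dict(r), "detroit's best pizza") for r in extracted_links]
print(enriched[0]["domain"], enriched[0]["path_depth"])
# www.waywardblog.com 1
```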

Conclusion

When we search and extract, messy, unstructured web data is reshaped into something machines can use. The same concepts shown above apply to model training, RAG pipelines and agentic workflows.

There are all sorts of tools you can use for search and extract workflows. You might code your own client using a REST API and raw HTTP, or use the Model Context Protocol (MCP) to plug these tools into AI agents for code-free search and extract workflows.