Modern Python SDKs: Smarter web scraping and data extraction with less code

Discover how today’s Python SDKs let developers build full data pipelines in just a few lines of code and improve web scraping efficiency

What is an SDK?

A Software Development Kit (SDK) is an external package or library built to make software development easier. SDKs let you skip much of the boilerplate involved in writing code. Instead of writing a large template just to start a scraper, you simply import the SDK and start scraping.

Focus on logic, not boilerplate

One of the most tedious pain points in software development comes from writing boilerplate code. To get a better understanding of the boilerplate involved in scraping, let’s take a look at a standard template for a web scraper.

The code below holds zero actual scraping logic. We set up a dummy proxy connection and use a while loop to hold our actual scraper. First, we get() the page. Then we pass it into BeautifulSoup’s HTML parser and write hardcoded extraction logic from there. Everything is handled manually, which makes for brittle code that breaks when things don’t go as expected.

import requests
from bs4 import BeautifulSoup
import json

PAGES_TO_SCRAPE = 10

page_number = 1

scraped_data = []

proxy_url = "http://<your-username>:<your-password>@<your-provider-url>:<your-port-number>"
proxies = {
    "http": proxy_url,
    "https": proxy_url
}

while page_number <= PAGES_TO_SCRAPE:

    # get the page, passing page_number into the URL
    response = requests.get(f"https://example.com/page/{page_number}", proxies=proxies)
    # initiate the parser
    soup = BeautifulSoup(response.text, "html.parser")

    """
    - scraping logic goes here
    - find target objects on the page
    - iterate through the objects and extract their data
    ...
    """
    page_number += 1

with open("manually-scraped-data.json", "w", encoding="utf-8") as file:
    json.dump(scraped_data, file, indent=4, ensure_ascii=False)

Without any real logic, our file is already 33 lines. If you add in retry logic, proxy rotation and data extraction, this file could easily grow to 70 lines, and that’s for a simple website. SDKs are designed to remove most of this boilerplate: instead of starting with 33 lines, you should be able to get started with three.
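To see how quickly that boilerplate grows, here is a rough sketch of the retry and proxy-rotation logic mentioned above. The proxy URLs are placeholders, not a real provider’s endpoints, and the backoff values are illustrative.

```python
import random
import time

import requests

# Placeholder proxy pool -- swap in your provider's real endpoints.
PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
]

def fetch_with_retries(url, max_retries=3, backoff=2):
    """Fetch a URL through a rotating proxy, retrying with exponential backoff."""
    for attempt in range(max_retries):
        proxy = random.choice(PROXY_POOL)  # rotate proxies on every attempt
        try:
            response = requests.get(
                url,
                proxies={"http": proxy, "https": proxy},
                timeout=10,
            )
            response.raise_for_status()
            return response
        except requests.RequestException:
            time.sleep(backoff ** attempt)  # wait longer after each failure
    raise RuntimeError(f"failed to fetch {url} after {max_retries} attempts")
```

Stack this on top of pagination, parsing and file output, and the 70-line estimate starts to look conservative.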

The sad reality of most web scraping SDKs

For the last decade, providers have offered SDKs, but the offerings often come with glaring limitations. They provide an incomplete product suite, dump raw HTML and surface obscure error types that developers just don’t have time to deal with.

Rather than learning new data types and memorizing entire libraries, developers tend to go back to their comfort zone — basic HyperText Transfer Protocol (HTTP) requests and the manual parsing you see from the example above.

Why spend valuable hours learning a new SDK when a few lines of boilerplate does the job just as well with no learning required?

How and why are SDKs changing now?

Since the rise of AI and web unblockers, SDKs are getting smarter. Newer packages are offering dynamic extraction with robust proxy solutions and even coding logic executed via natural language — instead of writing scrapers, you can plug tools into an AI agent and simply tell it what to do.

As the industry itself continues to shift toward no-code scraping solutions, even traditional coding requires much less work.

Providers are all offering smarter SDKs that allow you to get more done using less code. Packages are now empowering you to start scraping in minutes. Take a look at some of the SDKs offered by our vendors here at Data4AI.

  • Bright Data: Full integration with web unblocking, headless browsers, automated parsing and extraction and much more. Create full data pipelines in just a few lines of code.
  • Firecrawl: Automate full crawls with custom output formats like markdown and HTML. Extract sitemaps and even run asynchronous jobs for reactive programming.
  • Reworkd: Built on top of the Harambe web scraping SDK, automate entire data pipelines straight from your Python environment.
  • ZenRows: Manage proxies, automated extraction and custom hardware profiles all from a single library. Make asynchronous requests and even render a headless browser all using a single package.

The state of scraping and extraction SDKs

The packages listed above are just a few of the scraping SDKs available right now. Almost every web data infrastructure provider is racing to make its offerings smarter and more robust, and as time goes by, you’re going to get more done with less code.

Web data collection SDKs are all heading in the same direction. Here are some common features being offered in web scraping SDKs as of 2025.

  • Automated unblocking: Gone are the days of manual proxy management. Application Programming Interfaces (APIs) now connect, rotate and manage your proxy health.
  • Browser rendering: Local browsers make your data pipeline vulnerable to local problems such as power outages and connectivity drops. Run a browser in the cloud to mitigate these issues.
  • AI integration: With intelligent extraction, developers can simply tell AI models what to extract via natural language processing (NLP). Simple website changes no longer break the pipeline.
  • Custom schemas and outputs: Generative AI allows models to take any input, extract your data and produce a custom output — JSON, XML, CSV, Parquet or even something you make up yourself.
  • Model Context Protocol (MCP): This is the most dramatic shift. SDKs themselves are being replaced by MCP servers. Instead of writing code, developers plug tools into a model and tell the model what to do.
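The custom-output idea is easy to picture without any provider at all: once extraction returns structured records, re-serializing them into another format is a few lines of standard-library Python. The records below are made-up sample data, not real scraper output.

```python
import csv
import io
import json

# Made-up sample records, shaped like an extraction step might return them.
records = [
    {"title": "A Light in the Attic", "price": "51.77"},
    {"title": "Tipping the Velvet", "price": "53.74"},
]

def to_json(records):
    """Serialize records as pretty-printed JSON."""
    return json.dumps(records, indent=4)

def to_csv(records):
    """Serialize the same records as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "price"])
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

The point of the newer SDKs is that this re-shaping happens for you: you declare the schema or format you want, and the same records come back as JSON, CSV or anything else.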

Bright Data’s Python SDK: The code speaks for itself

Now that we know where web scraping SDKs are heading, it’s time to see what they look like in 2025. This is a night-and-day difference from even just a couple of years ago. Enough talk, we’ll let the code speak for itself.

Innovations and simplicity

The main methods behind the Bright Data Python SDK are scrape() and search().

Installation

pip install brightdata-sdk

scrape()

client.scrape("https://books.toscrape.com")

This automates a large portion of our boilerplate. With scrape(), the SDK sets up a connection via the Unlocker API, solves any CAPTCHAs and returns the page.

To connect via the Unlocker API directly, you would normally send a POST request with all of your parameters in a JSON payload. As you can see in the snippet below, this is far more work than the single line we used above. Even this simple POST request is more involved than the proxy integration we used in the manual scraper at the beginning of this piece.

import requests

url = "https://api.brightdata.com/request"

payload = {
    "zone": "web_unlocker1",
    "url": "https://example.com/page",
    "format": "json",
    "method": "GET",
    "country": "us",
    "data_format": "markdown"
}
headers = {
    "Authorization": "Bearer <token>",
    "Content-Type": "application/json"
}

response = requests.post(url, json=payload, headers=headers)

print(response.json())

The search() method is just as expressive. A single line of code retrieves SERP results and returns them in cleanly structured JSON.

client.search("best Python web scraping libraries", parse=True)

Writing a SERP scraper from scratch would be a completely separate project, so we’re not going to give a code example of it. A sample of the results is shown in the snippet below.

"organic": [
    {
        "link": "https://www.reddit.com/r/Python/comments/vncw6d/what_is_the_best_library_for_website_scraping/",
        "display_link": "40+ comments \u00b7 3 years ago",
        "title": "What is the best library for website scraping? : r/Python",
        "description": "I tend to use Beautifulsoup for simple things, and Selenium for more complex things (sites with Javascript, or where headless scraping doesn't ...",
        "rank": 1,
        "global_rank": 1
    },
    ...

search() gives us structured JSON results with a consistent schema. Even as-is, this output is ready to use in most programming environments.
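As a quick illustration of how ready-to-use that schema is, the snippet below pulls the rank, title and link out of a trimmed-down copy of the output above. The serp dict is a stand-in for the parsed return value of search(), not a live API call.

```python
# Trimmed-down stand-in for the parsed output of client.search().
serp = {
    "organic": [
        {
            "link": "https://www.reddit.com/r/Python/comments/vncw6d/what_is_the_best_library_for_website_scraping/",
            "title": "What is the best library for website scraping? : r/Python",
            "rank": 1,
        },
    ]
}

def top_results(serp, n=3):
    """Return (rank, title, link) tuples for the first n organic results."""
    return [(r["rank"], r["title"], r["link"]) for r in serp["organic"][:n]]

for rank, title, link in top_results(serp):
    print(rank, title, link)
```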

What once took hundreds of lines of code now takes less than 50

Now, let’s take those two examples from above and expand this into a multifunctional scraper. Our scraper will first perform a search for best Python web scraping libraries. Then, it’ll extract all the text, links and images from Books to Scrape. Finally, with just a few more lines of code, we’ll record a prompt and response from ChatGPT. All data will be saved to individual JSON files.

We’re creating three separate, fully functional data pipelines using less code than developers traditionally use in a single scraper.

from brightdata import bdclient
import json

# initialize the client
client = bdclient(
    api_token="<your-bright-data-api-key>",
    web_unlocker_zone="<your-zone-name>"
)

# search the web
results = json.loads(client.search("best Python web scraping libraries", parse=True))

with open("google-results.json", "w") as file:
    json.dump(results, file, indent=4)

# scrape a page -- extract text, links and images
scraped_books = client.parse_content(
    client.scrape("https://books.toscrape.com"),
    extract_text=True,
    extract_links=True,
    extract_images=True
)

with open("scraped-books.json", "w") as file:
    json.dump(scraped_books, file, indent=4)

prompt = """
What can you tell me about Data4AI (https://data4ai.com)? Please keep your answer short and factual.
"""

# ask ChatGPT about Data4AI, record the response
ai_response = client.search_chatGPT(prompt=prompt, web_search=True, sync=True)

response = ai_response["answer_text"]

prompt_response_object = {
    "prompt": prompt.strip(),
    "response": response
}

with open("prompt-response.json", "w") as file:
    json.dump([prompt_response_object], file, indent=4, ensure_ascii=False)

  • SERP data pipeline: client.search() handles the entire SERP scrape from start to finish. It actually takes more code to write the JSON file than to get the search results.
  • Books data pipeline: We get our results using client.scrape() and we pass its output directly into client.parse_content() which extracts the data.
  • ChatGPT pipeline: With client.search_chatGPT(), we get the power to simply ask a question and receive an answer. We set both web_search and sync to True. This allows us to maintain a connection with the chat session and record our results once they come in.

With the Bright Data Python SDK, not only can you start scraping in minutes, you can automate entire pipelines in just two or three lines of code.
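One small design note: the script above writes three JSON files with nearly identical with open(...) blocks. If you find yourself repeating that pattern, a tiny helper (our own addition, not part of the SDK) keeps each pipeline to a single save call.

```python
import json

def save_json(path, data):
    """Write any JSON-serializable object to disk, pretty-printed."""
    with open(path, "w", encoding="utf-8") as file:
        json.dump(data, file, indent=4, ensure_ascii=False)

# Each pipeline then ends in one line, e.g.:
# save_json("google-results.json", results)
```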

Conclusion: Let the past REST in peace

Python files are getting smaller and smaller. What once took hours, even days, can now be accomplished with an API key and just a few lines of code. In the past, Representational State Transfer (REST) was the standard — HTTP requests with manual parsing and proxy rotation. In this day and age, skipping the boilerplate is just the tip of the iceberg. What once took weeks to develop can be accomplished in a single day — sometimes even just a few minutes.