Access Granted: A scalable guide to unblocking websites for AI data collection

Explore how modern AI teams — machine learning engineers, AI developers and anyone else needing programmatic access to web data — can effectively get past anti-bot barriers

Automated access to websites can be challenging in traditional scraping pipelines. When AI agents are introduced into the mix, these problems compound with a new architecture and no clearly defined standard. In this guide, you’ll learn how to navigate the following challenges.

  • Rate Limits
  • CAPTCHAs
  • Device Fingerprinting
  • JavaScript Rendering
  • Geoblocking

Introduction: The data access bottleneck for AI

AI systems need access to real-time data. Whether you’re training AI or building an agent to make decisions based on context — your model needs fresh and relevant data. However, some of the most valuable web data sits behind layers of anti-bot technology.

AI agents often make these challenges even more difficult to navigate. Traditionally, you would hardcode human-like actions into your scraper: clicks, scrolls and waits. AI agents have read about human behavior; however, they haven’t necessarily learned to act human. C-3PO can handle human-cyborg relations, but it’s still obviously a droid. LLMs might understand how to use an MCP server, but they don’t always understand how to look like a human when doing it.

Understanding common website blocking techniques

Most sites don’t rely on a single method to block unwanted traffic. They use a layered stack that operates both client side and server side. Let’s take a look at each of these methods to understand what they do and why they matter.

  • Rate Limits: This is done to protect servers from overload. When requests come from a single IP address too quickly, the client is told to wait. This is often done using status code 429: “Too Many Requests.”
  • CAPTCHAs: CAPTCHAs are among the most common obstacles in data extraction. If your agent or scraper receives a CAPTCHA, the site already suspects programmatic access; your bot has been spotted.
  • Device Fingerprinting: Modern browsers expose details such as hardware specs, installed fonts and screen resolution. This information gets hashed along with your IP address and software profile to create a unique identifier. This hash is then used to track your activity.
  • JavaScript Rendering: Many sites use JavaScript to render content dynamically. When you click a button to load something on the page, this is JavaScript in action. Without JavaScript, the content can’t be loaded and accessed.
  • Geoblocking: If your scraper or agent shows up in a different country than your target region, you may be shown different content or may not have access to certain data that is only available to users in specific locations.

Some of the best anti-bot products on the market combine several of the techniques listed above.

  • Cloudflare Turnstile: Uses rate limiting, device fingerprints, JavaScript and IP analysis to detect and manage automated traffic.
  • Datadome: Device fingerprinting, behavior analysis and machine learning to identify and respond to bot activity.

Building your unblocking toolkit: Essential tools and services

With AI agents, we can employ the same techniques used in traditional web scraping. The list below is non-exhaustive, but it gives you a strong foundation. For context, you should understand that all browsers communicate over the internet using HyperText Transfer Protocol (HTTP). You can learn more about HTTP in Mozilla’s MDN documentation.

Proxy services

With a proxy service, your HTTP traffic is routed through another machine — a proxy. Let’s pretend your browser wants to fetch a site. Typically, your browser sends a GET request to the target site. Then, the site sends back an HTML page for your browser to render.

With a proxy, there are a few more steps between your machine’s GET and the response it receives.

  1. Your machine sends a GET request to the proxy provider’s server.
  2. The server routes that GET request through another machine — the proxy.
  3. The proxy fetches the page from the target site.
  4. The proxy sends the HTML page back to the provider’s server.
  5. The server forwards the HTML page to your browser.
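The flow above can be sketched with nothing but the Python standard library. The gateway address and credentials below are placeholders, not a real provider endpoint; substitute the values your own proxy service gives you.

```python
# Minimal sketch of routing HTTP traffic through a proxy using only the
# standard library. The proxy URL is a placeholder: substitute the gateway
# address and credentials from your own provider.
import urllib.request

def build_proxy_opener(proxy_url: str) -> urllib.request.OpenerDirector:
    """Return an opener that sends all HTTP/HTTPS traffic through proxy_url."""
    handler = urllib.request.ProxyHandler({
        "http": proxy_url,
        "https": proxy_url,
    })
    return urllib.request.build_opener(handler)

opener = build_proxy_opener("http://user:pass@gateway.example.com:8000")
# html = opener.open("https://example.com").read()  # fetches via the proxy
```

Heavier clients such as `requests` or `httpx` expose the same idea through a `proxies` argument, but the underlying request/response flow is identical.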

This workflow adds a layer of indirection that can manage rate limits, fingerprinting and geoblocking. There are four main proxy types. Both Bright Data and Oxylabs offer these solutions.

  • Residential Proxies: Run your HTTP traffic through a residential machine on a real home network. Both static and rotating proxies are available.
  • Mobile Proxies: Your traffic runs through mobile data networks — making it difficult to differentiate between your AI agent and a mobile phone.
  • Datacenter Proxies: These proxies offer a low-cost option for scraping simpler sites.
  • ISP Proxies: Route traffic through an Internet Service Provider (ISP) of your choice.

CAPTCHA handling

We’ve all been there — using a website and BAM! You need to drop everything you’re doing and waste precious seconds of your life solving a CAPTCHA. Without CAPTCHA handling, your extraction pipeline is dead in the water.

Here are some CAPTCHA solving solutions that will help keep your AI agent up and running.

2Captcha: Route your CAPTCHA to a network of real humans for solving.

Anti-Captcha: Human-powered solving similar to 2Captcha, with support for CAPTCHAs of all shapes and sizes.

Capsolver: Use a specialized AI model to automatically recognize and handle CAPTCHAs as they appear.

Headless Browsers

Headless browsers give your AI agent the tools to render JavaScript. With a headless browser, your AI agent can do anything a human would when using the web. Many headless browsers exist, but you need to be aware of the big three. Each of these browsers can be controlled programmatically.

  • Selenium: Battle tested and industry trusted. Selenium has been used for over 20 years in data extraction and web development.
  • Puppeteer: Built on the Chrome DevTools Protocol (CDP), Puppeteer lets you automate a Chromium-based browser with AI agents or code.
  • Playwright: Playwright expanded Puppeteer’s API to all major browser engines — Chromium (Chrome/Edge), Gecko (Firefox) and WebKit (Safari).

Hybrid and Managed Solutions

The tools above, when combined, make up the full stack for managing access to websites with advanced traffic controls. Modern services provide the entire stack for you. Whether you’re using an MCP server or building a custom integration, plug your agent into these tools and let it go to work.

  • Bright Data
    • Unlocker API: Use managed proxies and CAPTCHA solving to access almost any site on the web.
    • Browser API: Combine CAPTCHA solving, proxies and a headless browser for JavaScript rendering.
  • Oxylabs
    • Scraper API: Unblock almost any website with proxies and CAPTCHA avoidance.
    • Unblocking Browser: Headless browser with CAPTCHA avoidance and proxy integration.
  • ZenRows

While many teams use unlockers and unblockers for accessing multiple domains, they’re equally useful when revisiting the same domain at scale. High frequency access to any domain can still trigger anti-bot systems, even when operating within rate limits. These managed and hybrid solutions provide stable and reliable access by automatically handling such challenges.

Strategic unblocking: Practical techniques for scalability

Whether you’re creating your own stealth toolset or using a managed solution, your AI agent can follow the strategies below to succeed at scale while optimizing resource consumption.

  • Rotate Proxies: Datacenter proxies are the default. Your system should try to access with a datacenter connection first and upgrade to either residential or mobile as a fallback.
  • Headless Browsers: These are resource intensive. A browser should only be used if JavaScript is required. Static parsers such as BeautifulSoup should be used whenever the task allows. This increases speed and lowers your operating cost.
  • CAPTCHA Handling: This should be integrated into your headless browser. Solvers aren’t needed on every request and should only run when a CAPTCHA is actually on the page. Hardcode selectors to find CAPTCHAs or prompt your agent to look for them.
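The datacenter-first, escalate-on-failure strategy can be sketched as a small loop. Everything here is a stand-in: the tier names, the injected `fetch` callable and the `is_blocked` check would all map onto your provider’s actual integration.

```python
# Sketch of tiered proxy escalation: try the cheapest pool first and only
# escalate when a request is blocked. The pools, fetch callable and
# is_blocked check are hypothetical placeholders for a real integration.
PROXY_TIERS = ["datacenter", "residential", "mobile"]  # cheapest to costliest

def fetch_with_escalation(url: str, fetch, is_blocked) -> tuple[str, str]:
    """Try each tier in order; return (tier, body) for the first success."""
    last_error = None
    for tier in PROXY_TIERS:
        body = fetch(url, tier)          # provider-specific fetch call
        if not is_blocked(body):
            return tier, body
        last_error = f"blocked on {tier} tier"
    raise RuntimeError(last_error)

# Stubbed responses: the datacenter pool is blocked, residential succeeds.
responses = {"datacenter": "403 Forbidden", "residential": "<html>ok</html>"}
tier, body = fetch_with_escalation(
    "https://example.com",
    fetch=lambda url, tier: responses.get(tier, "<html>ok</html>"),
    is_blocked=lambda body: body.startswith("403"),
)
```

Because escalation only happens on failure, the expensive residential and mobile pools are reserved for the requests that genuinely need them.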
| Tool Type | Handles CAPTCHAs | Handles JavaScript | Manages IP Rotation | Scalable at Volume | Best Use Case |
| --- | --- | --- | --- | --- | --- |
| CAPTCHA Solvers | ✅ Yes | ❌ No | ❌ No | ⚠️ Moderate | Solving visual or checkbox CAPTCHAs on demand |
| Headless Browsers | ⚠️ Conditional | ✅ Yes | ❌ No | ⚠️ Resource-Heavy | Rendering JavaScript-heavy sites; simulating users |
| Proxy Services | ❌ No | ❌ No | ✅ Yes | ✅ High | Rotating IPs to increase anonymity and access localized content |
| Anti-Bot APIs | ✅ Yes | ✅ Yes | ✅ Yes | ✅ High | End-to-end unblocking across multiple barriers |

How to manage specific blocking mechanisms

When you run into anti-bot systems, there are numerous strategies to help diagnose and mitigate the problem. The methods below will help you identify which blocking strategy is in play and apply the appropriate contingency plan.

Geoblocking: Geoblocking is typically resolved through proxy rotation. When you rotate proxies, your IP address changes. If you need to appear in a specific country, use your proxy provider’s geo-targeting features.
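Many providers select the exit country through parameters embedded in the proxy username. The `user-country-XX` format below is purely illustrative, not any specific provider’s syntax; check your provider’s documentation for the real convention.

```python
# Hypothetical sketch of geo-targeting via the proxy credentials. The
# "user-country-XX" username format is illustrative only; real providers
# each document their own parameter syntax.
def geo_proxy_url(user: str, password: str, host: str, port: int,
                  country: str) -> str:
    """Build a proxy URL that requests an exit node in a given country."""
    return f"http://{user}-country-{country.lower()}:{password}@{host}:{port}"

url = geo_proxy_url("user", "pass", "gateway.example.com", 8000, "DE")
```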

Rate Limiting: Watch for an error message like “Status Code 429: Too Many Requests.” Implement a backoff algorithm to reduce request frequency and get past the error code. Or use specialized APIs that will do this for you.
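A minimal exponential-backoff loop looks like this. The `fetch` callable is a stand-in for your real request function, and the sleep function is injected so the logic can be exercised without real delays.

```python
# Minimal exponential-backoff sketch for handling "429 Too Many Requests".
# fetch() is a stand-in returning (status_code, body); sleep is injected
# so the retry logic is testable without real waiting.
import time

def fetch_with_backoff(fetch, retries: int = 5, base_delay: float = 1.0,
                       sleep=time.sleep):
    """Retry fetch(), doubling the wait after every 429 response."""
    for attempt in range(retries):
        status, body = fetch()
        if status != 429:
            return body
        sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    raise RuntimeError("still rate limited after all retries")

# Stub: rate limited twice, then the third attempt succeeds.
calls = iter([(429, ""), (429, ""), (200, "<html>ok</html>")])
waits = []
body = fetch_with_backoff(lambda: next(calls), sleep=waits.append)
```

Adding a small random jitter to each delay is a common refinement, since it keeps a fleet of workers from retrying in lockstep.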

CAPTCHAs: Use predefined selectors to check for CAPTCHAs on the page. Pass a list of selectors for your agent to check. Only trigger the solver when a CAPTCHA has been detected.
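A simple detection gate can be a list of marker strings checked against the raw HTML. The markers below are common container class names for well-known CAPTCHA widgets, but real pages vary, so maintain your own list per site and only invoke a solver when a marker actually appears.

```python
# Sketch of marker-based CAPTCHA detection. The class names below are
# common widget hints; real pages vary, so curate this list per site.
CAPTCHA_MARKERS = [
    "g-recaptcha",    # Google reCAPTCHA container class
    "h-captcha",      # hCaptcha container class
    "cf-turnstile",   # Cloudflare Turnstile container class
]

def captcha_detected(html: str) -> bool:
    """Return True if any known CAPTCHA marker appears in the page HTML."""
    lowered = html.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)

page = '<div class="g-recaptcha" data-sitekey="x"></div>'
blocked = captcha_detected(page)
```

Gating the solver this way keeps per-request cost down, since solving services typically bill per CAPTCHA submitted.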

Fingerprinting: Use tools that maintain realistic device fingerprints. Avoid the default settings of Puppeteer, Playwright and Selenium — these default settings often leave an obvious device fingerprint.
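One of the cheapest fingerprint fixes is replacing the default Python user agent, which is an instant giveaway, with browser-like headers. The header values below are illustrative; a real setup should match a current browser release and stay consistent with the rest of the fingerprint.

```python
# Sketch of sending browser-like headers instead of the default
# "Python-urllib" user agent. The values are illustrative examples.
import urllib.request

BROWSER_HEADERS = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/124.0.0.0 Safari/537.36"),
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml",
}

req = urllib.request.Request("https://example.com", headers=BROWSER_HEADERS)
# html = urllib.request.urlopen(req).read()  # sent with the headers above
```

Headers alone won’t beat full fingerprinting, which also inspects TLS, JavaScript and hardware signals, but mismatched or missing headers will fail even the simplest checks.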

JavaScript Rendering: Your agent should detect when a page is missing content. If the agent determines that JavaScript is required, it should retry using a headless browser.
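A cheap heuristic for that detection step: if the static HTML is nearly empty or contains an “enable JavaScript” notice, escalate to a headless browser. The hint phrases and length threshold below are illustrative assumptions, not universal rules.

```python
# Heuristic sketch for deciding when a page needs JavaScript rendering.
# The hint phrases and length threshold are illustrative; tune per site.
JS_HINTS = [
    "enable javascript",
    "requires javascript",
]

def needs_browser(html: str, min_text_length: int = 200) -> bool:
    """Return True when the static HTML likely lacks the real content."""
    lowered = html.lower()
    if any(hint in lowered for hint in JS_HINTS):
        return True
    # A tiny response body usually means content is injected client-side.
    return len(lowered) < min_text_length
```

Running this check after every static fetch lets the pipeline use fast, cheap parsers by default and reserve headless browsers for the pages that truly need them.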

Scaling your unblocking infrastructure

Distributed Browser Management

All software inherits the weaknesses of its host machine. It’s far safer to run your browsers in the cloud than on a local machine. In the past, websites were often hosted on local machines, and something as simple as a power outage could take the entire system down. The cloud gives your agents stability, redundancy and the ability to scale with demand.

Proxy Orchestration

You can rotate proxies manually in code, but this is a relic of the past. The tools we’ve discussed in this guide — Bright Data, Oxylabs and ZenRows — will handle proxy rotation for you. You don’t need to maintain a pool and diagnose the health of a proxy connection on your own. If a proxy fails, your provider simply routes you through a healthier proxy from their pool so you can perform extraction with peace of mind.

Pipeline Resilience

When something in your pipeline does fail, it needs to be retried. Once the maximum number of retries has been reached, the failure should be logged. Logging and retry logic are simple concepts that sadly get overlooked during development. While simple, they are the backbone of resiliency and diagnostics. Don’t ignore the little things.
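The retry-then-log pattern fits in a dozen lines. The flaky task below is a stub standing in for a real extraction step such as a page fetch or a parse.

```python
# Sketch of the retry-then-log pattern: retry a failing pipeline step a
# fixed number of times, then record the failure for later diagnosis.
import logging

logger = logging.getLogger("pipeline")

def run_with_retries(task, max_retries: int = 3):
    """Run task(); retry on failure and log if every attempt fails."""
    for attempt in range(1, max_retries + 1):
        try:
            return task()
        except Exception as exc:
            logger.warning("attempt %d/%d failed: %s", attempt, max_retries, exc)
    logger.error("task failed after %d retries", max_retries)
    return None

# Stub task: fails twice, then succeeds on the third attempt.
attempts = {"count": 0}
def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated timeout")
    return "data"

result = run_with_retries(flaky)
```

Returning `None` after exhaustion (rather than raising) is a design choice that keeps one dead URL from crashing a batch job; the error log entry preserves the evidence for later.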

Monitoring and maintaining your unblocking system

  • Log Everything: Logs are the root of any monitoring system. When a system fails, your log files should tell you why. You can even build dashboards on the backend using your logs. Documented failures are solvable ones.
  • Be Adaptable: Your AI agent needs to adapt to site changes by understanding the site contextually. AI agents give you the power of self-healing extraction; use it.

Paving the way for scalable AI data collection

When building scalable AI, the hardest part is the data — not the LLM. Your LLM can be easily swapped out for another one. The data pipeline to the LLM is where all the agent’s context comes from. With proper planning, the hurdles presented in this article can be solved with elegant and robust solutions.

With the right stack, your AI agent can navigate the modern web like a human.