AI-powered browsing isn’t just the future; it’s already here, with some caveats. Paid tools allow you to browse the web through Natural Language Processing (NLP) rather than pre-scripted automation. However, we’re still a long way from effectively delegating the job to Siri or Alexa.
Currently, AI browsing is limited mainly to developers and their AI Agents. This is changing very fast. In 2023, Large Language Models (LLMs) couldn’t even perform a Google search. Now, they’re being hooked into integrated searches. Within a few years, an LLM might manage your entire web presence while you instruct it via NLP.
In this guide, you’ll learn:
- Tools Used For AI-Powered Browsing
- How To Write Your Own AI-Powered Browser
- Current Use Cases For AI-Powered Browsing
- Industry Grade Tools For AI-Powered Browsing
Understanding the Tech Stack and Common Tools of AI Browsing
Headless Browsers
Headless browsers come in many shapes and sizes, but there are three that we commonly recognize in web development.
- Selenium: The original headless browsing framework. It’s a battle tested industry titan with over 20 years of history.
- Puppeteer: Based on the Chrome DevTools Protocol, Puppeteer gave the world a modern way to control Chromium based browsers from their programming environment.
- Playwright: Financially backed by Microsoft and built by the original Puppeteer team, Playwright brought Puppeteer’s functionality to all major browsers, then added the versatility of Selenium.
AI Agent Integration Tools
These are the tools that actually connect your headless browser to the AI agent. Your browser isn’t capable of thinking — yet. We use tools to connect the LLM to the browser.
- LangChain: Plug your LLM into any program using a simple API. LangChain is not only a tool, but an industry leading company when it comes to AI-driven applications.
- Standardized Browser Controls: With Playwright and Puppeteer, you can call methods like goto() and click() via JSON-RPC. This allows LLMs to feed formatted calls into a script during runtime.
- AutoGPT Style Loops: AutoGPT itself is built to give self-hosted models direct control of your applications. To be clear, we’re not going full AutoGPT here, but we’ll let one of OpenAI‘s models drive.
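To make the “standardized controls” idea concrete, here’s a minimal sketch of the validation step such a bridge needs: parse the model’s raw reply as JSON and reject anything outside an allow-list. The parse_command name and the three-action allow-list are my own illustration, not part of any library:

```python
import json

# actions we allow the model to request (illustrative allow-list)
ALLOWED = {"goto", "click", "fill"}

def parse_command(raw: str):
    """Parse the model's reply into a validated command dict, or None."""
    try:
        cmd = json.loads(raw)
    except json.JSONDecodeError:
        return None
    # reject anything that isn't a known browser action
    if not isinstance(cmd, dict) or cmd.get("action") not in ALLOWED:
        return None
    return cmd

print(parse_command('{"action": "click", "selector": "#submit"}'))
print(parse_command("Sure! I will click the button."))  # chatty reply → None
```

Anything that survives this check can be handed straight to the browser; anything else gets discarded before it touches the page.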
AI-Driven UI/UX Testing
This is where AI-powered browsing really shines today. Larger models are intelligent enough to adapt in real time to complex sites.
- Self-Healing Locators: When a selector isn’t found, an intelligent model can adapt and infer a solution in real time.
- Visual Regression: When combined with screenshot abilities in a headless browser, newer models can view the page — not just the HTML. This process is expensive, but highly useful.
- AI In Test Maintenance: As mentioned, a model can adapt to missing selectors. That same model can make recommendations to fix broken selectors.
This emerging paradigm isn’t replacing developers; it’s making our jobs easier. If you’ve ever written unit tests for a React app, you’ll understand the pain that these tools can shield you from.
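The self-healing idea above can be sketched in a few lines: try the scripted selector, and if it fails, hand the page’s HTML to the model and retry with whatever it suggests. Everything here (StubPage, ask_model) is a stand-in for a real browser driver and a real LLM call:

```python
class StubPage:
    """Tiny stand-in for a real page object (illustrative only)."""
    def __init__(self, html, known_selectors):
        self.html = html
        self.known = set(known_selectors)
    def locator(self, selector):
        if selector not in self.known:
            raise LookupError(selector)
        return selector  # a real driver would return an element handle

def find_with_healing(page, selector, ask_model):
    """Try a selector; on failure, ask the model to infer a replacement."""
    try:
        return page.locator(selector)
    except LookupError:
        healed = ask_model(f"Selector {selector!r} failed on: {page.html}")
        return page.locator(healed)

# a canned "model" that knows the button was renamed
page = StubPage("<button id='submit-v2'>Go</button>", {"#submit-v2"})
print(find_with_healing(page, "#submit", lambda prompt: "#submit-v2"))
```

Swap the canned lambda for a real model call and the stub for a real driver, and you have the skeleton of a self-healing locator.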
AI Browsing In The Real World
Before we dive into theory and deconstruction, I’d like to give you the opportunity to experience these things for yourself. They are both mindblowing and mildly infuriating at the same time.
Make an OpenAI developer account if you don’t have one already and test these scripts out yourself. With GPT 4.1 Nano, you can run these demos for just a few pennies — or less.
In both of these examples, you can swap out the model for any other OpenAI model and it will make a difference; we’ll discuss this afterward.
GPT 4.1 Nano With LangChain
This first script is very simple. We just want the model to read the heading on the page. We create a function to extract the <h1> text from the page. We then use the OpenAI API and LangChain to let the model control the function.
from dotenv import load_dotenv
from langchain.agents import initialize_agent, Tool
from langchain.agents.agent_types import AgentType
from langchain_openai import ChatOpenAI
from playwright.sync_api import sync_playwright

# load your API keys from a .env file
load_dotenv()

# extract the h1 from the page
def visit_and_extract_heading(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        heading = page.locator("h1").first.text_content()
        browser.close()
    return f"Page heading is: {heading}"

# define the langchain tool
heading_tool = Tool(
    name="ExtractPageHeading",
    func=visit_and_extract_heading,
    description="Visits a webpage and extracts the H1 heading"
)

# initialize the openai model--use any model you want
llm = ChatOpenAI(
    temperature=0,
    model="gpt-4.1-nano"
)

# combine the tool and llm to create an agent
agent = initialize_agent(
    tools=[heading_tool],
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)

# invoke the agent to run the task
result = agent.invoke({"input": "Go to https://en.wikipedia.org/wiki/OpenAI and tell me the first heading."})
print("\nResult:\n", result)

This script took over 20 seconds just to visit the page and read the header. Don’t let that fool you: the delay comes from browser startup, not the LLM. If you define functions for clicking, scrolling and text input, you can run a fully automated, intelligent browser. You’re writing code and giving an LLM the power to call that code on its own, with discretion. When scaled, this will be the backbone of the future.
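One practical note before adding those clicking and scrolling tools: visit_and_extract_heading() launches and closes a fresh browser on every call. For multiple tools to cooperate, you’ll want one persistent page shared between them. A sketch of that shape, where the BrowserSession class and its method names are my own, and the stub below stands in for a live Playwright page:

```python
class BrowserSession:
    """Holds one live page so every tool call reuses the same browser."""
    def __init__(self, page):
        self.page = page  # in real use, a Playwright page object
    def go(self, url):
        self.page.goto(url)
        return f"Visited {url}"
    def click(self, selector):
        self.page.click(selector)
        return f"Clicked {selector}"

class RecordingPage:
    """Stand-in for a Playwright page (illustrative only)."""
    def __init__(self):
        self.actions = []
    def goto(self, url):
        self.actions.append(("goto", url))
    def click(self, selector):
        self.actions.append(("click", selector))

session = BrowserSession(RecordingPage())
print(session.go("https://example.com"))
print(session.click("#more"))
```

Each method can then be wrapped in its own LangChain Tool, so the agent picks actions freely while page state persists between calls.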
GPT 4.1 Nano Custom AutoGPT
This next example, while still primitive, is a bit more intricate. We create a pre-defined set of actions, then an agent that can execute them. The script pipes HTML into the LLM, and the agent responds with the next action in the workflow as a JSON command. We’re literally building a bridge from the chat to the browser’s instruction set.
import os
import openai
import json
from dotenv import load_dotenv
from playwright.sync_api import sync_playwright

# load .env file
load_dotenv()

client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# commands the llm is allowed to run
ALLOWED_COMMANDS = {"goto", "fill", "click"}

# tells the gpt what to do next
def get_next_action(html, step, last_instruction):
    prompt = f"""
You are an AI agent controlling a Playwright browser. You're on step {step} of a 3-step automation sequence.
You have already executed this action:
{json.dumps(last_instruction) if last_instruction else "None"}
Now, based on this HTML content:
---
{html[:2000]}
---
Respond ONLY with a valid JSON instruction, using one of:
- "goto" with a "url"
- "fill" with "selector" and "value"
- "click" with "selector"
NEVER repeat an identical action to the last one unless the page has clearly changed. Do not refresh the same page. Do not explain. Just reply with JSON like:
{{"action": "click", "selector": "#submit"}}
"""
    # respond with the next action in the workflow
    response = client.chat.completions.create(
        model="gpt-4.1-nano",
        messages=[
            {"role": "system", "content": "You are a smart, obedient browser automation AI."},
            {"role": "user", "content": prompt}
        ]
    )
    content = response.choices[0].message.content
    # return the output from the model
    try:
        return json.loads(content)
    except Exception:
        print("Failed to parse GPT output:", content)
        return None

# execute a command safely
def execute_instruction(page, instruction):
    action = instruction.get("action")
    # don't allow undefined commands
    if action not in ALLOWED_COMMANDS:
        print(f"Unknown action: {action}")
        return
    try:
        if action == "goto":
            # avoid pinging the same endpoint or the site root ("/") we visited first
            if instruction["url"] == page.url or instruction["url"] == "/":
                print("Duplicate GOTO detected, skipping...")
                return
            page.goto(instruction["url"])
        elif action == "fill" and "login" not in instruction["selector"]:
            page.fill(instruction["selector"], instruction["value"])
        elif action == "click":
            page.click(instruction["selector"])
    except Exception as e:
        print(f"Error running action {instruction}: {e}")

# runtime for the actual agent
def run_agent(start_url, steps=3):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto(start_url)
        last_instruction = None
        for i in range(steps):
            print(f"\nStep {i + 1}")
            html = page.content()
            instruction = get_next_action(html, step=i+1, last_instruction=last_instruction)
            if not instruction:
                print("No valid instruction returned.")
                break
            print("GPT Instruction:", instruction)
            execute_instruction(page, instruction)
            last_instruction = instruction
        browser.close()

if __name__ == "__main__":
    run_agent("https://www.example.com")
If you ran this code, the browser should’ve opened in headful mode — this was by design so you can watch the actions being performed. The model connected to the browser, clicked a link, navigated to another page and clicked another link. All of this completed in about 12 seconds.

The speed increase here is a bit misleading. Our script moved faster because we didn’t have the overhead of connecting to LangChain. If you’d like to test the limits of this method, add more complex instructions — even larger models quickly get overwhelmed.
How AI Agents Operate in The Browser
Every story requires “who, what, where, when and why”. By this point we’ve defined most of that. Here are the variables we’ve covered thus far.
- Who: You and the LLM agent that you create.
- What: An automation loop containing the agent’s actions.
- Where: Wikipedia and Example Domain — simple testing grounds to demonstrate concepts
- When: Just a moment ago with hands-on demos
- Why: You’re about to understand
1. Configuring Your Setup
Let’s take a look at the configuration from our code examples. In each script, we create an agent and give it access to the browser.
Here’s how we configured that LangChain example. First, we define a tool. Then, we instantiate the LLM chat. We combine these to create an agent capable of calling functions from the rest of the code.
# define the langchain tool
heading_tool = Tool(
    name="ExtractPageHeading",
    func=visit_and_extract_heading,
    description="Visits a webpage and extracts the H1 heading"
)

# initialize the openai model--use any model you want
llm = ChatOpenAI(
    temperature=0,
    model="gpt-4.1-nano"
)

# combine the tool and llm to create an agent
agent = initialize_agent(
    tools=[heading_tool],
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)
Now, here’s the simulated AutoGPT configuration. While the code block here is bigger, it’s actually way simpler than what’s going on with LangChain. We create a chat — much like the chats you probably use in the ChatGPT web app. Instead of actually talking to the machine, we give it some basic context and extract a JSON browser action from the response.
client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# commands the llm is allowed to run
ALLOWED_COMMANDS = {"goto", "fill", "click"}

# tells the gpt what to do next
def get_next_action(html, step, last_instruction):
    prompt = f"""
You are an AI agent controlling a Playwright browser. You're on step {step} of a 3-step automation sequence.
You have already executed this action:
{json.dumps(last_instruction) if last_instruction else "None"}
Now, based on this HTML content:
---
{html[:2000]}
---
Respond ONLY with a valid JSON instruction, using one of:
- "goto" with a "url"
- "fill" with "selector" and "value"
- "click" with "selector"
NEVER repeat an identical action to the last one unless the page has clearly changed. Do not refresh the same page. Do not explain. Just reply with JSON like:
{{"action": "click", "selector": "#submit"}}
"""
    # respond with the next action in the workflow
    response = client.chat.completions.create(
        model="gpt-4.1-nano",
        messages=[
            {"role": "system", "content": "You are a smart, obedient browser automation AI."},
            {"role": "user", "content": prompt}
        ]
    )
    content = response.choices[0].message.content
    # return the output from the model
    try:
        return json.loads(content)
    except Exception:
        print("Failed to parse GPT output:", content)
        return None
2. Executing Browser Actions
In both examples, the actual execution snippets are pretty small.
Here’s our extraction function from the LangChain example. It’s pretty simple: find the <h1> and extract its text. This portion isn’t the AI; it’s what the AI can execute on its own. Think of it like giving the model a button that extracts the header.
# extract the h1 from the page
def visit_and_extract_heading(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        heading = page.locator("h1").first.text_content()
        browser.close()
    return f"Page heading is: {heading}"
Here’s where everything actually happens. After we’ve created the agent, we tell it to run its task — visit_and_extract_heading() on the target site.
#invoke the agent to run the task
result = agent.invoke({"input": "Go to https://en.wikipedia.org/wiki/OpenAI and tell me the first heading."})
print("\nResult:\n", result)
With our custom AutoGPT, there’s more boilerplate. The page argument is the page that’s open in Playwright. The instruction is the one we extracted from the chat earlier. We take the GPT’s structured JSON and literally pass it into Playwright as a browser command. If you wanted, you could pass these commands in yourself and cut the LLM out entirely.
# execute a command safely
def execute_instruction(page, instruction):
    action = instruction.get("action")
    # don't allow undefined commands
    if action not in ALLOWED_COMMANDS:
        print(f"Unknown action: {action}")
        return
    try:
        if action == "goto":
            # avoid pinging the same endpoint or the site root ("/") we visited first
            if instruction["url"] == page.url or instruction["url"] == "/":
                print("Duplicate GOTO detected, skipping...")
                return
            page.goto(instruction["url"])
        elif action == "fill" and "login" not in instruction["selector"]:
            page.fill(instruction["selector"], instruction["value"])
        elif action == "click":
            page.click(instruction["selector"])
    except Exception as e:
        print(f"Error running action {instruction}: {e}")
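As noted above, you can cut the LLM out entirely and feed execute_instruction() hand-written JSON. This sketch repeats a simplified version of the dispatcher against a stub page so it runs without a browser; the stub is purely illustrative:

```python
ALLOWED_COMMANDS = {"goto", "fill", "click"}

def execute_instruction(page, instruction):
    # simplified copy of the dispatcher so this sketch runs standalone
    action = instruction.get("action")
    if action not in ALLOWED_COMMANDS:
        return
    if action == "goto":
        page.goto(instruction["url"])
    elif action == "fill":
        page.fill(instruction["selector"], instruction["value"])
    elif action == "click":
        page.click(instruction["selector"])

class StubPage:
    """Records calls instead of driving a real browser (illustrative)."""
    def __init__(self):
        self.calls = []
    def goto(self, url):
        self.calls.append(("goto", url))
    def fill(self, selector, value):
        self.calls.append(("fill", selector, value))
    def click(self, selector):
        self.calls.append(("click", selector))

page = StubPage()
for step in [{"action": "goto", "url": "https://www.example.com"},
             {"action": "click", "selector": "a"}]:
    execute_instruction(page, step)
print(page.calls)
```

The LLM’s only job in the full script is to author those dicts; the execution path is plain, deterministic Playwright.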
Here’s the actual agent runtime. It calls all the other code we defined in the script.
# runtime for the actual agent
def run_agent(start_url, steps=3):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto(start_url)
        last_instruction = None
        for i in range(steps):
            print(f"\nStep {i + 1}")
            html = page.content()
            instruction = get_next_action(html, step=i+1, last_instruction=last_instruction)
            if not instruction:
                print("No valid instruction returned.")
                break
            print("GPT Instruction:", instruction)
            execute_instruction(page, instruction)
            last_instruction = instruction
        browser.close()
3. Adjusting Scripts Based on Dynamic Content
Our LangChain example doesn’t adapt in real-time, but our AutoGPT script does give a bit of room for this. If you try running this script on a different domain like Reddit, the script runs into real problems. It tries to perform browser actions, but quickly breaks down.

This wasn’t consistent, but GPT 4.1 (the full model) successfully executed all three instructions; 4.1 Nano didn’t manage it in almost 50 runs. It clicked the login button and filled information into the login modal. This is real-time adaptation to the page.

The screenshot below is even more eye-opening. The model attempted to click the login button and failed. When it failed, GPT 4.1 ran goto with Reddit’s actual login endpoint. This is contingency planning in real time.

Self-healing scripts are absolutely possible, but only when a model is capable of reasoning across steps, recovering from failure and adjusting to unexpected outcomes.
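That recovery behavior can be framed as a loop: attempt an action, and on failure feed the error back to the model for a new plan. A stripped-down sketch of the loop; run_with_healing and the canned fallback are my own illustration, not the article’s exact code:

```python
def run_with_healing(first_attempt, propose_fix, max_attempts=2):
    """Attempt an action; on failure, ask for a revised action and retry."""
    action = first_attempt
    for _ in range(max_attempts):
        try:
            return action()
        except Exception as exc:
            # a real agent would send the error plus page HTML to the LLM here
            action = propose_fix(exc)
    raise RuntimeError("no healing action succeeded")

# simulate: the click fails, so the "model" falls back to a direct goto
def click_login():
    raise ValueError("selector '#login' not found")

fallback = lambda exc: (lambda: "goto https://www.reddit.com/login")
print(run_with_healing(click_login, fallback))
```

The structure mirrors what we saw with Reddit: the failed click becomes context, and the model’s next suggestion becomes the retried action.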
Use Cases and Challenges in The Real World
Why to Use AI-Powered Browsers
- Self-Healing Scripts: As you’ve already seen, the model can adapt to the page in real-time. This is much stronger than a hardcoded selector.
- No DOM-Specific Scripting: The model handles the actual DOM code. You tell it what to do in natural language.
- Generalized Logic: Since you’re instructing it via natural language, the model can use the same context across multiple sites, regardless of selectors — we saw this in action with Reddit.
- Adaptive and Improvisational QA: You can write an agent similar to the ones we created here, but enhance the prompt: “You are a QA tester. Make sure the login button works and that content is loading properly.” The model doesn’t just follow a script; it adapts to the site as it changes.
Why Not To Use AI-Powered Browsers
- No Guarantees: In our testing, outputs weren’t always consistent. GPT 4.1 was really impressive, but even 4.1 wasn’t able to run all three steps every time.
- High Cost: Running larger models is expensive. Given its success rate, GPT 4.1 wasn’t bad value, but GPT 4 was a money pit in testing. If you’re scraping or running end-to-end tests, these costs add up fast.

- Limited Debugging: It’s difficult to tell why a model chooses a bad selector (Nano did this repeatedly). The only real solution is to throw more money at it by upgrading the model or retrying the actions.
How To Choose The Right AI Browser Automation Tool
AI-Driven Element Recognition
If you’re using an LLM to browse the page, it needs to understand the selectors on the page — and it can’t make them up the way that Nano did. When you create your agent, it needs to know what’s going on with the web page.
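One way to ground the model is to hand it only selectors that actually exist on the page, rather than letting it guess. A quick sketch using a regex over the HTML; a real implementation would use an HTML parser, and summarize_selectors is my own name:

```python
import re

def summarize_selectors(html: str, limit: int = 10):
    """Collect real id attributes so the prompt offers only valid selectors."""
    ids = re.findall(r'id="([^"]+)"', html)
    return ["#" + i for i in ids[:limit]]

html = '<button id="login">Log in</button><input id="search" type="text">'
print(summarize_selectors(html))  # → ['#login', '#search']
```

Injecting a list like this into the prompt gives the model a menu of valid targets instead of an invitation to invent them.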
Self Healing Scripts
As you saw with the 4.1 full model, it ran to Reddit’s login page when the login button failed. This is self-healing logic. A selector failed and during runtime, the model tried to fix it. If your agent can’t adapt, it won’t be able to handle real workflows.
Automation for Scraping and Testing
In production, your setup needs to handle your use case — likely scraping or testing. This requires significantly more boilerplate than you saw in our examples here. With LangChain or the AutoGPT example, you need to add code for every possible interaction on your site.
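Scaling past three actions usually means a handler registry rather than a growing if/elif chain: each new page interaction registers itself under a command name. The action names and the stubbed text_of()/mouse_wheel() methods below are hypothetical, so the sketch runs without a browser:

```python
HANDLERS = {}

def handler(name):
    """Register a function as the handler for one command name."""
    def register(fn):
        HANDLERS[name] = fn
        return fn
    return register

@handler("scroll")
def do_scroll(page, cmd):
    page.mouse_wheel(0, cmd.get("pixels", 500))

@handler("extract_text")
def do_extract(page, cmd):
    return page.text_of(cmd["selector"])

def execute(page, cmd):
    # look up the handler, or ignore unknown actions entirely
    fn = HANDLERS.get(cmd.get("action"))
    return fn(page, cmd) if fn else None

class StubPage:
    """Stand-in page so the registry can be exercised without a browser."""
    def mouse_wheel(self, dx, dy):
        self.scrolled = dy
    def text_of(self, selector):
        return f"text of {selector}"

page = StubPage()
execute(page, {"action": "scroll", "pixels": 300})
print(page.scrolled)
print(execute(page, {"action": "extract_text", "selector": "h1"}))
```

Adding a new capability then means writing one function and listing its name in the prompt, which keeps the boilerplate linear as your site’s interactions multiply.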
Industry Grade Tools For No-Code Browser Automation
axiom.ai

Axiom allows you to automate interactions on any site, without code. You can automate through spreadsheets and interact with the page. With Zapier, you can connect it to almost any API imaginable. If you connect it straight to ChatGPT, your entire browsing process can be controlled by your LLM assistant.
Testim

Testim is built with end-to-end testing in mind. Remember that React app I mentioned earlier? Testim can help with that. Testim will generate tests based on the behavior of your site or app.
Ghost Inspector

Like Testim, Ghost Inspector will create automated tests for you. Build your site and tell Ghost Inspector to write the tests. This is great for making sure things still work after deploying changes.
AI-Powered Browsing: The Future Is Here
AI browser automation is no longer theoretical. You can plug an LLM into a browser, give it a task and some context and it can get the job done. LLMs are already capable of adapting to site changes that hardcoded selectors can’t.
This power comes with its own set of tradeoffs though. Small models can’t reliably operate a browser. Large models are incredibly expensive at scale. In time, this will change, but we’re just not there yet.
AI-powered browsing is in its infancy, but it’s happening before our very eyes. If you’re looking to automate tests or complex site interactions, these tools are ready for you today. In time, costs will come down, and AI-powered browsing may well replace standard browsers in the coming decades.