
JavaScript web scraping: Techniques, tools and best practices

Master JavaScript-based scraping with headless browsers, DOM parsing, and automation frameworks. Explore tools, anti-bot tactics, and clean data workflows.

JavaScript is a powerful, lightweight language for web development. In this guide, you’ll learn the basics of web scraping in JavaScript — what it is, how it works and which tools to use.

  • HTTP Requests: Understand what HTTP requests are and how to make a GET request.
  • Parsing: Extract data from HTML pages using Playwright for dynamic content or Cheerio for static pages.
  • Proxies and CAPTCHA Solving: Integrate with Web Unlocker for proxy management and CAPTCHA solving.

What is web scraping?

To scrape the web, you programmatically access a website and extract its data from the HTML on the page. Regardless of your target site, the goal is the same: automate the overhead and capture the data that matters.

Depending on your target site and its individual complexity, you might use a variety of different tools: Headless browsers for dynamic environments, static parsers for simple extraction and proxy services to maintain reliable access and view localized content. Let’s break this down so you know what the tools are and when to use them.

Use cases for JavaScript web scraping in AI training

When training an AI model, almost all of your data will come from the web. If you can harvest it yourself, you get granular control over your training data and you can streamline your pipeline.

If your model requires any of the following training data, web scraping can help you.

  • eCommerce Data: Product data and pricing
  • Market and Customer Sentiment: Reviews and general opinions
  • Unstructured Data: Images, text and video.

How does web scraping work?

Web Scraping Diagram

Regardless of your tooling, you follow the same basic process. You’re writing a program to automate the following steps.

  1. Request the Page
  2. Receive the Response
  3. Process the Response
  4. Extract the Data
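To make these steps concrete, here’s a minimal sketch of the pipeline with the network request stubbed out, so the flow is visible without a connection. The fakeFetchPage and extractTitle functions are illustrative names, not part of any library.

```javascript
//1. request the page / 2. receive the response -- stubbed with a
//hardcoded response so the sketch runs without a network connection
const fakeFetchPage = async (url) => {
    return "<html><body><h1>Quotes to Scrape</h1></body></html>";
};

//3. process the response / 4. extract the data -- a naive regex is
//enough for this sketch; real scrapers use a parser like Cheerio
const extractTitle = (html) => {
    const match = html.match(/<h1>(.*?)<\/h1>/);
    return match ? match[1] : null;
};

(async () => {
    const html = await fakeFetchPage("https://quotes.toscrape.com");
    console.log(extractTitle(html)); //prints "Quotes to Scrape"
})();
```

In a real scraper, the stubbed fetch is replaced by an HTTP client or a headless browser, which we cover below.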

Why JavaScript?

Most web scraping tutorials are written in Python, and Python is a great language. It’s readable, easy to learn and doesn’t require semicolons. In practice, though, Python adds an extra layer between your code and the web, and that friction isn’t limited to scraping.

JavaScript is the native language of the web. It’s used to render pages, load content and handle browser actions (like clicking a button). Your browser likely ran JavaScript just to get you here. When using Python, even with a headless browser (we’ll talk about these more soon enough), you’re actually controlling a JavaScript environment through your Python interpreter.

JavaScript gives us all of the following features you’d expect from a production scraping environment.

  • Async Programming: NodeJS gives us native support for async operations. No need to bolt asyncio onto your script and duct-tape it together.
  • DOM Operations: When a page changes before your eyes, this is JavaScript in action. Controlling this process through Python adds unneeded overhead to your machine.
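For example, native Promises let you fire several page requests concurrently with no extra libraries. In this sketch the network calls are simulated with setTimeout; scrapeOne is a hypothetical stand-in for a real fetch.

```javascript
//simulate fetching one page -- setTimeout stands in for network latency
const scrapeOne = (url) =>
    new Promise((resolve) => setTimeout(() => resolve(`scraped: ${url}`), 50));

const urls = [
    "https://quotes.toscrape.com/page/1/",
    "https://quotes.toscrape.com/page/2/",
    "https://quotes.toscrape.com/page/3/"
];

//Promise.all runs the "requests" concurrently instead of one at a time
Promise.all(urls.map(scrapeOne)).then((results) => {
    console.log(results.length); //prints 3
});
```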

Tech and tools

Let’s get a little more hands-on. If you don’t have it already, you need NodeJS, which you can install from the official downloads page. Once you’ve got it installed, we’ll set up a new project.

Create a new folder, either with your file explorer or using the command line.

#make the new folder
mkdir js-web-scraping
#cd into the folder
cd js-web-scraping

Now, let’s create a new JavaScript project inside the folder. The following command does exactly that. You’ll see a package.json appear inside your folder after execution. You can learn more about the different npm init options in the npm documentation.

#start a new project
npm init -y

Our end goal is to extract data. To store it, we use the fs module, which gives us access to the filesystem. fs is built into NodeJS, so there’s nothing to install.

Headless browsers

Headless browsers allow you to control a real browser from your programming environment. They enable your program to scroll pages, click buttons and fill input boxes — just like a real user.

  • Selenium: The oldest of the bunch. Use the Webdriver API to control the browser. Selenium is battle-tested, but lacks some of the lightweight flexibility of newer packages.
  • Puppeteer: The original headless browser based on the Chrome DevTools Protocol. If you like a minimalist environment and you’re only using a Chrome-based browser, Puppeteer is great.
  • Playwright: The shiniest and newest of the bunch. Playwright is actually a polished expansion of the original Puppeteer browser — built mostly by the same team with more features and support for multiple browser engines.

Let’s go through and perform a crawl using Playwright. We’re going to extract everything we need from Quotes to Scrape, a website designed specifically for scraping practice. This gives you a learning environment you can reproduce.

Scraping with Playwright

To get started, you need to install Playwright. Run the command below to do so.

npm install playwright

Now, you need to make sure you’ve got browsers it can access. This next npx command will help you with that.

npx playwright install

Here’s the actual code.

//import statements
const playwright = require("playwright");
const fs = require("fs");

const framework = "playwright";

(async () => {
    //array to hold the results
    const extractedData = [];
    //open a browser
    const browser = await playwright.chromium.launch({headless: false});

    //open a new page
    const page = await browser.newPage();
    await page.goto("https://quotes.toscrape.com");


    //find the login button and click on it
    let loginButton = await page.$("a[href='/login']");
    await loginButton.click();

    //find the username and password boxes
    const usernameBox = await page.$("input[id='username']");
    const passwordBox = await page.$("input[id='password']");

    //fill out the boxes
    await usernameBox.fill("my-username");
    await passwordBox.fill("my-password");

    //find and click the login button to submit the inputs
    loginButton = await page.$("input[type='submit']");
    await loginButton.click();

    //find the next button on the page
    let nextButton = await page.$("li[class='next'] > a");
    while (nextButton) {
        //find all the quote divs
        const quoteCards = await page.$$("div[class='quote']");

        for (const quoteCard of quoteCards) {
            //find the actual quote
            const quoteText = await quoteCard.$("span[itemprop='text']");
            //find the author
            const author = await quoteCard.$("small[itemprop='author']");

            //build a quote object from the text
            const quote = {
                text: await quoteText.innerText(),
                author: await author.innerText()
            };

            //add it to the extracted data
            extractedData.push(quote);
        }

        //click on the next button if it's there, otherwise exit the loop
        nextButton = await page.$("li[class='next'] > a");
        if (nextButton) {
            await nextButton.click();
            //wait for the next page to load before parsing it
            await page.waitForLoadState("domcontentloaded");
        }
    }

    //finished scraping, close the browser
    await browser.close();

    //save the extracted data
    fs.writeFileSync(`${framework}-scraped-quotes.json`,
        JSON.stringify(extractedData, null, 4)
    );
})();

In this process, we run the following steps.

  • Launch a Chromium browser using Playwright
  • Navigate to the Quotes to Scrape homepage
  • Click the login button
  • Enter our login information
  • Submit the login (this always succeeds because it’s a practice site)
  • Crawl the entire site and extract every quote along with its author
  • Write our extracted data to a JSON file

Real world performance

Here is the actual output when running the file from VSCode. As you can see, we crawled the entire site in 4.115 seconds.

Playwright Output

The full JSON file is 402 lines long. Here’s some sample output so you can see what we’re extracting.

[
    {
        "text": "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”",
        "author": "Albert Einstein"
    },
    {
        "text": "“It is our choices, Harry, that show what we truly are, far more than our abilities.”",
        "author": "J.K. Rowling"
    },
    {
        "text": "“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”",
        "author": "Albert Einstein"
    },
    {
        "text": "“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”",
        "author": "Jane Austen"
    },
    {
        "text": "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
        "author": "Marilyn Monroe"
    },
    {
        "text": "“Try not to become a man of success. Rather become a man of value.”",
        "author": "Albert Einstein"
    },

Static HTML parsers

When using static parsers, we combine them with HTTP requests (the protocol underlying most of the web). These setups are simple and barebones by design: we don’t want all the bells and whistles of a headless browser, and we don’t want extra overhead.

For most scraping jobs, these tools are more than enough. Use this setup if you don’t need a full browser, page interactions, or dynamic content; just get the page and strip the data.

Scraping with Axios and Cheerio

To get started, you need an HTTP client. We’ll use Axios to make our HTTP requests simple and straightforward.

npm install axios

Now we need an HTML parser; we’ll use Cheerio. JSDOM offers full DOM simulation for complex parsing, but it’s slower and carries more overhead: not quite a static parser, yet not a full browser either.

npm install cheerio

The example below runs a simplified version of the scraper we made using Playwright. We don’t log in or click buttons (a static parser can’t). We just want to get the page data.

//imports
const axios = require("axios");
const cheerio = require("cheerio");
const fs = require("fs");

const framework = "static-parser";

(async () => {
    const extractedData = [];

    const baseUrl = "https://quotes.toscrape.com";
    let currentUrl = baseUrl;

    while (currentUrl) {
        //get the HTTP response
        const response = await axios.get(currentUrl);
        //load its "data" object into Cheerio
        const $ = cheerio.load(response.data);

        //iterate through the quotes on each page
        $("div[class='quote']").each((index, quoteCard) => {
            const quoteText = $(quoteCard).find("span[itemprop='text']").text().trim();
            const author = $(quoteCard).find("small[itemprop='author']").text().trim();
            const quote = {
                text: quoteText,
                author: author
            };
            extractedData.push(quote);
        });

        //find the link to the next page
        const nextLink = $("li.next > a").attr("href");

        //if it doesn't exist, convert our currentUrl to null and exit the loop
        currentUrl = nextLink ? `${baseUrl}${nextLink}` : null;
    }

    //write the results to a json file
    fs.writeFileSync(
        `${framework}-scraped-quotes.json`,
        JSON.stringify(extractedData, null, 2)
    );
})();

The script above executes the following steps.

  • Send an HTTP request for the webpage
  • Load response.data into Cheerio
  • Parse each quote to extract both its text and author
  • Find the link to the next page
  • If a link doesn’t exist, exit the loop
  • Write our results to a JSON file

Higher speed and lower overhead

Minimal setups like this can lead to massive performance gains. With no real browser, we executed the same scrape in 1.325 seconds, roughly a 68% reduction in runtime compared to the 4.115 seconds with Playwright, achieved by simply doing less.

Static Parser Output

Here’s a sample snippet from the extracted data. It should look pretty familiar. This file is also 402 lines long.

[
  {
    "text": "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”",
    "author": "Albert Einstein"
  },
  {
    "text": "“It is our choices, Harry, that show what we truly are, far more than our abilities.”",
    "author": "J.K. Rowling"
  },
  {
    "text": "“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”",
    "author": "Albert Einstein"
  },
  {
    "text": "“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”",
    "author": "Jane Austen"
  },
  {
    "text": "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
    "author": "Marilyn Monroe"
  },
  {
    "text": "“Try not to become a man of success. Rather become a man of value.”",
    "author": "Albert Einstein"
  },

Note: Static parsers are excellent for what they are, but they come with some inherent limitations. You cannot render JavaScript or interact with the page when using a static parser. Your program is basically just reading and parsing a really long text file.

Proxies and CAPTCHA handling

When using the tools above in the real world, you’re going to run into roadblocks. Maybe not every time, but it will happen.

You’re likely to run into the following issues in the wild:

  1. IP blocks
  2. Rate Limiting
  3. CAPTCHAs

Proxies distribute your traffic across multiple IP addresses, which reduces the likelihood of triggering rate limits or access controls. Some advanced platforms also offer CAPTCHA handling to maintain uninterrupted access to public web data.

To use a proxy with Axios, install https-proxy-agent.

npm install https-proxy-agent

In your regular browser, take a look at NopeCHA’s reCAPTCHA demo page. As you can see in the image below, there are three CAPTCHAs that need to be solved.

NopeCHA: reCAPTCHA Demo Page

Bright Data’s Web Unlocker API

Web Unlocker API ships with both built-in proxy rotation and a CAPTCHA solver by default.

Take a look at the code below.

//ignore TLS errors
process.env.NODE_TLS_REJECT_UNAUTHORIZED = '0';

//imports
const axios = require("axios");
const { HttpsProxyAgent } = require("https-proxy-agent");
const fs = require("fs");
const cheerio = require("cheerio");

const framework = "web-unlocker";
const url = "https://nopecha.com/captcha/recaptcha";

//proxy url--replace username, zone_name and password with your own credentials
const proxy = "http://brd-customer-<YOUR_USERNAME>-zone-<YOUR-ZONE-NAME>:<YOUR-PASSWORD>@brd.superproxy.io:33335";

(async () => {

    //we use a try-catch block here to handle any errors
    try {

        //create a new proxy agent
        const agent = new HttpsProxyAgent(proxy);

        //pass it in with our get request
        const response = await axios.get(url, {
            httpsAgent: agent
        });

        //write the html page to a file for later viewing
        const html = response.data;
        fs.writeFileSync(`${framework}-result.html`, html);

        //load it into cheerio
        const $ = cheerio.load(html);

        //check for a captcha
        const hasCaptcha = $("div.g-recaptcha").length > 0
            || html.includes("data-sitekey")
            || html.includes("Please solve the CAPTCHA");

        //log the result
        if (hasCaptcha) {
            console.log("CAPTCHA still present: page was not unlocked.");
        } else {
            console.log("CAPTCHA bypassed: page was successfully unlocked.");
        }

    } catch (error) {
        console.error("Error:", error.message);
    }
})();

This code does the following.

  • Turn off TLS certificate checks (common when routing traffic through a proxy, but only do this with endpoints you trust)
  • Request the page from NopeCHA
  • Check for a CAPTCHA (the page gives three by default)
  • Save the raw HTML page to a file for human review
  • Log whether a CAPTCHA was found on the page

Performance impacts

When using proxies and CAPTCHA solvers, your performance is going to take a hit. You’re trading some speed for access to content you couldn’t otherwise reach.

Fetching this page, solving the CAPTCHAs and writing the file took almost two seconds. That may not seem like much, but we just crawled 10 Quotes pages in 1.325 seconds (0.1325 seconds per page). This is a drastic difference at scale, but often a necessary evil to get your desired data.

Web Unlocker CAPTCHA Output

Here’s an actual shot of the HTML file from my browser. As you can see, the CAPTCHAs are gone. They’ve been solved.

Empty Page: The CAPTCHAs have been solved

How this process works

When you scrape the web, you’re following a set pattern. You start with an HTTP request. Then, you parse the response — and if needed — you interact with the page.

1. Making an HTTP request

If you’ve been following along, you’ve already made several GET requests. When you visit a site with your browser, you perform a GET request behind the scenes. When you used Axios, you also made a GET.

With Playwright, we used page.goto().

await page.goto("https://quotes.toscrape.com");

With Axios, it’s less abstract. We use axios.get().

//quotes example
const response = await axios.get(currentUrl);
//captcha example
const response = await axios.get(url, {
    httpsAgent: agent
});
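GET requests often carry query parameters. Rather than concatenating strings by hand, NodeJS’s built-in URL class encodes them safely. The /search endpoint below is hypothetical, purely for illustration.

```javascript
//build a GET request URL with safely encoded query parameters
const buildUrl = (base, params) => {
    const url = new URL(base);
    for (const [key, value] of Object.entries(params)) {
        url.searchParams.set(key, value);
    }
    return url.href;
};

console.log(buildUrl("https://quotes.toscrape.com/search", { author: "Albert Einstein" }));
//prints "https://quotes.toscrape.com/search?author=Albert+Einstein"
```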

2. Parsing the response

When you parse the response, you write logic to find elements on the page and extract their data. You’ve done this with both Playwright and Cheerio.

Here is our main parsing logic from the Playwright example.

const quoteCards = await page.$$("div[class='quote']");

for (const quoteCard of quoteCards) {
    //find the actual quote
    const quoteText = await quoteCard.$("span[itemprop='text']");
    //find the author
    const author = await quoteCard.$("small[itemprop='author']");

    //build a quote object from the text
    const quote = {
        text: await quoteText.innerText(),
        author: await author.innerText()
    };

    //add it to the extracted data
    extractedData.push(quote);
}

If you look below, you can view the equivalent code using the static parser.

//get the HTTP response
const response = await axios.get(currentUrl);
//load its "data" object into cheerio
const $ = cheerio.load(response.data);

//iterate through the quotes on each page
$("div[class='quote']").each((index, quoteCard) => {
    const quoteText = $(quoteCard).find("span[itemprop='text']").text().trim();
    const author = $(quoteCard).find("small[itemprop='author']").text().trim();
    const quote = {
        text: quoteText,
        author: author
    };
    extractedData.push(quote);
});

3. Interacting with the page

Direct page interactions require a headless browser. That said, you can simulate some interactions, like pagination, if you parse the page and construct your requests correctly. Take a look at our page interaction snippets with Playwright.

This is the part of the code where we click the login button to load the login page. Then we find the password and username input boxes. We use the fill() method to fill these boxes with our inputs.

//find the login button and click on it
let loginButton = await page.$("a[href='/login']");
await loginButton.click();

//find the username and password boxes
const usernameBox = await page.$("input[id='username']");
const passwordBox = await page.$("input[id='password']");

//fill out the boxes
await usernameBox.fill("my-username");
await passwordBox.fill("my-password");

//find and click the login button to submit the inputs
loginButton = await page.$("input[type='submit']");
await loginButton.click();

When we’re finished parsing the page, we actually run one last page interaction. We check to see if the page contains a “next” button. If it does, we continue our parsing loop.

nextButton = await page.$("li[class='next'] > a");
if (nextButton) {
    await nextButton.click();
    await page.waitForLoadState("domcontentloaded");
}

If you look at the snippets below, you can view how we simulated basic interaction with the static parser. We find the href attribute from the “next” button. If one is found, we combine it with our base url to create the url of the next page. If not, we assign the current url to null so we can exit the loop.

//find the link to the next page
const nextLink = $("li.next > a").attr("href");

//if it doesn't exist, convert our currentUrl to null and exit the loop
currentUrl = nextLink ? `${baseUrl}${nextLink}` : null;
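The concatenation above works because Quotes to Scrape uses root-relative hrefs. On sites that mix relative and absolute links, NodeJS’s built-in URL constructor resolves them all correctly. resolveNext is a hypothetical helper, not part of the tutorial code.

```javascript
//resolve a "next" href against the current page url; returns null
//when there is no next link, mirroring the loop-exit logic above
const resolveNext = (href, currentUrl) =>
    href ? new URL(href, currentUrl).href : null;

console.log(resolveNext("/page/2/", "https://quotes.toscrape.com"));
//prints "https://quotes.toscrape.com/page/2/"
console.log(resolveNext(null, "https://quotes.toscrape.com"));
//prints null
```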

Choosing a tool: Features and criteria

When you choose a scraping tool, choose the tool that fits your task. What are you extracting? Do you need to interact with the page?

Base your choice on the actual requirements, not the popularity of the tool.

Parsing and data extraction

If you’re dealing with a static page, use a barebones HTML parser. This is doable with a headless browser like Playwright, but it’s overkill. Choose a static parsing library like you did with Cheerio.

JavaScript rendering and browser automation

For rendering and automation, you need a headless browser. This is the only time it’s an absolute requirement. If you need to click buttons, fill input boxes or load dynamic content, use a headless browser. In this tutorial, we used Playwright, but Selenium and Puppeteer are also great options.

Scalable data collection

For large-scale scraping, you should default to static parsers. This keeps your costs and resource usage low. Use headless browsers only when needed — they’re resource intensive and shouldn’t be overused.
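One way to keep large static-parser scrapes polite is to cap how many requests run at once. Here’s a minimal concurrency limiter, a hypothetical helper rather than a library API, with simulated fetches standing in for real requests.

```javascript
//run at most `limit` tasks at once; each task is a function
//returning a promise, and results keep the original order
const runLimited = async (tasks, limit) => {
    const results = [];
    let next = 0;
    const worker = async () => {
        while (next < tasks.length) {
            const index = next++;
            results[index] = await tasks[index]();
        }
    };
    const workerCount = Math.min(limit, tasks.length);
    await Promise.all(Array.from({ length: workerCount }, worker));
    return results;
};

//simulated page fetches standing in for real HTTP requests
const tasks = [1, 2, 3, 4, 5].map((n) => () =>
    new Promise((resolve) => setTimeout(() => resolve(`page ${n}`), 10))
);

runLimited(tasks, 2).then((results) => console.log(results));
```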

If you’re looking for a scalable, full-featured framework, take a look at node-scrapy.

Efficient web scraping with JavaScript: Tools, techniques & when to use them

Scraping with JavaScript can help reduce overhead and give you a more web-native experience. There are a variety of tools available to meet your needs — no matter what they are. Always remember to choose the right tool for the right job. Static parsers should be your go-to tool. Only run a headless browser when you absolutely need it. JavaScript can lighten your scraping workload and help streamline the extraction of your training data.