
Extracting and preparing image and video data from the web

Scraping web data for visual AI training? This hands-on guide shows you how to extract, clean, enrich and structure web image and video data that models can learn from

The growth of visual AI applications has created a demand for high-quality, diverse image and video datasets. Modern computer vision systems, multimodal AI models and generative AI platforms require vast amounts of structured visual data to achieve human-level performance. However, scraping and preparing this data from the web presents unique challenges that go far beyond simple file downloading.

In this comprehensive guide, we’ll explore the complete pipeline for scraping, processing and preparing web-scale visual datasets for AI applications.

Where to find and download image and video data on the web

The foundation of many visual AI projects begins with established public datasets. 

  1. ImageNet – millions of labeled images across thousands of categories
  2. Microsoft’s COCO dataset – extensive media collections for object detection, segmentation and captioning applications
  3. Google’s Open Images Dataset – close to 10 million images with detailed labels under a Creative Commons license

For cutting-edge multimodal AI development, LAION-5B represents a landmark achievement with billions of CLIP-filtered image-text pairs. This openly available dataset democratizes access to large-scale visual AI research.

Web-scale data extraction sources

Beyond curated datasets, the open web provides virtually unlimited visual content. E-commerce platforms, social media sites, news websites and multimedia repositories represent rich sources of diverse visual content spanning multiple domains and use cases.

Despite this abundance, web scraping for images and videos often requires sophisticated infrastructure for rendering dynamic content. If you encounter a straightforward static HTML website, navigating the DOM and locating media for download is a breeze. Let's call this traditional scraping.

On the other hand, modern websites increasingly rely on JavaScript frameworks like React, Vue.js and countless in-house variants. To optimize performance and user experience, much of a rich website's content is loaded asynchronously after the initial page load. Although this makes for a smoother browsing experience, it renders traditional scraping approaches inadequate for comprehensive data extraction.

Hence the need for more sophisticated scraping methods.

Techniques for video and image scraping

Although not entirely necessary, some technical experience setting up web scraping tools will help you swiftly progress through the coming sections. 

A Python 3.x environment with ample processing capacity is a prerequisite.

# confirm environment fit
python3 --version

Several Python packages will come up throughout this guide. Using a virtual environment to isolate dependencies for such projects is always a safe practice. First, confirm pip is available:

# confirm pip3 installation
pip3 --version

If pip is not installed, run the following in your terminal:

curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python3 get-pip.py

Create a virtual environment:

python3 -m venv myproject

Activate on macOS/Linux:

source myproject/bin/activate

Activate on Windows:

myproject\Scripts\activate

Some of the JavaScript tools referenced later require Node.js 18.x. Installation and platform-specific documentation are available on the official website. You can install the Node packages used hereafter with the stock package manager (npm).

Essential Library Dependencies

The web scraping libraries used in the research and demonstrations in this guide can all be installed with pip; the commands below fetch the packages required to scrape, automate and process images and videos from the web.

Beautiful Soup 4 installation:

pip install beautifulsoup4

Beautiful Soup delegates parsing to pluggable parsers such as lxml and html5lib:

pip install lxml
pip install html5lib

Requests library for HTTP operations:

pip install requests

Selenium WebDriver for browser automation:

pip install selenium

Browser Automation Tools

# Puppeteer installation:
npm install puppeteer
# Playwright installation:
npm install playwright

After installation, run npx playwright install to download the browsers Playwright automates, such as Chromium, Firefox and WebKit.

Image and Video Processing Libraries

Installing OpenCV Python library:

# Main modules package
pip install opencv-python
# Full package with contrib modules
pip install opencv-contrib-python
# Pillow, the maintained fork of the Python Imaging Library (PIL)
pip install Pillow

Homebrew is a tried and tested way to install and manage FFmpeg on macOS.

brew install ffmpeg

On Windows, use one of these package managers:

# Chocolatey
choco install ffmpeg
# Winget
winget install ffmpeg
# Scoop
scoop install ffmpeg

A typical off-the-shelf machine should have sufficient storage and processing power for the basic scraping tasks in this guide.

Static Content Extraction

For websites serving static HTML content, you can effectively extract image sources with traditional scraping approaches using BeautifulSoup and Python's requests library.

The basic approach involves parsing markup documents to locate <img> tags and extracting their src attributes:

from bs4 import BeautifulSoup
import requests

url = "https://example.com"  # target page
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
image_elements = soup.find_all("img")
image_urls = [img["src"] for img in image_elements if img.get("src")]

However, this approach often fails to capture images loaded through CSS background properties, SVG elements or data URIs embedded directly in the HTML.
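A lightweight way to recover some of these is to scan the raw markup for CSS url() tokens. The sketch below is a heuristic, not a full CSS parser; the function name and extension list are illustrative, and it assumes you already hold the page HTML as a string:

```python
import re

# Matches url(...) tokens, with or without surrounding quotes
CSS_URL_RE = re.compile(r"url\(\s*['\"]?([^'\")\s]+)['\"]?\s*\)")

def extract_css_image_urls(html):
    """Collect image URLs referenced from CSS url() tokens in raw markup.

    A heuristic pass over inline styles and <style> blocks; keeps only
    values that look like image files or embedded data URIs.
    """
    image_exts = (".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg")
    return [u for u in CSS_URL_RE.findall(html)
            if u.lower().endswith(image_exts) or u.startswith("data:image/")]
```

Combining this with the <img>-based pass above gives noticeably better coverage on design-heavy pages.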

Dynamic Content and JavaScript-Rendered Pages

Modern web applications frequently render visual content dynamically through JavaScript, requiring headless browser automation. Puppeteer and Playwright have emerged as the leading solutions for this challenge, providing programmatic control over full browser instances.

Playwright is particularly suitable for massive visual data extraction operations that require robust, reliable automation across different browser environments.

A typical implementation of an image extraction operation for https://example.com with Playwright would look like this:

const { chromium } = require('playwright');

(async () => {
    const browser = await chromium.launch();
    const page = await browser.newPage();
    await page.goto('https://example.com');
    // Wait for dynamic content to load
    await page.waitForSelector('img[data-src]');
    // Extract image URLs including lazy-loaded content
    const imageUrls = await page.evaluate(() => {
        return Array.from(document.querySelectorAll('img'))
            .map(img => img.src || img.dataset.src)
            .filter(Boolean);
    });
    await browser.close();
})();

Computer Vision-Enhanced Scraping

The methods demonstrated so far rely on the <img> tag to detect media files. But what if content is rendered via CSS or loaded through links without that markup cue?

Advanced scraping tools can leverage computer vision techniques to identify and extract visual content more intelligently. For example, OpenCV can analyze page layouts and detect image regions, while YOLO models can perform object detection to identify specific types of visual content.

Selenium paired with computer vision libraries, for example, can create sophisticated automation workflows that adapt to changing website layouts and interact with complex user interfaces.
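As one minimal building block of such a workflow, the sketch below assumes element bounding boxes have already been collected (for example from Selenium's element.rect) and filters out icons, tracking pixels and layout dividers by size and aspect ratio before any heavier vision model runs. The function name and thresholds are illustrative:

```python
def filter_visual_candidates(elements, min_area=10_000, max_aspect=4.0):
    """Keep only elements whose rendered box plausibly contains content imagery.

    `elements` is a list of dicts with 'x', 'y', 'width', 'height' keys,
    the shape returned by Selenium's element.rect (a hypothetical input here).
    """
    kept = []
    for el in elements:
        w, h = el["width"], el["height"]
        if w <= 0 or h <= 0:
            continue  # invisible or collapsed element
        area = w * h
        aspect = max(w / h, h / w)
        # Small boxes are icons/pixels; extreme aspect ratios are dividers
        if area >= min_area and aspect <= max_aspect:
            kept.append(el)
    return kept
```

Only the surviving candidates would then be passed to OpenCV or a YOLO model for content-level analysis.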

Downloading and storing scraped media files

When carrying out a visual asset collection operation, reliable download infrastructure should be in place, capable of efficiently handling both the size and the volume of the files fetched.
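A minimal sketch of such infrastructure, assuming the requests library and a flat list of media URLs, batches downloads across a thread pool and skips failures rather than aborting the run. Function and directory names are illustrative:

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

def download_media(urls, out_dir, max_workers=8, timeout=30):
    """Download a batch of media URLs concurrently, skipping failures.

    Returns a list of (url, local_path) pairs for successful downloads.
    """
    os.makedirs(out_dir, exist_ok=True)

    def fetch_one(url):
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()
        # Derive a filename from the URL path, ignoring query strings
        path = os.path.join(out_dir, os.path.basename(url.split("?")[0]))
        with open(path, "wb") as f:
            f.write(resp.content)
        return url, path

    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_one, u): u for u in urls}
        for fut in as_completed(futures):
            try:
                results.append(fut.result())
            except requests.RequestException:
                pass  # log and move on; one bad URL shouldn't stop the batch
    return results
```

A production version would add retries, checksum verification and resumable downloads on top of this skeleton.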

FFmpeg is the industry-standard tool for video processing. It facilitates format conversion, quality adjustment and metadata extraction with a few standard commands.

# Video conversion and optimization with FFmpeg
ffmpeg -i input.mp4 -vf scale=1280:720 -c:v libx264 -crf 23 output.mp4
# Extract specific time segments
ffmpeg -ss 00:01:00 -t 00:00:30 -i input.mp4 -c copy segment.mp4

Storage Architecture

Effective storage strategies balance accessibility, cost and scalability. Cloud object stores like Amazon S3, Google Cloud Storage and Azure Blob Storage provide virtually unlimited capacity with global distribution capabilities. 

For local processing, organizing datasets with clear directory structures, facilitating both human navigation and programmatic access, is essential:

dataset/
├── raw/
│   ├── images/
│   └── videos/
├── processed/
│   ├── resized/
│   ├── normalized/
│   └── augmented/
└── metadata/
    ├── annotations/
    └── labels/

Scalability Considerations

The img2dataset tool demonstrates best practices for large-scale image downloading. It can download and package on the order of 100 million image URLs in about 20 hours on a single machine. Additionally, it supports ethical scraping practices and automatically respects opt-out directives such as the X-Robots-Tag: noai and X-Robots-Tag: noimageai headers.

Extracting Metadata and Structuring Datasets

Exchangeable Image File Format (EXIF) data provides crucial technical metadata including camera settings, timestamps, GPS coordinates and image dimensions. With ExifTool, you can batch metadata extraction across multiple file formats.

# Extract all metadata from an image
exiftool image.jpg
# Extract specific GPS coordinates
exiftool -GPS* image.jpg
# Remove sensitive metadata
exiftool -all= image.jpg
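If you prefer to stay inside Python rather than shell out to ExifTool, Pillow exposes the same technical metadata through Image.getexif(). A minimal reader might look like the sketch below; note that web-scraped images frequently have EXIF stripped, so empty results are normal:

```python
from PIL import Image
from PIL.ExifTags import TAGS

def read_exif(path):
    """Return the EXIF tags of an image as a {tag_name: value} dict."""
    with Image.open(path) as img:
        exif = img.getexif()
        # Map numeric tag IDs to human-readable names where known
        return {TAGS.get(tag_id, tag_id): value for tag_id, value in exif.items()}
```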

Contextual Metadata Extraction

Beyond technical specifications, contextual metadata significantly enhances dataset quality. Alt text, captions and surrounding HTML content provide valuable semantic information that can serve as natural language descriptions for multimodal training. Azure AI Video Indexer demonstrates advanced capabilities for extracting insights from video content, including speech transcription, object detection and scene analysis.
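A simple sketch of this idea with BeautifulSoup pairs each <img> with its alt text and the caption of any enclosing <figure>; the record shape is an assumption for illustration, not a standard:

```python
from bs4 import BeautifulSoup

def extract_image_context(html):
    """Pair each <img> with its alt text and any <figcaption> in its <figure>.

    Returns a list of {"src", "alt", "caption"} dicts usable as weak
    natural-language labels for image-text training pairs.
    """
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for img in soup.find_all("img"):
        figure = img.find_parent("figure")
        caption = figure.find("figcaption") if figure else None
        records.append({
            "src": img.get("src", ""),
            "alt": img.get("alt", "").strip(),
            "caption": caption.get_text(strip=True) if caption else "",
        })
    return records
```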

Dataset Structure and Organization

Properly structured datasets facilitate training efficiency and reproducibility. FiftyOne is a leading platform for visual dataset management. It provides visualization, evaluation and quality assessment tools. It integrates naturally with popular ML frameworks while offering advanced capabilities for handling complex labels, exploring failure modes and identifying annotation mistakes.

import fiftyone as fo

# Create and populate dataset
dataset = fo.Dataset("visual_ai_dataset")
dataset.add_dir(
    dataset_dir="/path/to/images",
    dataset_type=fo.types.ImageDirectory,
    label_field="ground_truth"
)
# Add predictions and evaluate model performance
dataset.evaluate_detections(
    predictions_field="predictions",
    gt_field="ground_truth",
    eval_key="eval"
)

Preprocessing Scraped Media for AI Readiness

After acquiring media content through scraping methods, the next step is to normalize and standardize it to make it AI-digestible. Normalization represents the foundational preprocessing step. It involves scaling pixel values to consistent ranges to facilitate model training.

Format Conversion and Resizing

Multimodal AI systems typically require standardized input formats and dimensions. OpenCV and PIL provide efficient tools for these operations. Here’s an image dimension resizing example:

import cv2
from PIL import Image

# Resize and normalize with OpenCV
img = cv2.imread('input.jpg')
img_resized = cv2.resize(img, (512, 512))
img_normalized = img_resized / 255.0

# Convert formats with PIL
img = Image.open('input.jpg')
img_rgb = img.convert('RGB')
img_resized = img_rgb.resize((512, 512), Image.LANCZOS)

Deduplication Strategies

Large web-scraped datasets inevitably contain duplicate or near-duplicate content. Perceptual hashing, particularly the dHash (difference hash) algorithm, provides efficient duplicate detection:

import cv2

def dhash(image, hash_size=8):
    # Convert to grayscale and resize to hash_size rows x (hash_size + 1) columns
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    resized = cv2.resize(gray, (hash_size + 1, hash_size))

    # Each bit encodes whether a pixel is brighter than its left neighbor
    diff = resized[:, 1:] > resized[:, :-1]
    return sum([2 ** i for (i, v) in enumerate(diff.flatten()) if v])
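Two dHash values can then be compared by Hamming distance; the 10-bit threshold below is a common heuristic for 64-bit hashes, not a universal constant:

```python
def hamming_distance(hash1, hash2):
    """Count the bits that differ between two perceptual hashes."""
    return bin(hash1 ^ hash2).count("1")

def is_near_duplicate(hash1, hash2, threshold=10):
    """Flag images whose hashes differ by at most `threshold` bits."""
    return hamming_distance(hash1, hash2) <= threshold
```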

Advanced Preprocessing Techniques

Background removal enhances focus on primary subjects, which is particularly valuable for e-commerce and product recognition applications. Optical Character Recognition (OCR) extracts textual content from images, enabling rich multimodal datasets that combine visual and textual information. Modern Vision Language Models (VLMs) like GPT-4o integrate these capabilities, providing context-aware text extraction. The result is that they understand abbreviations and predict (and generate) missing information.
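As a sketch of the OCR step, the helper below attaches extracted text to each image record. By default it assumes pytesseract with a locally installed Tesseract binary, but any callable with the same path-to-text signature, including a VLM-backed one, can be swapped in; the record shape is illustrative:

```python
def ocr_enrich(records, ocr_fn=None):
    """Attach OCR-extracted text to each image record under 'ocr_text'.

    `ocr_fn` maps an image path to text. The default tries pytesseract
    (assumes the Tesseract binary is installed on the machine).
    """
    if ocr_fn is None:
        import pytesseract
        from PIL import Image
        ocr_fn = lambda path: pytesseract.image_to_string(Image.open(path))
    for rec in records:
        rec["ocr_text"] = ocr_fn(rec["path"]).strip()
    return records
```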

Labeling and auto-annotation

Image scraper labeling demo. Source: Label Studio

Label Studio provides support for computer vision annotation tasks. Its capabilities include semantic segmentation with polygons and masks, object detection with bounding boxes, keypoint labeling and image captioning. Using Label Studio for image processing falls under manual labeling approaches.

Automated Labeling Solutions

The auto_labeler library demonstrates the potential for zero-cost automated annotation using state-of-the-art computer vision techniques. 

It abstracts away complex algorithms and provides uniform interfaces for tasks including image classification, object detection and instance segmentation. The library uses models like CLIP, OWL-ViT-V2 and SAM-ViT.

Hybrid Approaches

Micro models represent an emerging approach that pairs automated annotation with subsequent human verification. This strategy reduces manual labeling costs while maintaining accuracy. It is particularly effective for large-scale datasets where complete manual annotation is prohibitively expensive.

When scraping data from sources other than public repositories, it's good practice to provide attribution and follow the guidelines set by robots.txt files.

Robots.txt Compliance

Robots.txt files state website owners' preferences for automated access. When using web scraping tools, look for features and configuration options that let you adjust automation activities to your own compliance and risk preferences, such as enabling robots.txt compliance or disabling CAPTCHA handling as needed.
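Python's standard library can enforce this without third-party tooling; the sketch below builds a can_fetch predicate from raw robots.txt content (in practice you would first fetch the file from the target site, and the user-agent name is a placeholder):

```python
from urllib.robotparser import RobotFileParser

def make_robots_checker(robots_txt, user_agent="MyScraper"):
    """Build a can_fetch(url) predicate from raw robots.txt content."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Closure captures the parsed rules and the scraper's user agent
    return lambda url: parser.can_fetch(user_agent, url)
```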

Opt-Out Mechanisms

Modern scraping tools also increasingly support website opt-out directives expressed through HTTP headers like X-Robots-Tag: noai, X-Robots-Tag: noimageai and X-Robots-Tag: noimageindex.

Attribution and Fair Use

As noted above, proper attribution is good practice for web-scraped visual content, particularly in commercial applications. Rate limiting (mathematically defined as Max Requests / Time Interval) ensures controlled access that doesn't overwhelm server resources. For example, a rate limit of 100 requests per 60 seconds translates to one request every 0.6 seconds.
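A minimal in-process rate limiter following this definition just sleeps long enough to keep consecutive requests at least interval / max_requests seconds apart; the class name is illustrative:

```python
import time

class RateLimiter:
    """Cap outgoing requests at max_requests per interval seconds."""

    def __init__(self, max_requests, interval):
        # Minimum gap between consecutive requests, in seconds
        self.min_gap = interval / max_requests
        self._last = float("-inf")

    def wait(self):
        """Block until the next request is allowed, then record the time."""
        sleep_for = self._last + self.min_gap - time.monotonic()
        if sleep_for > 0:
            time.sleep(sleep_for)
        self._last = time.monotonic()
```

You would call limiter.wait() immediately before each HTTP request in the download loop.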

Building scalable visual data pipelines

Successful visual AI projects require end-to-end workflows integrating scraping, processing, annotation and validation stages. Here are a few tools you might want to consider adding to your stack:

Bright Data Video and Media Data Pipelines

Bright Data's Video and Media Data Pipelines significantly advance automated visual content collection and are specifically designed to power generative AI models and apps. A pipeline built with Bright Data inherits programmatic search, filtering and downloading of extensive video collections, audio, transcripts and rich metadata. The platform also provides a Web Archive API with access to over 100 billion cached web pages and massive archived datasets (including 2.3 billion videos and 365 billion image URLs, with 2.5 billion image and video URLs discovered daily) for AI training and analysis.

Building on these capabilities, the Unlocker API automatically renders web pages on a remote browser and navigates blocks and CAPTCHAs to ensure reliable video downloading. This approach makes it possible to reliably collect video and web data from sites that rely on JavaScript, while offloading infrastructure maintenance and management.

Together, these capabilities enable developers to create, access and leverage vast, diverse datasets for building cutting-edge generative AI solutions.

Oxylabs Video Data API

Oxylabs’ Video Data API is part of its broader Web Scraper API platform. It provides programmatic extraction of video content, audio files, transcripts and metadata primarily from YouTube. The API supports finding videos, channels and playlists via search, downloading video and audio, and retrieving structured transcriptions in multiple languages.

In addition to video-specific features, the Web Scraper API supports extracting data from other web sources in the same workflow, making it useful for teams who need to reliably collect and process both video data and associated metadata for large-scale AI and analytics projects.

The API can be integrated directly into Python or cloud-based pipelines. This approach can help teams assemble consistent, structured video datasets without manual intervention.

Decodo YouTube Scraper API

Decodo’s YouTube Scraper API is designed to search for videos, channels and playlists, extract video transcripts, download video and audio files and retrieve key metadata such as titles, tags, engagement metrics and resolution. Data can be delivered in structured formats like JSON or TXT, making it suitable for LLM training, content classification, sentiment analysis or recommendation models. 

Decodo’s platform also features separate APIs for social media data extraction. This flexible approach allows teams to coordinate video and social content gathering in one workflow, making Decodo a practical choice for projects needing efficient, centralized access to diverse multimedia sources.

Build for Monitoring and Maintenance

Large-scale visual datasets require ongoing maintenance to address format updates and quality degradation over time. Implementing robust monitoring systems helps identify issues early and maintains dataset reliability for production AI applications.

Extracting, processing and presenting visual input for AI

Extracting and preparing image and video data from the web for visual AI applications presents a sophisticated engineering challenge that extends beyond simple file downloading. Success requires careful attention to scraping methodology, metadata preservation, preprocessing standardization and the design of scalable infrastructure.

The joint development of advanced headless browser automation, computer vision-enhanced scraping techniques and automated annotation tools opens up access to high-quality visual datasets. However, the most impactful visual AI systems will continue to emerge from teams that master the complete pipeline.

As visual AI capabilities continue advancing, the teams and organizations that invest in robust and scalable data extraction pipelines will be best positioned to forge the next generation of computer vision and multimodal AI systems. The foundation of visual AI excellence lies not just in sophisticated algorithms but in the quality, diversity and ethical integrity of the datasets that fuel them.