Web text is one of the richest sources of training data for language models. This kind of content reflects how people naturally write, explain and ask questions in everyday contexts, making it valuable for natural language processing (NLP), generative AI systems and machine learning applications.
But raw web data is rarely usable in its original form. It is noisy, repetitive and cluttered with unnecessary elements like navigation menus, boilerplate templates and duplicate data blocks. A typical website may serve redundant copies of the same page under different URLs. These redundant entries inflate storage requirements and slow down every later processing step. Without deduplication, normalization or logical data chunks, large datasets quickly become inefficient to process and difficult to train on.
This guide walks through how to build clean, scalable datasets from websites using a structured data pipeline with a deduplication process. We’ll cover practical techniques for web scraping, content extraction, cleaning, deduplication and formatting. You will learn how to:
- Identify high-signal, human-written sources for training
- Extract readable content using static and dynamic web scrapers
- Remove boilerplate and redundant text, and detect duplicates early
- Enrich text with metadata like language, title and source URL
- Structure and output results as JSONL for integration into AI pipelines, data lakes and enterprise systems
Each step includes code samples and techniques suited for real-world machine learning workflows where performance, efficiency and precision matter.
Not all web text is worth scraping
“Garbage in, garbage out” is a harsh reality when you’re building datasets for language models. Many web pages are packed with ads, spam or templated filler. This kind of content adds volume without signal, leading to data duplication, redundant entries and unnecessary load on your storage system. It also increases the cost of cleaning and reduces the effectiveness of large datasets in downstream tasks.
A better approach is to focus on text that’s human-written, unique and rich in structure and intent. Technical blogs, how-to guides, open documentation, forum threads and product reviews are solid bets. These sources are usually well-organized, goal-driven and based on real user problems, explanations and interactions, making them ideal for generative AI and natural language models.
To make these decisions more objective, consider simple heuristics:
- Word count thresholds (e.g., filter out pages under 100 words)
- Tag structure (e.g., prioritize <article> and <main> blocks over sidebars)
- Readability scores (e.g., Flesch-Kincaid to identify overly terse or verbose content)
- Duplication ratio (e.g., number of repeated lines or elements)
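A minimal filter combining these heuristics might look like the sketch below. The function name and the thresholds (min_words, max_dup_ratio) are illustrative defaults, not a standard API; tune them against a sample of your own corpus:

```python
import re

def passes_quality_filters(text: str, min_words: int = 100, max_dup_ratio: float = 0.3) -> bool:
    """Apply simple quality heuristics to a page's extracted text."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    words = re.findall(r"\w+", text)

    # Word count threshold: drop very short pages
    if len(words) < min_words:
        return False

    # Duplication ratio: fraction of lines that repeat an earlier line
    if lines:
        dup_ratio = 1 - len(set(lines)) / len(lines)
        if dup_ratio > max_dup_ratio:
            return False

    return True
```

A readability score check (e.g., with a library like textstat) could be added as a third condition in the same style.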
These filters help any web scraper reduce noise and avoid storing duplicate or low-value records from the website. They also prevent duplication issues further down the pipeline.
Avoid scraping cookie banners, footers, navigation menus or auto-generated comment threads. If a page feels cluttered to a human, it is likely not worth feeding into a model either. Start with trusted sources like Common Crawl, GitHub wikis or domain-specific archives, especially those used in enterprise applications or data lakes.
Overall, strong source selection reduces wasted bandwidth, lowers storage requirements and builds cleaner datasets from the start.
Preparing web text for NLP and LLM pipelines
After identifying high-quality sources, the next step is to transform raw web content into structured data that large language models (LLMs) can comprehend. This involves a sequence of practical tasks: extracting visible text, removing irrelevant elements, enriching content with metadata, detecting language, eliminating duplicates and formatting the output into model-friendly structures.
Each stage enhances data quality, reducing redundant data and improving storage utilization. The final dataset becomes easier to scale, less prone to duplication issues and better suited for NLP applications and large-scale model training.
Extract clean text from pages
Once your sources are chosen, the next step is to extract the content from them. This involves navigating the HTML structure, interacting with the page as needed and extracting the relevant text or data.
There are two main approaches to web scraping:
- Static scraping works when all the content is already present in the HTML. Tools like requests and BeautifulSoup are fast and simple, but break when data is loaded with JavaScript.
- Dynamic scraping uses a browser automation tool like Playwright or Selenium. These tools can interact with the page, click buttons, wait for content to load and extract it after rendering.
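For the static case, a minimal sketch with BeautifulSoup shows the idea. The HTML snippet here is hypothetical and inlined so the extraction logic runs without a network call; in practice you would fetch the page first with requests, as the comment indicates:

```python
from bs4 import BeautifulSoup

# In a real run you would fetch the page first, e.g.:
#   import requests
#   html = requests.get("https://example.com/article").text
html = """
<html><body>
  <nav>Home | About | Contact</nav>
  <article>
    <h1>How to cache API responses</h1>
    <p>Caching reduces redundant requests and speeds up your application.</p>
  </article>
  <footer>Example Inc.</footer>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Drop boilerplate containers before extracting text
for tag in soup.find_all(["nav", "footer"]):
    tag.decompose()

# Keep only the main article content
article = soup.find("article")
text = article.get_text(separator="\n", strip=True)
print(text)
```

This approach breaks as soon as the content is rendered by JavaScript, which is where browser automation comes in.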
Suppose you need to scrape product listings from a page that loads content only after interaction. A good example is scrapingcourse.com/button-click, which delays product data until a “Load More” button is clicked.
This mirrors real-world cases like e-commerce, job boards and property sites, where JavaScript renders content dynamically. These listings are precisely the type of structured web data utilized in NLP pipelines for tasks like entity extraction and classification.
Here’s how to do it with Playwright:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)  # Set headless=False to debug visually
    page = browser.new_page()
    page.goto("https://www.scrapingcourse.com/button-click")
    print("Initial page loaded. Clicking 'Load More' button...")
    page.click("button:has-text('Load More')")  # Click the button
    page.wait_for_selector(".product-item", state="attached", timeout=10000)  # Wait for items to load
    print("New products loaded. Extracting data...")
    product_elements = page.locator(".product-item").all()  # Get all product items
    print(f"Found {len(product_elements)} product items.")
    for i, product_element in enumerate(product_elements):  # Extract name and price
        try:
            name = product_element.locator(".product-name").text_content()
            price = product_element.locator(".product-price").text_content()
            print(f"Product {i+1}: Name='{name}', Price='{price}'")
        except Exception as e:
            print(f"Error extracting product {i+1}: {e}")
    browser.close()
    print("Browser closed.")
Running this prints each product's name and price as it is extracted.
Dynamic pages often include anti-bot protection, CAPTCHAs or unpredictable structure. These challenges are common. Use throttling, wait conditions and retries to keep your scraper stable. Once the content is captured, the next step is to remove the surrounding noise while maintaining a meaningful structure.
Strip the noise without losing meaning
Boilerplate elements, such as ads, navigation bars, footers and buttons, add clutter without value. These should be removed to isolate the core content.
In the earlier scraping example, the output included 12 product entries with readable names and prices, but not in a structured format. A quick post-processing step can convert this into clean, normalized records ready for storage, labeling or pipeline use.
# raw_output holds the "Product N: Name='...', Price='...'" lines from the scraper
products = []
for line in raw_output:
    name = line.split("Name='")[1].split("',")[0].strip()
    price = float(line.split("Price='")[1].replace("$", "").replace("'", ""))
    products.append({"name": name, "price": price})
The result is a list of clean, structured dictionaries with all the "noise" (like Product X:, Name=', Price=', $) removed, leaving only the necessary data.
This format supports use cases such as product classification, similarity search or prompt-based LLM tasks. It also keeps the dataset lean and focused.
For full-page HTML extractions, these tools come in handy:
- trafilatura for long-form structured content
- readability-lxml for article-style layouts
- newspaper3k for title, text and metadata extraction
Normalization should also cover the following:
- Encoding (remove unexpected unicode characters)
- Whitespace (collapse repeated spaces)
- Punctuation (standardize where needed)
Maintain meaningful formatting like headings, lists and code blocks to preserve structure for models that rely on context. The goal is to clean out what distracts and retain what informs.
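A light-touch normalization pass covering these points might look like this sketch. The character mappings and regex rules are illustrative and should be tuned to your corpus:

```python
import re
import unicodedata

def normalize_text(text: str) -> str:
    """Fix encoding artifacts and whitespace without flattening
    meaningful structure such as line breaks."""
    # Fold compatibility characters (e.g., full-width letters) into a
    # consistent composed form
    text = unicodedata.normalize("NFKC", text)

    # Standardize curly quotes that often appear in web copy
    text = text.translate(str.maketrans({"\u2018": "'", "\u2019": "'",
                                         "\u201c": '"', "\u201d": '"'}))

    # Collapse runs of spaces and tabs, but keep newlines intact
    text = re.sub(r"[ \t]+", " ", text)

    # Collapse three or more consecutive newlines into one blank line
    text = re.sub(r"\n{3,}", "\n\n", text)

    return text.strip()
```

Because newlines are preserved, paragraph boundaries and list structure survive the cleanup.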
Enrich and filter for contextual value
Raw text alone is rarely sufficient for training or evaluating modern NLP systems. Enriching the extracted content with metadata such as language, URL and source title makes it easier to filter, label or group examples by context. This step also helps in applications like semantic search, retrieval-augmented generation (RAG) and topic-specific modeling.
For instance, from the structured output from the scraped site, you can enrich each product entry by adding:
- Name and Price (already captured)
- Language (detected using langdetect)
- Page Title (via Playwright’s .title() method)
- Canonical URL (or fallback to .url if not available)
from playwright.sync_api import sync_playwright
from langdetect import detect

products = []

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://www.scrapingcourse.com/button-click")
    print("Initial page loaded. Clicking 'Load More' button...")
    page.click("button:has-text('Load More')")
    page.wait_for_selector(".product-item", state="attached", timeout=10000)
    print("New products loaded. Extracting data...")

    try:
        page_title = page.title()
    except Exception:
        page_title = ""

    try:
        canonical_url = page.locator("link[rel='canonical']").get_attribute("href")
        if not canonical_url:
            canonical_url = page.url
    except Exception:
        canonical_url = page.url

    product_elements = page.locator(".product-item").all()
    print(f"Found {len(product_elements)} product items on the page.")

    for i, product_element in enumerate(product_elements):
        try:
            name = product_element.locator(".product-name").text_content().strip()
            price_text = product_element.locator(".product-price").text_content().strip()
            print(f"Product {i+1}: Name='{name}', Price='{price_text}'")
            price_cleaned = price_text
            detection_input = f"{name} {price_cleaned} {page_title}"
            try:
                language = detect(detection_input)
            except Exception:
                language = "unknown"
            products.append({
                "name": name,
                "price": price_cleaned,
                "language": language,
                "title": page_title,
                "url": canonical_url
            })
        except Exception as e:
            print(f"Error extracting product {i+1}: {e}")

    browser.close()

print("Browser closed.")
print("\n--- Enriched Products (Structured) ---")
for item in products:
    print(item)
Each entry in products now carries the detected language, page title and canonical URL alongside the name and price.
Here are some other tips to keep in mind when enriching your data.
- Language detection on short strings is unreliable. Add more context like price or page title to stabilize the output.
- Capturing canonical URLs and page titles adds traceability and structure, enabling better deduplication, filtering and downstream tasks like content grouping or multilingual retrieval.
Once enrichment is complete, the dataset is more structured and searchable. However, before using the dataset for training or retrieval, it’s worth checking whether some entries appear more than once, either subtly or precisely. That’s where deduplication comes in.
Catch duplicates before they pollute the dataset
Even well-structured scraping outputs can include duplicate entries, particularly on dynamic sites where the same content is appended multiple times. Redundancy adds clutter, increases storage requirements and can distort later stages of data analysis.
In the scraping output, items like ‘Grayson Crewneck Sweatshirt’ and ‘Ajax Full-Zip Sweatshirt’ appeared more than once with identical values. Keeping them all would artificially skew frequency, weight and downstream model behavior.
To ensure the dataset remains lean and reliable, a deduplication step is needed. This can be done by converting each product into a hashable representation and filtering based on uniqueness:
# Remove duplicates based on name and price
unique_products = []
seen = set()

for item in products:
    key = (item['name'], item['price'])
    if key not in seen:
        seen.add(key)
        unique_products.append(item)

print("\n--- Deduplicated Products ---")
for product in unique_products:
    print(product)
This filter checks for duplicates based on the combination of product name and price, which are the most stable identifiers across multiple loads. Other fields like URL or page title are constant for all products in this scraping task and do not help detect repetition.
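The exact-match key catches literal repeats. Entries that differ only in case, punctuation or spacing can be collapsed the same way by hashing a normalized form of the text first. A sketch, with illustrative normalization rules and function names:

```python
import hashlib
import re

def near_dup_key(text: str) -> str:
    """Hash a normalized form of the text so entries that differ only in
    case, punctuation or spacing collapse to the same key."""
    normalized = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def dedupe_near(records, text_field="name"):
    """Keep the first record for each near-duplicate key."""
    seen, unique = set(), []
    for record in records:
        key = near_dup_key(record[text_field])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

For fuzzier matches, such as paraphrased text, techniques like MinHash or embedding similarity are the usual next step, but hashed normalization covers the most common scraping artifacts cheaply.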
After deduplication, the list becomes better suited for analysis, search indexing or machine learning input, free from inflated counts and noise introduced by duplicate entries. Now the data is ready for the next step: structuring it for NLP and LLM pipelines.
Structure your data for LLM and NLP pipelines
With enriched product dictionaries in place, the next phase is to prepare the data in a format that downstream pipelines can directly consume. This means moving beyond structured Python objects into standardized serialization formats such as JSONL or Markdown.
A recommended format is JSON Lines (JSONL), where each record is a separate JSON object on a new line. This enables streaming, parallel processing and compatibility with tools like Hugging Face datasets, LangChain retrievers or prompt-tuned indexing pipelines.
Use the snippet below to convert the enriched and deduplicated product list into proper JSONL output:
import json

# Save enriched, deduplicated products as JSONL records
with open("products.jsonl", "w", encoding="utf-8") as f:
    for product in unique_products:
        jsonl_record = {
            "title": product["title"],
            "text": f"{product['name']} - {product['price']}",
            "url": product["url"],
            "language": product["language"]
        }
        f.write(json.dumps(jsonl_record, ensure_ascii=False) + "\n")

# Preview first few JSONL entries for verification
print("\n--- JSONL Export Sample ---")
with open("products.jsonl", "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        if i >= 3:  # Limit preview to first 3 records
            break
        parsed = json.loads(line)
        print(json.dumps(parsed, indent=2, ensure_ascii=False))
Each record should include minimal but complete metadata: a title, the text itself, the source URL and the detected language.
If your pipeline needs to handle long documents (e.g., scraped articles), consider:
- Chunking: Split content into context-aware sections to fit model limits.
- Overlapping: Maintain a small overlap between chunks to preserve coherence.
- Logical boundaries: Break by paragraph, heading or sentence, not arbitrary tokens.
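A paragraph-aligned chunker along those lines might look like this sketch. The word limits are illustrative, and a single paragraph longer than the limit would still need a sentence-level split:

```python
def chunk_text(text, max_words=200, overlap=20):
    """Split text into paragraph-aligned chunks of roughly max_words
    words, carrying a small word overlap between consecutive chunks."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []

    for paragraph in paragraphs:
        words = paragraph.split()
        # Flush the current chunk when adding this paragraph would
        # exceed the limit, then seed the next chunk with the overlap
        if current and len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            current = current[-overlap:]
        current.extend(words)

    if current:
        chunks.append(" ".join(current))
    return chunks
```

Breaking at paragraph boundaries keeps each chunk self-contained, while the overlap preserves continuity for retrieval and context windows.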
While the current dataset (product listings) is already short and well-structured, adopting a consistent format like JSONL ensures easier storage, annotation and integration into modeling pipelines.
Bringing it all together
A well-designed data pipeline is engineered like infrastructure. Each stage should be modular, testable and built for long-term reliability. This includes fetching, cleaning, enriching and formatting, so that what enters your system is consistent, structured and usable at scale.
Break your pipeline into reusable stages:
- Source carefully, focusing on high-quality, well-structured sites and applying filters to select content that adds value and reduces noise in your dataset.
- Clean thoroughly, removing noise while preserving text that carries meaning.
- Enrich the data with metadata such as language, title, canonical URL and source.
- Deduplicate aggressively, eliminating redundant data that increases storage costs and dilutes model performance.
- Format efficiently, using structures like JSONL that work with large datasets, data lakes and downstream tooling.
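One way to realize this staged design is to make each step a small function that transforms a record or drops it, composed into a single pipeline. The stage functions and records below are hypothetical placeholders for the steps in this guide:

```python
def run_pipeline(records, stages):
    """Run each record through a list of stage functions in order.
    A stage returns a transformed record, or None to drop it."""
    for stage in stages:
        records = [r for r in (stage(rec) for rec in records) if r is not None]
    return records

# Hypothetical stages mirroring clean-then-filter
def clean(record):
    record["text"] = " ".join(record["text"].split())
    return record

def filter_short(record, min_words=3):
    return record if len(record["text"].split()) >= min_words else None

pages = [
    {"text": "  Grayson   Crewneck Sweatshirt  "},
    {"text": "ok"},
]
result = run_pipeline(pages, [clean, filter_short])
print(result)
```

Because each stage is a plain function, it can be unit-tested, logged and versioned independently, which is what makes the pipeline maintainable as sources change.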
Plan for growth:
- Add logging, monitoring and version control to track failures and process drift.
- Watch for shifts in page templates or network bandwidth usage that affect consistency.
- Audit outputs regularly and maintain dataset versions for traceability and compliance.
Brute-force scraping creates hidden liabilities: duplicate content, inefficient storage and unstable performance in downstream applications. A thoughtful pipeline reduces unnecessary duplication, improves storage utilization and ensures better outcomes across NLP, LLM and enterprise use cases.
If your model feels off, the issue might not be your prompts. It might be your pipeline. Build pipelines that scale with your ambitions. Clean, diverse and structured web text is the foundation for the next generation of artificial intelligence systems.