Annotating and validating web data for AI with human-in-the-loop workflows

Learn what data annotation is, why it's essential for machine learning, and how HITL workflows and in-house tools improve label quality and accuracy

Web data is one of the biggest sources of training material for AI systems. It’s abundant, easy to collect and covers virtually every domain of human knowledge. However, web-scraped data is unstructured by default, so you need to clean and label it before using it for training.

Automated tools and large language models (LLMs) can help with this, but they’re not enough. They miss edge cases (like sarcasm or ambiguous language) and struggle with nuance in tone. For this reason, human annotation for AI datasets is still important.

This guide explains how to add human annotation and validation to your web-scraped data workflows. We’ll walk through the tools, strategies and key decisions involved in making scraped data cleaner and more reliable.

What is data annotation?

Data annotation is the process of labeling raw data like text, images, audio or video so that machine learning models can learn from it. For example, this could mean marking the sentiment of a sentence or drawing a polygon around an object in an image to help a model learn where things are.

 Text vs image annotation

Why is data annotation important?

Most machine learning models, especially supervised ones, need labeled examples to learn what to predict. Without annotated data, a model has no clear way to connect input patterns with expected outputs. Even the most advanced LLMs, including GPT from OpenAI, Claude from Anthropic and Gemini from Google, rely on annotated data. They were trained on massive datasets that include labeled prompts, classifications and human-ranked responses. Annotation gives models the structure they need to learn, generalize and improve over time.

Types of data annotation

There are several types of annotation across different data types:

  • Text annotation: Focuses on labeling written content. It includes tasks like identifying entities (e.g., names or dates) and tagging sentiment (e.g., positive or negative tone).
  • Image annotation: Involves marking visual elements within static images, such as drawing boxes around objects (e.g., people in a group photo or a dog) or identifying key points (e.g., facial landmarks).
  • Video annotation: An extension of image annotation that applies the labeling process across multiple frames to capture actions or events over time.
  • Audio annotation: Deals with spoken content, including transcription and tagging attributes like speaker identity or emotion.
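To make the text cases above concrete, here is a sketch of what annotated records might look like for two common text tasks. The field names are illustrative, not tied to any specific tool:

```python
# Hypothetical annotated records for two common text tasks.
# Field names here are illustrative, not from any specific tool.

# Named-entity annotation: label character spans in the raw text.
ner_record = {
    "text": "Ada Lovelace published her notes in 1843.",
    "entities": [
        {"start": 0, "end": 12, "label": "PERSON"},  # "Ada Lovelace"
        {"start": 36, "end": 40, "label": "DATE"},   # "1843"
    ],
}

# Sentiment annotation: one label for the whole sentence.
sentiment_record = {
    "text": "The battery life is fantastic.",
    "label": "positive",
}

# Sanity check: each span should match the text it claims to label.
for ent in ner_record["entities"]:
    span = ner_record["text"][ent["start"]:ent["end"]]
    print(ent["label"], "->", span)
```

Storing spans as character offsets rather than copied strings keeps labels unambiguous even when the same word appears twice in a sentence.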

Annotation can be done in several ways. You can pre-label data using rule-based systems, heuristics or machine learning models, then have humans validate those labels. When accuracy really matters, you can also assign the labeling entirely to human annotators as part of a human-in-the-loop (HITL) workflow.
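The rule-based pre-labeling approach can be sketched in a few lines: heuristics assign a tentative label, and anything ambiguous is flagged for human review. The keyword lists here are illustrative assumptions, not a real sentiment lexicon:

```python
# Minimal sketch of rule-based pre-labeling: keyword heuristics assign a
# tentative label, and ambiguous items are queued for human validation.
# The keyword sets below are illustrative assumptions.

POSITIVE = {"great", "excellent", "love", "fantastic"}
NEGATIVE = {"broken", "terrible", "hate", "refund"}

def pre_label(text: str) -> dict:
    words = set(text.lower().split())
    pos, neg = len(words & POSITIVE), len(words & NEGATIVE)
    if pos and not neg:
        return {"text": text, "label": "positive", "needs_review": False}
    if neg and not pos:
        return {"text": text, "label": "negative", "needs_review": False}
    # Mixed or no signal: leave unlabeled and route to a human annotator.
    return {"text": text, "label": None, "needs_review": True}

reviews = [
    "I love this keyboard, excellent build",
    "Totally broken on arrival, want a refund",
    "It works... I guess.",
]
for item in (pre_label(r) for r in reviews):
    print(item["label"], "review needed:", item["needs_review"])
```

In practice the heuristic layer would be a trained model, but the routing logic stays the same: confident labels pass through, the rest go to people.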

What is a HITL workflow?

HITL is a workflow where humans and machines work together to improve the quality and reliability of labeled data. HITL workflows are important when the data is ambiguous or full of edge cases. For example, when labeling scraped product reviews, AI might handle straightforward sentiment just fine but struggle with sarcasm, mixed opinions or vague language like “It works… I guess.”

With a HITL workflow, human annotators can catch edge cases and ensure that annotations reflect the true meaning of the data and stay aligned with your model’s goals. Beyond fixing AI mistakes, a human-in-the-loop workflow also enables complex annotation tasks that require human judgment from the outset. For example, visual data like annotated images or lidar data often requires drawing bounding boxes, tracking key points or performing detailed image segmentation. These tasks require precision and context awareness that current AI models can’t consistently deliver.
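One common way to wire up a HITL workflow is confidence-based routing: predictions above a threshold are auto-accepted, and the rest go to a human review queue. The `predict()` stub below stands in for any real model, and the threshold value is an assumption you would tune:

```python
# Sketch of HITL routing: model predictions above a confidence threshold
# are auto-accepted; the rest go to a human review queue. predict() is a
# stand-in for a real model, and THRESHOLD is an assumption to tune.

def predict(text: str) -> tuple[str, float]:
    # Stub returning (label, confidence) in place of a real sentiment model.
    if "great" in text.lower():
        return ("positive", 0.97)
    return ("neutral", 0.55)

THRESHOLD = 0.9

def route(texts: list[str]):
    auto, human = [], []
    for t in texts:
        label, conf = predict(t)
        (auto if conf >= THRESHOLD else human).append((t, label, conf))
    return auto, human

auto, human = route(["Great product!", "It works... I guess."])
print(len(auto), "auto-labeled,", len(human), "sent to reviewers")
```

Corrections made by reviewers can then be fed back as fresh training data, which is what closes the loop in "human-in-the-loop."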

Data annotation techniques

There are two major ways to handle annotation: in-house or with an external provider.

In-house annotation with custom or open-source tools

Running annotation in-house gives you full control over your labeling process. You decide how tasks are defined, who sees the data, how edge cases are handled and how fast annotation is completed. This works best when your data is sensitive or you need a tight feedback loop between annotators and your ML team.

Here are a few practical ways to run in-house annotation:

Spreadsheets and internal UI

For straightforward tasks like sentiment tagging or binary classification, a basic web interface connected to a spreadsheet or small database is usually enough. You can build lightweight systems using tools like Streamlit or Flask that show one item at a time, let annotators choose a label and store everything in a local CSV or SQLite file.
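The storage side of such a setup can be very small. Below is a sketch where the "database" is just a CSV file; a thin front end (Streamlit, Flask, or even a terminal loop) would call `save_label()` whenever the annotator picks a label. The file name and schema are illustrative:

```python
# Sketch of the "spreadsheet plus tiny UI" approach: storage is a CSV file,
# and any thin front end calls save_label() when the annotator picks a
# label. File name and schema are illustrative assumptions.
import csv
from pathlib import Path

LABELS = ["positive", "negative", "neutral"]

def save_label(path: str, item_id: int, text: str, label: str) -> None:
    assert label in LABELS, f"unknown label: {label}"
    new_file = not Path(path).exists()
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["item_id", "text", "label"])
        writer.writerow([item_id, text, label])

def next_unlabeled(items: list[str], path: str):
    """Return the index of the first item without a saved label."""
    done = set()
    if Path(path).exists():
        with open(path, newline="", encoding="utf-8") as f:
            done = {int(row["item_id"]) for row in csv.DictReader(f)}
    for i in range(len(items)):
        if i not in done:
            return i
    return None

# Simulate an annotator labeling two items in order.
items = ["Love it!", "Arrived broken."]
out = "labels.csv"
save_label(out, 0, items[0], "positive")
save_label(out, 1, items[1], "negative")
print("next item to label:", next_unlabeled(items, out))
```

A Streamlit front end would simply render `items[next_unlabeled(...)]` and call `save_label()` from a button callback, so resuming after a break works for free.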

Alternatively, you can skip building from scratch and start with an open-source tool. One example is the streamlit-annotation-tool, a lightweight interface built with Streamlit that lets you label data right in your browser.

 Streamlit’s text labeling tool

This tool can save you time during early prototyping or small-scale projects. It’s also easy to customize, so you can tweak the interface or labeling logic to fit your specific data and use case.

Label Studio

Label Studio is an open-source platform that supports text, image, audio, video and multi-modal annotation. It runs in the browser and supports custom templates, task queueing and user management. You can also export data in multiple formats.

To get started, install it with pip and run the local server:

pip install label-studio

label-studio start

From the web interface, you can upload your data in JSON or CSV format, define custom label configurations and assign tasks to reviewers. You can also connect it to an AI model endpoint to pre-label data and then send the pre-labeled data to human annotators for correction or review.
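A pre-labeled import file can be built programmatically. The sketch below writes one task in Label Studio's JSON format with a prediction attached; note that `from_name` and `to_name` must match the control and object names in your own labeling configuration, so the names used here are assumptions:

```python
# Sketch of a Label Studio import file with a pre-annotation, written as
# JSON. "from_name"/"to_name" must match the names in your labeling
# config; the names and file path here are assumptions.
import json

tasks = [
    {
        "data": {"text": "The battery life is fantastic."},
        "predictions": [
            {
                "result": [
                    {
                        "from_name": "sentiment",  # e.g. a Choices control
                        "to_name": "text",         # e.g. a Text object
                        "type": "choices",
                        "value": {"choices": ["Positive"]},
                    }
                ]
            }
        ],
    }
]

with open("tasks.json", "w", encoding="utf-8") as f:
    json.dump(tasks, f, indent=2)
```

Once imported, reviewers see the predicted label pre-selected and only need to confirm or correct it, which is usually much faster than labeling from scratch.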

Here’s an example of sentiment annotation using the Label Studio playground.

Label Studio Playground

Label Studio also supports consensus scoring, review queues and multi-step pipelines, which makes it a solid choice for teams who want something more flexible than a spreadsheet but still want to host and manage everything internally.

Other worthy mentions include:

  • Doccano: A web-based tool for text classification, sequence labeling and translation tasks
  • Label Sleuth: An open-source tool developed by IBM that helps non-experts label and explore text data with ML assistance

Building and maintaining your own annotation system takes time. If you’re working with hundreds of thousands of examples or don’t have the capacity to manage annotators and tools, in-house setups can slow you down. In those cases, a managed external provider is the better option.

Using external providers

If you don’t have the time or resources to build and manage your own annotation workflow, external providers can take that off your plate. These companies offer managed platforms, large pools of trained annotators and the infrastructure needed to label massive datasets across different formats.

Also, most external providers include APIs, web tools and quality checks. Some even offer LLM-powered pre-labeling or multi-step reviews to boost speed and accuracy. Here are a few well-known options:

  • Scale AI: Known for delivering high-quality data labeling at scale. It’s frequently used for complex multi-modal tasks such as 3D perception, video annotation and reinforcement learning pipelines. Companies like OpenAI and Meta have reportedly relied on Scale for large-scale training and evaluation datasets.
  • Sama: Focuses on ethically sourced, socially responsible data annotation with built-in QA workflows. It supports a wide range of tasks, including NLP, computer vision and document-level labeling. Google, Nvidia and other large organizations have also worked with Sama for structured and high-quality data at scale.
  • Surge AI: Specializes in high-quality text annotation and LLM-specific datasets. It’s a go-to for teams fine-tuning foundation models or collecting preference data for reinforcement learning from human feedback (RLHF).
  • Toloka: Provides flexible and global crowd annotation with customizable workflows. It’s commonly used for tasks such as content classification, ranking, search relevance and tagging.
  • Labelbox: Combines manual annotation with model-assisted workflows, making it easy to integrate human input into your training pipelines. It includes built-in analytics, quality control features and review queues.

It’s worth mentioning that even if you outsource annotation, you still need to stay closely involved. You’re in charge of defining tasks and keeping quality in check. That means giving clear instructions so annotators know exactly what to label, seeding batches with test examples to catch mistakes, reviewing their work often and watching for inconsistent labels or drift in how the data is tagged over time.
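The "test examples" check can be automated: seed each batch with items whose correct label you already know, then score every annotator against that gold set. The item IDs, labels and pass threshold below are illustrative assumptions:

```python
# Sketch of gold-set QA: seed batches with items whose correct label is
# known, then score each annotator against them. IDs, labels and the 0.9
# pass threshold are illustrative assumptions.

gold = {"item_1": "positive", "item_2": "negative", "item_3": "neutral"}

annotator_labels = {
    "item_1": "positive",
    "item_2": "negative",
    "item_3": "positive",  # disagrees with gold
}

def gold_accuracy(labels: dict, gold: dict) -> float:
    scored = [labels[k] == v for k, v in gold.items() if k in labels]
    return sum(scored) / len(scored) if scored else 0.0

acc = gold_accuracy(annotator_labels, gold)
print(f"gold accuracy: {acc:.2f}")
if acc < 0.9:
    print("flag this annotator's batch for re-review")
```

Running this per annotator and per batch makes quality drift visible early, instead of after the bad labels have reached your training set.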

Major challenges in data annotation

A few things make annotation more difficult as you scale beyond a small batch. Let’s cover some of the most common challenges. 

Cost

Scaling annotation costs money, no matter how you do it. If you build your own setup, you’ll spend engineering time on tools, quality checks and fixing bad labels. If you outsource, you’re paying for speed, volume and vendor fees. In the end, you have to choose what works best for your team. You can either invest in more control by handling it in-house or spend more to move faster with outside help.

Quality

Even with clear instructions, people make mistakes when labeling data. They might rush or get tired. If you don’t have strong quality checks in place, you can end up with bad data that looks fine at first but causes problems later. That’s why regular reviews, clear guidelines and multi-step checks are important.
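One standard check here is inter-annotator agreement. The sketch below computes Cohen's kappa between two annotators who labeled the same items; kappa corrects raw agreement for agreement expected by chance, with values near 1.0 meaning strong agreement. The example labels are made up:

```python
# Cohen's kappa between two annotators on the same items. Kappa corrects
# raw agreement for chance agreement; values near 1.0 mean strong
# agreement. The example label sequences are made up.
from collections import Counter

def cohens_kappa(a: list[str], b: list[str]) -> float:
    assert len(a) == len(b) and a, "need two equal-length, non-empty lists"
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum((ca[l] / n) * (cb[l] / n) for l in set(a) | set(b))
    return (observed - expected) / (1 - expected) if expected != 1 else 1.0

ann1 = ["pos", "pos", "neg", "neu", "neg", "pos"]
ann2 = ["pos", "neg", "neg", "neu", "neg", "pos"]
print(round(cohens_kappa(ann1, ann2), 3))
```

A consistently low kappa usually means the guidelines are ambiguous rather than the annotators careless, so it is a prompt to rewrite the instructions, not just to re-review the labels.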

Bias

Bias is another risk. If labels come from a narrow point of view or treat certain groups unfairly, those issues get built into the model. To avoid that, your labeling guidelines should respect differences in language, culture and identity.

Final thoughts

In this guide, we explored how to bring human annotation and validation into web-scraped data workflows. We walked through the tools, tradeoffs and key decisions involved in turning messy, scraped content into clean, reliable training data.

Web-scraped data is easy to collect but unstructured by default and rarely clean enough to use as-is. If you’re building real models, you need to clean and label it properly. You can build the pipeline yourself or bring in external help. What matters is owning the process.