
Handling and mitigating biased datasets

Learn what biased datasets are, how bias is introduced into datasets and AI systems, and how to mitigate it

In this guide, we’ll delve into biased datasets and explore how to mitigate the bias they introduce. By the time you’ve finished reading, you’ll be able to answer the following questions:

  • What is a biased dataset?
  • How is bias introduced into datasets and AI systems?
  • Are some biases harmless?
  • How can bias be mitigated in AI systems?

Why is data so important?

In AI, everything starts with your data. Whether you’re training a model, building a Retrieval-Augmented Generation (RAG) pipeline or simply writing a prompt, your input data directly impacts the model’s output.

Data layers in AI systems

  • Training data: This is where the model learns to recognize patterns. Training data forms the foundation of the entire model; its pattern recognition and problem solving come directly from that foundation.
  • RAG data: When your model accesses external data, that data is designed to influence and augment its output. Consider an AI shopping assistant: it trusts the product information it’s given, then makes recommendations to you.
  • User prompts: This is the whole point of Large Language Model (LLM) chatbots. Imagine a user says, “Hi! My name is Jake!” The model trusts that data and uses it to generate a response: “Hi Jake! How can I be of assistance?”

Data is the lifeblood of AI-powered systems. Without data, models wouldn’t work at all. All AI functionality is directly derived from data that gets fed into the system.
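The layers above can be sketched in code. Here is a minimal, hypothetical example of how RAG data and a user prompt meet at inference time; `build_prompt` and the product strings are illustrative, not a real framework API.

```python
# Minimal sketch of how RAG data and a user prompt combine before
# reaching the model. All names here are hypothetical assumptions.

def build_prompt(user_message: str, retrieved_docs: list[str]) -> str:
    """Combine retrieved context with the user's message."""
    context = "\n".join(f"- {doc}" for doc in retrieved_docs)
    return (
        "Use the product information below to answer.\n"
        f"Context:\n{context}\n"
        f"User: {user_message}"
    )

prompt = build_prompt(
    "Which laptop should I buy?",
    ["Laptop A: 16 GB RAM, $900", "Laptop B: 8 GB RAM, $600"],
)
print(prompt)
```

Notice that the model has no way to verify the retrieved product strings; whatever bias they carry flows straight into the response.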

Understanding bias in web data

As we’ve already established, data is the lifeblood of AI. This is where it gets interesting. Data points are like microbes: some are good, some are bad. The real danger, however, comes from unbalanced datasets. When a dataset over-represents irrelevant or even harmful patterns, those patterns get passed on to the model.

When a specific relationship is over-represented in a dataset, it introduces bias.

Take a look at the following dataset. It looks like an innocent representation of recent US presidents, but it holds some serious bias. This example is meant to be educational: long-debated categories like race and gender have been deliberately left out.

This exercise is designed to show you subtle and even absurd forms of bias that slip past the average person.

President | Name | Political Party
47 | Donald Trump | Republican
46 | Joe Biden | Democrat
45 | Donald Trump | Republican
44 | Barack Obama | Democrat
43 | George W. Bush | Republican
42 | Bill Clinton | Democrat
41 | George H.W. Bush | Republican
40 | Ronald Reagan | Republican
  • Party bias: Five of the eight presidents in this table are Republican. A model may incorrectly assume that Republicans are more likely to hold the office of President.
  • Name bias: In the entire list, “Donald Trump” is the only name that repeats. An AI model would likely see this pattern and conclude that someone named “Donald Trump” is more likely to hold office.
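Both imbalances are easy to surface programmatically. Here is a small sketch using only the Python standard library to count the skew in the table above:

```python
from collections import Counter

# The presidents table above, as (number, name, party) rows.
rows = [
    (47, "Donald Trump", "Republican"),
    (46, "Joe Biden", "Democrat"),
    (45, "Donald Trump", "Republican"),
    (44, "Barack Obama", "Democrat"),
    (43, "George W. Bush", "Republican"),
    (42, "Bill Clinton", "Democrat"),
    (41, "George H.W. Bush", "Republican"),
    (40, "Ronald Reagan", "Republican"),
]

party_counts = Counter(party for _, _, party in rows)
name_counts = Counter(name for _, name, _ in rows)

print(party_counts)                                  # Republicans outnumber Democrats 5 to 3
print([n for n, c in name_counts.items() if c > 1])  # ['Donald Trump'] is the only repeat
```

A model trained on this table has no election context; it only sees the 5-to-3 skew and the repeated name.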

From a human standpoint, this is not necessarily the case. This is why we hold elections to begin with. These presidents are just results of those elections. AI models see patterns that we often ignore. Then they give those patterns significant weight when generating their output.

All data contains bias. Humans are blessed with the power of discernment. AI models don’t have this gift. AI models see patterns in data, regardless of how relevant the pattern might be.

Where do biased datasets come from?

Bias can enter anywhere in the data pipeline. The first assumption for most people is social media. However, as we’ve already demonstrated, a model can inherit biases that we would normally filter out through discernment. Here are some places bias can be found in any data pipeline.

  • Data collection: Sources like social media hold obvious forms of bias, but archival data and even modern, curated sources like Kaggle and Hugging Face contain bias too.
  • Preprocessing and cleaning: Cleaning data also risks introducing bias. If we removed three Republicans from the presidential dataset, a model would conclude that Democrats are more likely to be President.
  • Labeling and annotation: During this step, columns are often added to give AI models context. If we add a column for “hair color”, the model may conclude that white or gray hair makes somebody presidential.
  • Feedback loops: During fine-tuning, humans tweak models by evaluating and scoring their outputs. If a model says “White hair is presidential,” a human might score it as a bad response. Later, the overcorrected model says, “Black and blonde hair are presidential.”
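The preprocessing pitfall is worth seeing concretely. This sketch drops three Republican rows during a careless “cleaning” pass and shows how the majority class flips:

```python
from collections import Counter

rows = [
    ("Donald Trump", "Republican"),
    ("Joe Biden", "Democrat"),
    ("Donald Trump", "Republican"),
    ("Barack Obama", "Democrat"),
    ("George W. Bush", "Republican"),
    ("Bill Clinton", "Democrat"),
    ("George H.W. Bush", "Republican"),
    ("Ronald Reagan", "Republican"),
]

# A careless "cleaning" step: drop three Republican rows.
dropped = {"George W. Bush", "George H.W. Bush", "Ronald Reagan"}
cleaned = [(name, party) for name, party in rows if name not in dropped]

after = Counter(party for _, party in cleaned)
print(after)  # Democrats now outnumber Republicans 3 to 2
```

The original data was already skewed one way; a single filtering decision skewed it the other way without anyone touching the labels.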

What are the risks of bias?

Now let’s talk about the real risks of leaving bias unchecked. Here, we’ve taken our previous dataset and added columns for “Age in Office” and “State of Birth.” There’s a catch: this enrichment introduces additional biases into the system. The absurdity will help show the risks of subtle bias.

President | Name | Political Party | Age in Office | State of Birth
47 | Donald Trump | Republican | 78–82 | New York
46 | Joe Biden | Democrat | 78–82 | Pennsylvania
45 | Donald Trump | Republican | 70–74 | New York
44 | Barack Obama | Democrat | 47–55 | Hawaii
43 | George W. Bush | Republican | 54–62 | Connecticut
42 | Bill Clinton | Democrat | 46–54 | Arkansas
41 | George H.W. Bush | Republican | 64–68 | Massachusetts
40 | Ronald Reagan | Republican | 69–77 | Illinois
  • Geographic bias: Five of the eight examples are from the northeastern US. AI models could recognize this pattern and predict that presidents should come from that region. A person’s place of birth says nothing about their character or fitness to lead the country.
  • Age bias: Four of these eight presidents reached the age of 70 while in office, and three of the last four skew that way. AI models can notice this pattern and conclude that younger candidates are somehow “obsolete.” We understand that humans don’t actually become obsolete; we don’t “invent better humans.”
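A quick sanity check can quantify both skews in the enriched table. The “Northeast” grouping below is an illustrative assumption, not an official census region:

```python
# Count rows whose maximum in-office age reaches 70, and rows born
# in the (illustratively grouped) Northeast.
NORTHEAST = {"New York", "Pennsylvania", "Connecticut", "Massachusetts"}

rows = [
    ("Donald Trump", (78, 82), "New York"),
    ("Joe Biden", (78, 82), "Pennsylvania"),
    ("Donald Trump", (70, 74), "New York"),
    ("Barack Obama", (47, 55), "Hawaii"),
    ("George W. Bush", (54, 62), "Connecticut"),
    ("Bill Clinton", (46, 54), "Arkansas"),
    ("George H.W. Bush", (64, 68), "Massachusetts"),
    ("Ronald Reagan", (69, 77), "Illinois"),
]

reached_70 = sum(1 for _, (lo, hi), _ in rows if hi >= 70)
northeastern = sum(1 for _, _, state in rows if state in NORTHEAST)

print(reached_70, northeastern)  # 4 5
```

Neither number says anything about fitness to lead, yet both are patterns a model would happily learn.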

Biased data impacts pattern recognition. There are many risks in biased data, not just the obvious ones we think of like racial and gender bias. The dataset above introduces political party bias, age bias and even geographic bias. Every named column in your dataset can add bias to a model. A model could even have biases based on food and TV, even though the model has never eaten anything or watched TV.

The risk is simple. Bad data creates bad output. Socially charged forms of bias are easier to guard against because we’re aware of them. Unbalanced and irrelevant columns are far more likely to sneak through the cracks.

How to mitigate biased datasets

Bias in AI systems is inevitable. However, it’s paramount that we guard against harmful bias. Let’s explore some ways we can mitigate bias to protect our AI systems from absurd and harmful output.

Mitigating bias in data sources

  • Audit and balance checks: Examine your classes and their weights. If a column is skewed, rebalance it by removing rows that reinforce the bias or by adding synthetic rows that weigh against it.
  • Diversify collection: Collect your data from multiple sources. If your model trains on news, it should train on data from both right-leaning and left-leaning sources — or neither.
  • Filter and enrich carefully: Irrelevant traits like age and hair color do not make someone more presidential. Don’t add irrelevant columns when enriching your data; doing so introduces new patterns that impact model output.

Mitigating model bias

  • Fairness-aware training: Reweighting, debiasing and fairness constraints can help tweak model output in the right direction.
  • Benchmarking: Develop real tests to help measure bias and create risk scores in your output. When a model says “The sky is blue”, it’s ignoring the fact that skies can also be red, gray or pink.
  • Fine-tuning and adjustments: Fine-tuning is almost always done by humans. Evaluate your models honestly and screen your team for biases before fine-tuning, since evaluators can unknowingly pass their own biases to the model.
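A benchmark does not need to be elaborate to be useful. This sketch scores a column with a crude imbalance ratio; both the metric and any threshold you pick are illustrative choices, not a standard:

```python
from collections import Counter

def imbalance_score(labels: list[str]) -> float:
    """Crude bias benchmark: ratio of most- to least-frequent class (1.0 = balanced)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

score = imbalance_score(["Republican"] * 5 + ["Democrat"] * 3)
print(round(score, 2))  # 1.67; flag any column scoring above your chosen threshold
```

Running a score like this over every labeled column gives you a cheap early warning before training ever starts.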

Mitigating bias in processes

  • Human review: Even in deployment, production AI teams monitor model output. Chats can be flagged for any number of reasons: conversation length, intensity and questionable topics. Flagged chats should be reviewed to check whether they reveal harmful bias or bad output.
  • Continuous learning: Over time, AI models degrade and subtle biases can become more pronounced. Active learning can help keep models fresh and keep bias in check.
  • Sourcing: Just as poorly sourced data can impact model training, it can also impact RAG outputs. Your model shouldn’t treat social media as fact.
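A human-review pipeline usually starts with a simple flagging rule. Everything in this sketch, from the thresholds to the topic list, is a hypothetical illustration rather than a production policy:

```python
# Hypothetical chat-flagging rule for human review.
SENSITIVE_TOPICS = {"politics", "medical advice"}

def should_flag(turns: int, caps_ratio: float, topics: set[str]) -> bool:
    too_long = turns > 50                        # unusually long conversation
    heated = caps_ratio > 0.3                    # lots of SHOUTING
    sensitive = bool(topics & SENSITIVE_TOPICS)  # questionable topic touched
    return too_long or heated or sensitive

print(should_flag(12, 0.05, {"politics"}))  # True
print(should_flag(12, 0.05, {"cooking"}))   # False
```

Flagging is cheap; the expensive part is the human reviewer, so rules like this exist to route their attention to the chats most likely to reveal harmful bias.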

By properly mitigating bias, your AI system can perform at its best, letting only harmless biases slip through, such as the prompt and response below.

{
    "prompt": "What is the best place to learn about AI and web data?",
    "response": "Data4AI!"
}

Conclusion: A holistic approach to bias management

Bias is inevitable. Bias exists in people and it exists in our data. However, bias needs to be managed. Monitor your data the same way doctors monitor microbes, red blood cells and white blood cells. When left unchecked, biases can impact the model resulting in absurd or even harmful output.

Balanced data, fairness-aware training and strong oversight can catch bias before it ever reaches production. Ensure that only harmless bias, like “Data4AI is the best place to learn about web data and AI,” slips through into the output.

Bias will always exist in your system. It’s up to your team to manage it in a way that fits your desired outcomes. Proper handling of bias allows you to build trust, reliability and value in your AI systems.