AI is developing faster than ethical frameworks can keep up. In this piece, we examine some of the ethical questions that AI currently raises. AI integrations are showing up everywhere from travel agents to defense systems.
One thing we often forget is that AI mirrors its training data. This article attempts to answer the following questions.
- Why does all data contain bias?
- What are some common sources of data contamination?
- What’s the difference between public and private web data?
- What are some common concerns surrounding AI data?
- How does this impact intellectual property (IP) and likeness rights?
- How can you responsibly source your own AI web data?
Artificial intelligence and morality
When we think of AI data, we often stop at “legal vs. illegal” rather than asking “right vs. wrong.” This outlook accelerates development but often overlooks some of the finer nuances in AI and the greater software industry today.
Before moving on, we need to establish a few things.
- AI-powered decisions: AI models are excellent for risk mitigation, but they are not an adequate substitute for basic human moral agency. Even the best models need human review. These models can help you think. They cannot replace your moral compass.
- Models reflect their data: AI models are built from training data. Their outputs and performance mirror this data. “Garbage in = garbage out” — this saying is as old as software itself. This concept is just as relevant today as it was in 1970.
- Public web data: When data is public, users can access it without using a login or some other form of authentication. Public web data is generally considered acceptable for data pipelines. However, teams need to practice proper curation even with public data.
- Private web data: If a login or auth token is needed for access, the data is considered private. Private data belongs to the company providing you with access. Responsible teams avoid using private data in their AI systems.
- Likeness: In many jurisdictions, people retain the rights to their likeness. If somebody uses your profile picture without your consent, they are using your likeness without your permission. This is a major area of concern as more models use image data.
- Intellectual property: When you publish works online, they are your intellectual property. Although public, GitHub repositories belong to their developers. Blog posts, trademarks and websites belong to their publishers.
When we talk about ethical AI, we need to take all of these things into account.
Bias in web data
In data, bias is unavoidable. However, it’s important to separate harmful things like gender bias from acceptable forms of bias. Acceptable bias reflects a state of normality without harming people.
Take a look at the statement below.
The sky is blue.
Most people don’t bat an eye at this one. We generally treat blue skies as a fact. In reality, this isn’t always true. Depending on weather and time of day, skies can be blue, green, red, orange, gray and many other colors.
- Blue: Clear skies during daytime are often blue.
- Green: Before and during major storms, the sky can turn green.
- Red: “Red sky at night, sailor’s delight. Red sky in the morning gives the sailor a warning.” Red skies are also commonly associated with strong weather activity.
- Orange: At sunset, the sky often turns orange before getting dark.
- Gray: During cloudy and rainy days, the sky is often gray.
The sky can be blue. When we say “the sky is blue,” we accept a common characteristic of the sky as a genuine fact. If we want a concise output, “the sky is blue” is an acceptable model output. If we want to be highly detailed, we’d want an output that outlines other colors like the ones we listed above.
Data quality and contamination
Contamination can come from various points within your data pipeline. The most common points of contamination are sourcing, preprocessing and annotation. Sources contain raw data and annotation provides datasets with context and metadata.
General chatbots are often trained on massive foundations including traditional sources of text like books and articles as well as newer digital-native sources like social media. The tone and quality of web text can vary drastically. Some sources contain acceptable bias like “The sky is blue.” Other sources can contain much more harmful forms such as gender and racial bias. Teams need to carefully review data sources before adding them to the data pipeline.
When we preprocess our data, we remove duplicates and often filter records based on the values in certain columns. When we remove too many data points, biases can actually overcorrect in the other direction. Take archived web data, for example. If we aggressively filter out older pages, modern viewpoints become overrepresented in the patterns the model learns.
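As a rough illustration of that preprocessing step, the sketch below deduplicates a small set of records and then measures how an age-based filter shifts the balance between older and newer pages. The record fields (`text`, `year`), the sample pages and both cutoff years are hypothetical, chosen only to make the effect visible.

```python
from collections import Counter


def deduplicate(records):
    """Drop records whose normalized text was already seen (keeps the first)."""
    seen = set()
    unique = []
    for record in records:
        key = record["text"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def share_by_era(records, cutoff_year):
    """Return the fraction of records published before vs. from cutoff_year on."""
    counts = Counter(
        "older" if record["year"] < cutoff_year else "modern" for record in records
    )
    total = sum(counts.values())
    if total == 0:
        return {"older": 0.0, "modern": 0.0}
    return {era: counts[era] / total for era in ("older", "modern")}


# Hypothetical archived pages; the fields and values are illustrative only.
pages = [
    {"text": "Opinion A", "year": 2004},
    {"text": "Opinion A", "year": 2004},  # exact duplicate
    {"text": "Opinion B", "year": 2008},
    {"text": "Opinion C", "year": 2019},
    {"text": "Opinion D", "year": 2023},
]

unique_pages = deduplicate(pages)
before = share_by_era(unique_pages, cutoff_year=2015)

# An aggressive age filter: drop everything published before 2006.
filtered = [page for page in unique_pages if page["year"] >= 2006]
after = share_by_era(filtered, cutoff_year=2015)

print("before filtering:", before)
print("after filtering:", after)
```

Comparing the two distributions before committing to a filter is one simple way to notice when a cleanup step is quietly tilting the dataset toward one era.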
During annotation, we use added context to enrich datasets. When done properly, this can improve a dataset. When this is done carelessly, we can introduce new pieces of information that actually contaminate the data. When annotation doesn’t highlight the proper patterns, AI models often infer the wrong ones. A model trained on recent US presidency data might see that presidents are getting older. It then might infer that older presidents are better presidents. Proper annotation practices could provide additional context — so the model learns that age is not necessarily a good predictor of performance.
The things above may not seem like genuine ethical concerns. However, we need to remember that people are already beginning to use AI analysis in their decision-making processes. A biased model creates biased outputs, and biased outputs influence real-world human decisions.
Public vs. private web data
Not all web data is equal. Public web data is similar to the town square or the street outside your house. Private web data is accessed via permission and those permissions are given under certain conditions we need to take into account. Regardless of sourcing, we need to look at these things from an ethical standpoint.
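The public/private distinction can be sketched as a simple labeling rule for a data pipeline. This is a minimal sketch, not a legal test: the function name `classify_access` and its two parameters are hypothetical, and it only assumes the standard HTTP meaning of status codes 401 (Unauthorized) and 403 (Forbidden).

```python
def classify_access(status_code, sent_credentials):
    """Return a rough access label for a single HTTP response.

    status_code: the HTTP status code of the response.
    sent_credentials: True if the request carried a login cookie or auth token.
    """
    if sent_credentials:
        # Anything fetched with a login or token is treated as private.
        return "private"
    if status_code in (401, 403):
        # The server refused an anonymous request, so the resource is gated.
        return "private"
    if 200 <= status_code < 300:
        # Served to an anonymous request without authentication.
        return "public"
    # Redirects, server errors and everything else need human review.
    return "unknown"


print(classify_access(200, sent_credentials=False))
print(classify_access(401, sent_credentials=False))
print(classify_access(200, sent_credentials=True))
```

A "public" label here only means no authentication was needed. The copyright, likeness and privacy considerations discussed in this article still apply to whatever the page contains.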
Public web data
When we think of public web data, we often think of landing pages, sales pitches and product pages. However, public data is not limited to this scope. If you’re old enough to remember phone books, you likely remember a time when address and contact information was considered public.
Public availability is not a guaranteed green light. Public web data is often still subject to copyright laws and basic human morality.
- IP concerns: You can scrape web pages and use them for research. You cannot republish them under your own name — that would be plagiarism. You cannot duplicate an open source software project and take credit for it either. Usage is subject to the license it was published under by the developer.
- Image and likeness considerations: Images of famous people — celebrities and politicians — are all over the internet. When a data pipeline uses someone’s image or likeness, this can lead to deepfakes and even unintentional reproduction. There have been cases where AI models produce images, audio or video outputs eerily resembling those of real people — a sort of unintentional deepfake.
- Privacy concerns: Today, contact information such as email addresses, phone numbers and mailing addresses is often still publicly accessible. That said, it would be highly irresponsible to train AI models on personal information. In fact, modern social media tends to yield a much more detailed profile of people, including things such as political alignment and religious views. Training models directly on this information can yield incredibly detailed profiles used for ad targeting or even subtle manipulation.
- Open source software: Open source software (OSS) has been a blessing to developers around the world. However, we need more scrutiny when training models on this code. Models tend to overgenerate, and a large code block is not necessarily a good code block. If you ask a model to generate a “secure hello world program,” you can watch one line of code balloon into ten. At scale, a program that traditionally took 100 lines can explode to around 1,000. Without human scrutiny, we can wind up with well-written codebases that humans can no longer understand, let alone maintain. When our software powers critical defense and healthcare systems, this is no longer a minor annoyance. This is a real ethical concern.
All public web data needs to be scrutinized before being added to a training pipeline. Some public sources include harmful biases or even outright falsehoods. When training on public images, we can run into genuine likeness concerns. In this day and age, personal information is widely available even in public. However, it should not be used to train AI models. Finally, public code repositories need to be reviewed before they make it into the pipeline.
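One way to act on the privacy point above is to redact obvious contact details before text enters a training pipeline. The sketch below is a minimal, assumption-laden example: the two regular expressions and the `[EMAIL]` / `[PHONE]` placeholders are illustrative choices, and simple patterns like these will miss names, mailing addresses and many phone formats. Treat it as a first pass, not a complete PII filter.

```python
import re

# Illustrative patterns only; real pipelines usually need far broader coverage.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_PATTERN = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text):
    """Replace email addresses and phone-like number runs with placeholders."""
    text = EMAIL_PATTERN.sub("[EMAIL]", text)
    text = PHONE_PATTERN.sub("[PHONE]", text)
    return text


sample = "Contact Jane at jane.doe@example.com or +1 (555) 010-9999."
print(redact_pii(sample))
# prints: Contact Jane at [EMAIL] or [PHONE].
```

Note that the name "Jane" survives redaction, which is exactly the kind of gap that still requires human review or a dedicated PII detection tool.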
Public data sources can be found all over the web. Below are some free data sources. Quality can vary based on the provider.
- Hugging Face: An open repository, similar to GitHub, where people publish models and datasets.
- Kaggle: Similar to Hugging Face, teams publish models and datasets for other teams to use.
- LAION: LAION is a non-profit organization that releases large-scale open datasets, such as LAION-5B, for public education and research purposes. As with any web-scale dataset, teams should review the contents before use.
Some reputable paid providers are listed below. These services often provide datasets as well as on-demand data APIs.
- Bright Data: Enterprise grade web data infrastructure. Teams can choose from on-demand data APIs as well as ready-to-use datasets.
- Appen: Appen is one of the oldest AI data providers in the industry. They utilize a global workforce rather than automated systems.
- Decodo: Extract raw web data for further processing and usage in your AI pipeline.
- ZenRows: The Universal Scraper API gives teams a simple way to convert raw data into a structured API.
Private web data
Private web data comes with a completely different set of challenges. For the most part, private data should not be used in training at all. When data is held behind logins or auth tokens, you are subject to a license agreement. Some providers allow usage for AI purposes and some providers don’t.
- Privacy concerns: Private systems can contain personal information such as communications, medical information, financial history as well as other sensitive information. This can lead to data leakage.
- Conditional access: Licenses are usually granted for a specific purpose. You might be licensed to use medical data to speed up diagnosis. Publishing records from that medical data stream falls outside that purpose and creates serious risks for everyone involved.
- Stewardship: When an organization deals in private data, it is usually under a moral and legal obligation to protect it. When that organization gives your company permission to access this data, protecting it becomes your responsibility as well.
When dealing with sensitive information, synthetic data is often the best choice. To produce synthetic data, generative models are first trained on real-world or highly realistic datasets. Once the models have learned the patterns in the data, they generate new datasets reflecting those same patterns. This process is highly effective for teams dealing with financial, medical and other types of mission-critical data.
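The learn-then-generate idea can be shown with a deliberately tiny example. The sketch below "learns" only the mean and standard deviation of a hypothetical list of account balances, then samples new values from a normal distribution with those parameters. The data, function names and the choice of a Gaussian are all assumptions made for illustration. Commercial synthetic data tools use far richer models, and sampling from fitted statistics does not by itself guarantee privacy.

```python
import random
import statistics


def fit_gaussian(values):
    """Learn the 'pattern' of the data: here, only its mean and spread."""
    return statistics.mean(values), statistics.stdev(values)


def generate_synthetic(mean, stdev, count, seed=0):
    """Sample new values that follow the learned mean and spread."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    return [rng.gauss(mean, stdev) for _ in range(count)]


# Hypothetical sensitive values, for illustration only.
real_balances = [1200.0, 950.0, 1430.0, 1100.0, 1010.0, 1320.0]

mean, stdev = fit_gaussian(real_balances)
synthetic_balances = generate_synthetic(mean, stdev, count=1000)

print("real mean:", round(mean, 2))
print("synthetic mean:", round(statistics.mean(synthetic_balances), 2))
```

The synthetic values share the overall shape of the original data without copying any individual record, which is the property that makes this approach attractive for regulated domains.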
Some notable synthetic data providers are listed below.
- Anyverse: Generate synthetic data for self driving cars and defense systems.
- Mostly AI: A privacy-focused provider of synthetic data. Teams use Mostly AI to create anonymized data for use in heavily regulated industries.
- Scale AI: Often used for annotation and model evaluation, Scale AI also provides synthetic data generation.
Security certifications and compliance frameworks
Ethical AI data practices are tied to security and regulatory compliance. While specific frameworks can vary based on jurisdiction, there are a variety of standards and certifications that pertain specifically to web data pipelines and AI.
Some of the most common compliance frameworks are listed below.
- ISO/IEC 27001: The international standard for information security management systems (ISMS). It gives teams a structured plan for managing storage systems and access controls.
- ISO/IEC 27017: A framework extending ISO 27001 with security controls designed specifically for cloud computing environments.
- ISO/IEC 27018: A standard focused on protecting personally identifiable information (PII) stored in public cloud services. This ensures that personal data storage practices follow strict privacy safeguards.
- SOC 2: An independent audit framework evaluating controls around security, availability, processing integrity, confidentiality and privacy.
- SOC 3: A public facing version of SOC 2 designed to provide high-level assurance that systems meet the Trust Services Criteria. SOC 3 reports allow people outside the organization to review public audit summaries.
- GDPR: The European Union’s General Data Protection Regulation governing how organizations handle personal data.
- CCPA: California’s consumer privacy law establishing transparency and control over personal data. The CCPA is often described as a U.S. state-level counterpart to GDPR.
These certifications, frameworks and regulations provide organizations with standardized principles to follow when building secure and compliant data pipelines.
Conclusion
Ethical web data is not a one-and-done solution. This industry is constantly changing. New developments and ethical questions arise every day. Over time, questions about privacy, IP and responsible sourcing will continue to evolve and likely get deeper. The frameworks we develop to address these questions will decide the datasets and outputs of tomorrow.
Public web data can provide an excellent foundation for model training and retrieval-augmented generation. However, this does not eliminate the need for scrutiny when selecting sources or meticulous care when preprocessing and annotating your datasets. Private data raises an even higher level of concern, and teams are often much safer using synthetic data when dealing with potentially sensitive information.
Responsible data practices are just as important as model architecture.
Garbage in = garbage out.