If You’re De-Biasing The Model, It’s Too Late

by James Lennon on August 18th, 2020


Machine learning is a once-in-a-generation technological shift that is creating huge value by unlocking insights from data. But algorithmic bias remains a major concern as ML becomes more widespread. Unless ML models are trained on representative data, they can develop serious biases, significantly harming underrepresented groups and leading to ineffective products. We investigated the CoNLL-2003 dataset—a standard for building algorithms that recognize named entities in text—and found that the data is highly skewed toward male names. Using Scale’s technology we were able to systematically mitigate this bias by:

  1. Enriching the data to illuminate the hidden biases
  2. Enhancing the dataset with underrepresented examples to mitigate gender bias

A model trained on our enhanced CoNLL-2003 dataset has both less bias and better performance—demonstrating how bias can be removed without any changes to the model. We’ve open-sourced the Named Entity Recognition annotations for the original CoNLL-2003 dataset and our augmentation of it here.

Algorithmic bias: AI’s weakness

Today, thousands of engineers and researchers are building systems that teach themselves how to achieve significant breakthroughs—improving road safety with self-driving cars, curing diseases with AI-optimized treatments, and combating climate change by managing energy usage.

But the power of self-learning systems is also their weakness. Because data is the foundation of all machine learning applications, learning from imperfect data can lead to biased outputs.

The power of AI systems means their potential to cause harm if they are biased is significant. Recent protests against the police brutality that has led to the tragic deaths of George Floyd, Breonna Taylor, Philando Castile, Sandra Bland and many others are an essential reminder of the systemic inequalities in our society that AI systems must not perpetuate. But as countless examples show, whether it’s image search results perpetuating gender stereotypes, offender management systems discriminating against black defendants, or facial recognition systems misidentifying people of color, we have a long way to go before we can say the problem of AI bias has been solved.

Bias is prevalent because it is so easily introduced. It creeps up, for example, in gold-standard open-sourced models and datasets that are the foundation of huge volumes of work in ML. The word2vec sentiment dataset for building other language models is skewed by ethnicity, and word embeddings—the way an ML algorithm represents words and their meanings—have been found to contain strongly biased assumptions about what occupations women are associated with.

The problem, and at least part of the solution, lies in the data. To illustrate this, we ran an experiment on one of the most popular datasets for building systems that can recognize named entities in text: CoNLL-2003.

What is Named Entity Recognition?

Named-Entity Recognition (NER) is one of the fundamental building blocks of natural language models—without it, online searches, information extraction, and sentiment analysis would be impossible.

At Scale AI, our mission is to accelerate the development of AI. Natural language is one of our main areas of focus. Our Scale Text offering includes NER—which involves annotating text according to a predefined list of labels. In practice, this might help major retailers analyze how their products are being discussed online, among other applications.

Many NER systems are trained and benchmarked on CoNLL-2003—a dataset of roughly 20,000 sentences from Reuters news articles that are annotated with attributes such as ‘PERSON’, ‘LOCATION’, and ‘ORGANIZATION’.

We wanted to explore if the data was biased. To do this, we used our Scale AI labeling pipeline to categorize all the names in the dataset, asking whether they could be male, female, or either, inferring gender based on the traditional use of the name.

We found a dramatic difference. Male names are mentioned almost five times as often as female names, and less than 2% of names are gender-neutral.
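The tally behind a comparison like this can be sketched in a few lines of Python. This is a minimal illustration rather than the labeling pipeline itself: the function name `gender_distribution`, the `male`/`female`/`either` label strings, and the toy counts are all assumptions made up for the example.

```python
from collections import Counter


def gender_distribution(name_labels):
    """Return the share of each gender label among annotated name mentions."""
    counts = Counter(name_labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


# Hypothetical toy annotations, one label per PERSON mention (not real CoNLL counts).
name_labels = ["male"] * 10 + ["female"] * 2 + ["either"] * 1

shares = gender_distribution(name_labels)
# Ratio of male to female mentions in the toy data.
male_to_female = shares["male"] / shares["female"]
```

Counting mentions rather than unique names matters here, because a model sees every occurrence during training.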

This is because news articles, for societal reasons, tend to contain mostly male names. But an NER model trained on this data will be better at picking out male names than female. For example, search engines use NER models to help classify names in search queries to give more accurate results. But deploy a biased NER model, and the search engine will be worse at identifying names of women than of men—exactly the kind of subtle, pervasive bias that can creep into many real-world systems.

A novel experiment to reduce bias

To measure how this gender bias affects performance, we trained an NER model. We built a name extraction algorithm to pick out PERSON labels using spaCy, a popular NLP library, and trained the model on a subset of the CoNLL data. When we tested the model on names in the test data that weren’t present in the training data, we found that it was 5% more likely to miss a new female name than a new male name, a significant discrepancy in performance.
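One way to compute a per-gender miss rate on unseen names is sketched below. It assumes the model's predictions have already been reduced to a detected/missed flag per test mention. The tuple layout, the function name `miss_rate_on_novel_names`, and the toy data are hypothetical and are not taken from the original experiment.

```python
def miss_rate_on_novel_names(test_mentions, training_names):
    """Miss rate per gender, restricted to names unseen in training.

    test_mentions: iterable of (name, gender, was_detected) tuples.
    training_names: set of names that appear in the training split.
    """
    missed = {}
    total = {}
    for name, gender, was_detected in test_mentions:
        if name in training_names:
            continue  # only score names the model has never seen
        total[gender] = total.get(gender, 0) + 1
        if not was_detected:
            missed[gender] = missed.get(gender, 0) + 1
    return {gender: missed.get(gender, 0) / n for gender, n in total.items()}


# Hypothetical toy data, for illustration only.
training_names = {"John", "Mary"}
test_mentions = [
    ("John", "male", True),     # seen in training, skipped
    ("Pedro", "male", True),
    ("Tariq", "male", False),
    ("Mary", "female", True),   # seen in training, skipped
    ("Aiko", "female", False),
    ("Ngozi", "female", False),
    ("Leila", "female", True),
]

rates = miss_rate_on_novel_names(test_mentions, training_names)
```

Filtering out names seen in training is the important step: it separates what the model has memorized from how well it generalizes to new names.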

We saw similar results when we ran our model on the template “NAME is a person”, substituting the 100 most popular male and female names for each year of the US census. The model performs significantly worse on female names for all years of the census.
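A template probe of this kind can be sketched as follows. The extractor is passed in as a plain function so the example runs without a trained model. `template_recall`, `toy_extractor`, and the `KNOWN` name list are hypothetical stand-ins, not the model or name lists used in the experiment.

```python
def template_recall(names, extract_persons, template="{name} is a person"):
    """Fraction of names the extractor recovers from the filled-in template."""
    if not names:
        return 0.0
    hits = 0
    for name in names:
        sentence = template.format(name=name)
        if name in extract_persons(sentence):
            hits += 1
    return hits / len(names)


# Toy stand-in for an NER model: it only "recognizes" names from a fixed list.
KNOWN = {"James", "Michael", "Jessica"}


def toy_extractor(sentence):
    """Return the tokens of the sentence that the toy model treats as PERSON."""
    return [token for token in sentence.split() if token in KNOWN]


male_recall = template_recall(["James", "Michael"], toy_extractor)
female_recall = template_recall(["Jessica", "Ashley"], toy_extractor)
```

With a real spaCy pipeline loaded as `nlp`, `extract_persons` could instead return `[ent.text for ent in nlp(sentence).ents if ent.label_ == "PERSON"]`. The exact label string depends on the model, and some CoNLL-style models use `PER`.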

Crucially, biased training data means errors are skewed toward underrepresented categories. This census experiment illustrates this a second way, too: the model’s performance degrades significantly after 1997—the cut-off point of the Reuters articles in the CoNLL dataset—because the dataset is not representative of the popularity of names in the years since.

Models learn to fit the trends of the data they are trained on. They can’t be expected to perform well on cases for which they have seen few examples.

If you’re de-biasing the model, it’s too late

So how do we fix this?

One way is to try to de-bias the model, perhaps by post-processing it or by adding an objective function to mitigate bias, leaving it to the model to figure out the details.

But there are several reasons why this is not the best approach:

  1. Fairness is a highly complex issue that we can’t expect an algorithm to define on its own. Research has shown that training an algorithm to perform equally on all subsets of the population will not ensure fairness and will cripple the model’s learning.
  2. Adding extra objective functions can hurt model accuracy, causing a tradeoff. Better instead to keep the algorithm simpler and ensure the data is balanced—improving model performance and avoiding the tradeoff.
  3. It’s unreasonable to expect a model to perform well on cases for which it’s seen few examples. The best way to ensure good results is to improve the diversity of the data.
  4. Attempting to de-bias a model with engineering techniques is expensive and time-consuming. It’s much cheaper and easier to train your models on unbiased data in the first place, freeing up your engineers to focus on applications.

Data is only one part of the bias problem. But it is foundational, affecting everything that comes after it. That’s why we think it holds the key to some of the solution, providing potential systematic fixes at the source. Unless you explicitly label for protected classes, like gender or ethnicity, it is impossible to properly mitigate these classes as a source of bias.

This is counterintuitive. Surely, if you want to build a model that doesn’t depend on sensitive characteristics, like gender, age or ethnicity, it’s best to omit those properties from the training data so the model can’t take them into account?

“Fairness by ignorance” actually makes the problem worse. ML models excel at drawing inferences across features—they don’t stop doing this just because we haven’t explicitly labeled those features. The biases simply remain undetected, making them harder to remove.

The only robust way to deal with the problem is to label more data to balance out the distribution of names. We used a separate ML model to identify sentences in the Reuters and Brown corpora likely to contain female names, and then labeled those sentences via our NER pipeline to augment CoNLL.
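The candidate-selection step can be illustrated with a much simpler technique than the ML model described above: a plain lookup against a list of first names. The sketch below makes that substitution explicitly. `candidate_sentences`, the name list, and the example sentences are all hypothetical.

```python
import re


def candidate_sentences(sentences, female_first_names, limit=None):
    """Return sentences containing at least one token from the name list.

    This is a lexicon lookup, a deliberately simple stand-in for a learned
    sentence classifier. Selected sentences would still need human labeling.
    """
    selected = []
    for sentence in sentences:
        tokens = set(re.findall(r"[A-Za-z]+", sentence))
        if tokens & female_first_names:
            selected.append(sentence)
            if limit is not None and len(selected) >= limit:
                break
    return selected


# Hypothetical inputs, for illustration only.
female_first_names = {"Maria", "Aisha"}
sentences = [
    "Maria Lopez joined the board on Tuesday.",
    "Shares fell 3 percent in early trading.",
    "The minister met Aisha Khan in Geneva.",
]

to_label = candidate_sentences(sentences, female_first_names)
```

A lookup like this only finds names already on the list, which is one reason a learned model is a better fit for surfacing less common names.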

The resulting dataset, which we’ve called CoNLL-Balanced, has over 400 more female names. When we retrained our NER model on it, we found that the model no longer performs worse on female names.

Not only that, but the model also performed better at identifying male names.

This is an impressive demonstration of the power of data. Mitigating bias at the source meant we did not have to make any adjustments to our ML model—saving engineering time. And we achieved this without any tradeoff in our model’s performance—in fact, its performance slightly improved.

To allow the community to build on our work and mitigate gender bias in models built on CoNLL-2003, we’ve open-sourced Scale AI’s augmented dataset, which also includes gender information, on our website.

The AI/ML community has its own diversity problems, but we are cautiously excited by these results. They suggest how we might be able to offer a technical solution to a pressing social problem, provided we tackle the issue head-on by illuminating the hidden biases and improving the model’s performance for everyone.

We’re now investigating how to apply this approach to another highly sensitive attribute, ethnicity, to see whether we can create a robust framework for eliminating dataset bias that scales to other protected classes.

It also illustrates why we at Scale AI pay so much attention to data quality. Unless data is provably accurate, balanced, and unbiased, there’s no guarantee that models built on it will be safe and accurate. And without that, we won’t be able to build the transformational AI technologies that benefit everyone. If you are developing AI technologies and would like help in balancing your own datasets to ensure your models work for everyone, reach out to us.


The CoNLL 2003 Dataset referenced in this blog post is the Reuters-21578, Distribution 1.0 test collection, available on the project page for the original 2003 experiment:
