How to Remove Bias from Data for Machine Learning

The saga of bias in machine learning is seemingly a never-ending story. A system can become biased in so many ways. Bias doesn't always cause problems, and sometimes it can be easily accounted for, but a biased model can also have disastrous consequences, not only for the company but for the reputation of artificial intelligence/machine learning (AI/ML) as a whole.

We highlighted some particularly embarrassing moments, along with an introduction to bias in machine learning, in an earlier post. Here, we'll briefly describe bias and then focus on ways to remove, or at least reduce, certain kinds of bias in ML data.

What Kinds of Bias Are There in ML?

There is a litany of bias types in the ML sphere. Many of them stem from the quality of the training data. Others are inherent: temporal bias, for example, arises because the subject matter changes over time. Bias can be introduced outside of the model, such as in how the model is applied, or it can exist in the model concept itself. For example, an attempt to use ML to predict specific types of intelligence based on the size and shape of someone's head would produce a biased system for predicting intelligence.

ML has so many applications that it can be difficult to pinpoint exactly what "bias" means. In the case of a bathroom scale, for example, the mechanism may be biased toward indicating a heavier weight. In image recognition, the model may have difficulty distinguishing visually similar subjects. In the most famous case, Google mislabeled a human as a primate, fueling general distrust of AI/ML. Bias in ML has downstream consequences, such as unfairness and injustice.

In all cases, the predictions made using ML are driven by data and algorithms that transform the data into predictive models—and because algorithmic bias is less common, we will focus on data bias.

Bias in Data

When it comes to bias in the data, it really boils down to the features of the data. For something as simple as a bathroom scale whose readings are biased +3 units on the heavy side, the model itself can be offset -3 units to compensate. Other cases are more complex. Data collected from a self-selecting group of people may be biased toward some facet of the group that caused them to participate. Perhaps swaths of groups are missing entirely or underrepresented in the data. Migratory or seasonal patterns might skew the data.
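As a minimal sketch of that kind of offset correction (the readings and the 3-unit offset here are illustrative, not from a real dataset):

import pandas as pd

# hypothetical readings from a scale known to read 3 units heavy
readings = pd.Series([73.0, 68.5, 81.2])

# subtract the known offset to compensate for the measurement bias
corrected = readings - 3.0
print(corrected)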

For image datasets, the background, environment, quality, or the object itself could contribute to bias. For example, if you have a dataset of satellite images of the Amazon Rainforest, the images might be biased toward daytime or clear skies. In that case, the model would not be very good at classifying images taken at night.

Data quality control is one of the first steps to reducing bias in ML. This comes back to the beginning of the process: data collection. Revisiting our bathroom scale example, it may be possible to calibrate the scale prior to taking measurements. Data acquired from uncontrolled sources or tagged by untrained individuals may also bias the model.

Analyze the Data

One of the first steps in model training is data analysis. You can analyze data in two ways: quantitatively and qualitatively. A quantitative analysis involves tools like NumPy and Pandas; a qualitative analysis requires knowledge of the subject matter. The combination of the two provides a powerful guard against unwanted bias in the model.

Examples of Quantitative Analysis

You can use Python and the Pandas library to analyze data in some general ways. It's easy to get a good sense of the data using a few of the methods on the DataFrame type.

import matplotlib.pyplot as plt
from pandas import read_csv

# read a csv into a DataFrame referenced by df
df = read_csv('data.csv')

# print a table of count, mean, std, and quartiles for each column
print(df.describe())

# print a table of pairwise correlations between columns
print(df.corr())

# plot a histogram of each column
df.hist()
plt.show()

Some data lend themselves to various normalization techniques. Even though some types of normalization can introduce bias in a model, they are still useful for analysis. Min-max normalization helps when analyzing data with a boxplot where some columns have a much greater range than others. Z-score normalization compresses the scale while retaining valuable information about the range, which is useful when outliers need to be kept in context. The following code creates new DataFrames with both min-max and z-score normalizations:

# min-max normalization: rescale each column to the range [0, 1]
min_max = lambda x: (x - x.min()) / (x.max() - x.min())

# z-score normalization: center each column at 0 and scale by its std
z_score = lambda x: (x - x.mean()) / x.std()

# apply each normalization column-wise and assign the output to new variables
df2 = df.apply(min_max)
df3 = df.apply(z_score)

You can use the new DataFrames to do further analysis of the data. In particular, calling boxplot() will now enable some visualization of the distribution of data across columns, even when the original scales of the values differ widely.
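For instance, a minimal sketch (assuming matplotlib is installed, since Pandas relies on it for plotting):

import matplotlib.pyplot as plt

# boxplots of the min-max normalized columns, now on a comparable 0-1 scale
df2.boxplot()
plt.title('Column distributions after min-max normalization')
plt.show()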

Examples of Qualitative Analysis

Qualitative analysis is just as important, especially when used in conjunction with quantitative analysis. It takes an understanding of the domain and thoughtful judgment to ensure the data are of good quality. For example, building a model for scoring fashion (clothing, say) would be particularly challenging without a great deal of qualitative analysis. It's good to have enough data samples that show excellent quantitative measures. It's even better to have samples that are currently relevant, properly labeled, and inclusive.

As Stewart implied in his examples of classifying swans using convolutional neural networks (CNNs), you also need to think about samples that are not swans, such as a hunter in a swan-shaped hunting blind, a swan pool toy, or even a swan-shaped cloud. Qualitative analysis of the data might be tedious work, but it will lead to better outcomes.

Preparing the Data

The next step is data preparation. This is your chance to do something with the results of the analysis. Missing or sparse feature data? Can someone generate new data? Will you be able to remove records and still have enough data to reliably train the model? Do you need to go back a step and collect more data?
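Pandas can help answer some of those questions quickly. A short sketch, reusing the df from earlier:

# count missing values per column to see which features are sparse
print(df.isna().sum())

# one option: drop incomplete rows, then check whether enough data remains
df_clean = df.dropna()
print(f'{len(df_clean)} of {len(df)} rows remain after dropping missing values')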

Now, it's already very common to apply transformations to image data, such as zooming, rotating, and the like. You might even get away with altering the color of objects, such as cars, to produce more variations. But when you have many samples of swans and few ducks or hunters, the model will not be very useful when it encounters a pelican. No reasonable amount of processing or layering can make up for inadequate representation of the context.
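As an example of such transformations, here's a minimal augmentation sketch using torchvision (one of several libraries that provide these transforms; the specific choices below are illustrative):

from torchvision import transforms

# a pipeline of random transformations applied to each image at training time
augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),   # small random rotations
    transforms.RandomHorizontalFlip(p=0.5),  # mirror half the images
    transforms.ColorJitter(hue=0.1),         # slight color variation
])

# augmented = augment(image)  # where image is a PIL Image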

If you've concluded that you don't have enough variety in your dataset to properly train the model, you'll have to gather more data. Sometimes that's an option; other times, the opportunity just isn't there. Gathering more data can be a lengthy process, especially if it's first-party data. Sometimes third-party data is available to supplement your own. But even then, finding the right data can be quite involved, especially since the new data would have to be analyzed as well. If it must be done, so be it!

Conclusion

Removing bias from data boils down to two things: removing data or adding data, guided by careful analysis. The steps to analyze and select data have been largely manual so far, and it can be costly to go back to the source and collect more data (in some cases, it's not even possible). Tools like Pandas can speed up some of the processing. You can even filter out a random subset of swan images from the dataset, for example (see the sketch below). But what's even better? Much of the process can now be automated!
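A hypothetical sketch of that kind of filtering with Pandas (the 'label' column and class names here are illustrative):

import pandas as pd

# assume df has a 'label' column identifying each image's class
swans = df[df['label'] == 'swan']
others = df[df['label'] != 'swan']

# keep a random subset of swans so the classes are closer in size
swans_sampled = swans.sample(n=min(len(swans), len(others)), random_state=42)
balanced = pd.concat([swans_sampled, others])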

This post was written by Phil Vuollet. Phil leads software engineers on the path to high levels of productivity. He writes about topics relevant to technology and business, occasionally gives talks on the same topics, and is a family man who enjoys playing soccer and board games with his children.