4 Tips for Recognizing and Avoiding Analytics Bias

One of the key cornerstones of the emerging field of ethical, explainable AI is recognizing and avoiding bias. As AI takes on a greater role in organizations with sometimes opaque calculations, there is an increased urgency in many businesses to get ahead of these challenges, and companies, such as IBM, Salesforce and Microsoft, have already added roles specifically with the aim of ensuring that ethics are a key consideration of AI.

And, while many things about the ethics of AI are subtle and require sophisticated thought and debate, such as the gray areas of data privacy and the fuzzy line between useful and invasive information harvesting and monitoring, most considerations of bias have not fundamentally changed. Those considerations can still be – and are – informed by statistical models that were built by people with pocket protectors, pencils and pads of paper, models which often discriminated against people by age, gender, race or other classes. The discrimination that arose from those models led, particularly in the financial services arena, to regulations and legislation born of a need for vigilance: a recognition that we must guard against adverse lending practices, improper interest rate calculations, unfair housing practices and other forms of inequity.

So, why is this still an issue? In short, AutoML and other AI techniques often employ black box approaches that make it difficult to tell whether attributes such as age, gender and race are being used in an evaluation. The good news is that there are some simple techniques that can help to avoid AI-bias traps.

Choose Your Data Thoughtfully

A few months ago, I was having a discussion with geneticist and popular BBC personality Adam Rutherford about bias in the context of AI. Adam knows a thing or two about this, as he has written a book called "How to Argue with a Racist," in which he talks about his time spent in online chat rooms full of hate speech. He told me that most people misunderstand the bias in data. It’s not so much that the data out there is biased, but more about how we select the data to perform our analytics.

In some cases, this can be an act of omission, such as the now notorious example of soap dispensers that do not recognize dark skin tones because they were never trained to see them. However, there are also many acts of commission: cherry-picking specific data that is meant to reinforce one’s argument rather than seeing the entire picture. The 1994 book "The Bell Curve" is often thought of in this light, because of its effort to link race to intelligence.

Thus, as you take on a problem or opportunity, think carefully and broadly about the data available to you. Cast a wide net with some potentially unconventional sources. And understand the data. If you choose the U.S. voting numbers of 1900, there would be almost no female votes, as women did not gain the right to vote nationwide until 1920. And, while people of color were constitutionally eligible to vote, local restrictions blocked many black votes from being cast. The data itself is not biased. It simply reflects a time and place. Failure to understand this context will invariably lead to a misunderstanding and, generally, to bias.

And, one more point on this. Philip Tetlock popularized the concept of the fox and the hedgehog in his famous analysis of effective forecasting. Quoting Isaiah Berlin’s essay (which, in turn, borrows from a famous poem by the Greek poet Archilochus), he outlines how the fox knows many things, but the hedgehog just one. Vikram Mansharamani explains how the hedgehog way of thinking, which involves entering into an analytical situation with a fixed point of view, generally blinds the person to what the data is suggesting. This is a warning for political, economic and practical everyday decisions.
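As a minimal sketch of how the "act of omission" case could be caught before any modeling starts, the check below compares each group's share of a sample against a reference share and reports large gaps. The function name, the tolerance and the numbers are all assumptions made for illustration; the reference shares are round figures, not census data.

```python
from collections import Counter

def representation_gaps(sample_groups, reference_shares, tolerance=0.10):
    """Return {group: (observed_share, expected_share)} for every group whose
    share of the sample differs from the reference by more than `tolerance`."""
    counts = Counter(sample_groups)
    total = len(sample_groups)
    gaps = {}
    for group, expected in reference_shares.items():
        observed = counts.get(group, 0) / total if total else 0.0
        if abs(observed - expected) > tolerance:
            gaps[group] = (observed, expected)
    return gaps

# Made-up sample in the spirit of the 1900 voting example: one group is absent.
sample = ["male"] * 100
reference = {"male": 0.5, "female": 0.5}  # illustrative round numbers only

gaps = representation_gaps(sample, reference)
for group, (observed, expected) in gaps.items():
    print(f"{group}: {observed:.0%} of sample vs {expected:.0%} expected")
```

A gap reported here does not mean the data is wrong. It is a prompt to ask what time, place or selection process produced the sample.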

Data Obfuscation for Sensitive Data

The most sure-fire way to ensure that people do not create algorithms based on elements such as age, gender and race is to remove them from the data prior to use. This is easier said than done. The first step is to tag discrete data elements carefully, with obfuscation policies applied at the time the data is extracted for model building. This in itself is a difficult process that requires a very mature data governance structure. Data scientists often complain that they could not possibly build models based on anything but clear data, but this is not the case. They often use sensitive data to match records across sources, but those values can be hashed and still match. They will often say that they need things like name and address for geo analysis, when in fact postal codes will do.

That said, in many models, such as retail next-best action, it may be perfectly acceptable to use protected data. Perhaps men and women have different propensities for the movies that they watch or the things that they buy, and most would agree that this use of sensitive data is not harmful. However, when it starts to leak into the realms of discount offers or mortgage decisions, it can lead to big problems.
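The sketch below illustrates two of the points above under stated assumptions: protected fields are dropped at extract time, and a matching key is replaced with a keyed hash so that records from two sources can still be joined. The field names, the sample rows and the secret are hypothetical, and a real obfuscation policy would come from your data governance process rather than from constants in code.

```python
import hashlib
import hmac

# Hypothetical field names; adapt them to your own data dictionary.
PROTECTED_FIELDS = {"age", "gender", "race"}
MATCH_KEY_FIELDS = {"email"}

def pseudonymize(value: str, secret: bytes) -> str:
    """Return a keyed SHA-256 digest of a normalized identifier."""
    normalized = value.strip().lower().encode("utf-8")
    return hmac.new(secret, normalized, hashlib.sha256).hexdigest()

def obfuscate(record: dict, secret: bytes) -> dict:
    """Drop protected fields, hash match keys, pass other fields through."""
    cleaned = {}
    for field, value in record.items():
        if field in PROTECTED_FIELDS:
            continue  # never reaches the modeling data set
        if field in MATCH_KEY_FIELDS:
            cleaned[field + "_hash"] = pseudonymize(value, secret)
        else:
            cleaned[field] = value
    return cleaned

# Assumption: in practice the secret is managed outside the code.
secret = b"replace-with-a-managed-secret"

crm_row = {"email": "Ana@Example.com ", "gender": "F", "postal_code": "02139", "spend": 120}
web_row = {"email": "ana@example.com", "age": 41, "postal_code": "02139", "visits": 7}

crm_clean = obfuscate(crm_row, secret)
web_clean = obfuscate(web_row, secret)

# The hashed keys still line up, so the two sources can be joined.
records_match = crm_clean["email_hash"] == web_clean["email_hash"]
print(records_match)
```

A keyed hash (HMAC) is used here instead of a bare hash because common identifiers such as email addresses are easy to guess and re-hash. That is a design choice of this sketch, not something the approach strictly requires.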

Review Your Features

Modern machine learning techniques will generally create algorithms that leave a breadcrumb trail of features. Make sure that you review the features that are most important to the model. Even if your math and technology skills are rusty, you should be able to understand the key data that is contributing to an outcome. And, once these features become evident, it is important to review them to see whether it is ethically within bounds to use them for your particular purpose. Consider asking someone who is independent of the analytic process to help in this respect.
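One way such a review could be supported is sketched below: rank the model's features by importance and mark any sensitive attribute that is not approved for the stated purpose. The importance scores, the purposes and the approval lists are all invented for illustration. In practice the scores would come from whatever your modeling tool reports, and the approvals from your own ethics or compliance review.

```python
# Hypothetical policy: which sensitive attributes are acceptable for which purpose.
SENSITIVE = {"age", "gender", "race"}
ALLOWED_BY_PURPOSE = {
    "retail_recommendation": {"age", "gender"},
    "mortgage_decision": set(),
}

def review_features(importances: dict, purpose: str, top_n: int = 5) -> list:
    """Return (feature, importance, verdict) for the top_n features.

    The verdict is "escalate" when a sensitive attribute is not approved
    for the given purpose, and "ok" otherwise.
    """
    allowed = ALLOWED_BY_PURPOSE.get(purpose, set())
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    report = []
    for name, score in ranked[:top_n]:
        verdict = "escalate" if name in SENSITIVE and name not in allowed else "ok"
        report.append((name, score, verdict))
    return report

# Made-up importance scores for illustration only.
importances = {
    "income": 0.41,
    "tenure_months": 0.27,
    "gender": 0.18,
    "channel": 0.09,
    "basket_size": 0.05,
}

mortgage_report = review_features(importances, "mortgage_decision")
retail_report = review_features(importances, "retail_recommendation")

for name, score, verdict in mortgage_report:
    print(f"{name:15s} {score:.2f} {verdict}")
```

The same feature list gets a different verdict depending on the purpose. A check like this only surfaces candidates, and the judgment about whether a feature is within bounds still belongs to a person.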

The newest fad in Artificial Intelligence is Explainable AI, which builds the same kind of feature insight as more conventional Machine Learning. Without this explainability, use AI for sensitive decisions with extreme caution.

Demand Story Telling

One of the tragedies of our modern life is that we want and accept things in sound bites. Chocolate is good for your heart. Women shop but men buy. Frankly, the simpler the analytic description, the thinner the support and the less useful the analytics will inevitably be.

Rather, when listening to an analysis, ask the 5 whys. Get to the root cause. Make sure that the model has multiple features. Ask about exceptions where the analytics don’t hold. While you may say that this is just common sense, it is also a way to listen closely for tropes and biases. If you hear a compelling, well-argued analysis that is free of them, then it is more likely to have been thoughtfully developed and to be devoid of bias.


We’re human. And, as such, we bring our history, personality and biases to everything that we do. We just need to be careful, as we build models, that these subtle biases don’t disproportionately affect the people who least deserve it. Luckily, as AI transparency increases and social consciousness of ethical AI rises, there is reason to hope that we can and will do better.

One of the key cornerstones of the emerging field of ethical, explainable #AI is recognizing & avoiding #bias. How can your organization avoid it? @Qlik's @JoeDosSantos offers up some tips.

