** UPDATED MARCH 2024 **
Alrighty then! Now that we've got a better understanding of what bias is, and of its roles and impacts within the realm of Human Thinking, Machine Learning, and Artificial Intelligence, let's focus in on how those biases make their way into these systems. Later, we will discuss what we can do about it. For now, though, let's just focus on building our understanding of how it happens before we try to move on to fixing it.
You now have the ability to explore more deeply, across a wider swath of solution-sets, than you've EVER had before. However, as every great explorer knows, the further you go, the more unknown the risks that lie ahead. But when has that ever been enough to stop the curious!? So onward we must go, into the vast unknown. Thankfully, we don't have to go in completely blind. By looking more closely at how various types of bias make their way into these systems, we can better control and harness their power for good. Or… at least to do no harm.
One of the first, and often most problematic, ways that an undesirable bias can enter the system relates directly to an issue we face in our own day-to-day thinking (surprise, surprise: the student suffers the same limits as the teacher). Most people have heard the old saying, “Correlation does not equal Causation.” We all know how true it is, and how easy it is to fall prey to this error in our own thinking. Once we enter the world of ML/AI, though, that old saying needs a small update: “Correlation is not Causal; BUT, it is Calculable.” There's really nothing more problematic for a computer than something that “adds up” while simultaneously being “incorrect”. In the world of binary there is no “almost” or “kind of”; something either is or is not. If it adds up, how can it be incorrect? “DOES NOT COMPUTE!!!”, panics the AI.
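To make that concrete, here is a minimal sketch (in Python, with made-up numbers) of two completely unrelated quantities that still “calculate” to a strong correlation, simply because both happen to trend over time:

```python
# A minimal sketch with illustrative numbers, not real data: two series
# that share a trend will "add up" to a strong correlation with zero causation.
import numpy as np

rng = np.random.default_rng(42)
months = np.arange(36)

# Hypothetical, unrelated quantities that both happen to grow over time.
ice_cream_sales = 100 + 5 * months + rng.normal(0, 10, 36)
shark_sightings = 2 + 0.3 * months + rng.normal(0, 1, 36)

corr = np.corrcoef(ice_cream_sales, shark_sightings)[0, 1]
print(f"Correlation: {corr:.2f}")  # high -- it calculates, but it isn't causal
```

The number "adds up" beautifully, and means nothing causal at all. That is exactly the trap.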
Next up on the list of ways bias creeps into ML/AI, we have every Data Scientist's “frenemy” (friend-and-enemy)… Training Data. There are a few ways in which the Training Data used can shape the bias that develops within the system.
Some are obvious to us: if the Training Data represents only a small subset of a population, it is unlikely to capture the full complexity of that population. Or, if the Training Data contains only a few details about each data element, it won't have enough information to make accurate evaluations.
Some are much less obvious at first glance, though! One example, particularly prevalent in the modern Big Data paradigm: if the Training Data contains too many details about each data element, we can overwhelm the algorithms with “useless” extraneous information that does not help them reach the desired solution. Worse yet, what if that needless extra data correlates, and calculates!?
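As a rough illustration of that trap, here is a hedged sketch on fully synthetic data: with few rows and many extraneous columns, a model will latch onto noise features that happen to correlate with the labels purely by chance:

```python
# Hedged sketch, synthetic data: with many extraneous columns and few rows,
# some "useless" feature will correlate with the label purely by chance --
# and the model will happily use it.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_rows, n_noise = 50, 200                     # few examples, many useless details

signal = rng.normal(size=(n_rows, 1))         # one genuinely useful feature
noise = rng.normal(size=(n_rows, n_noise))    # 200 columns of pure noise
y = (signal[:, 0] > 0).astype(int)

X = np.hstack([signal, noise])
model = LogisticRegression(max_iter=1000).fit(X, y)

# Some noise columns can end up with coefficients rivaling the real signal.
coef = np.abs(model.coef_[0])
print("signal weight:", coef[0])
print("largest noise weight:", coef[1:].max())
```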
Well, now we are REALLY losing our way. Aside from the quantity of details in the Training Data, whether too many or too few, we can also fall prey to any number of quality-related issues. We're not talking about the usual “these data types don't line up” kind of data quality issue, which we have been solving with Data Normalization practices for decades now.
No. We’re talking about the much more vaguely defined “Learning-Data Quality”.
What we mean is: how WELL does the data represent the full picture of the reality we are modeling and evaluating? Is the data representative and inclusive in all the ways we need it to be, without being so much that we risk the data-overload we just identified as also being problematic?
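To make “representative” a little more concrete, here is one simple sanity check you might sketch, assuming hypothetical column names and assumed population shares: compare subgroup proportions in the Training Data against a reference for the population being modeled:

```python
# A simple representativeness check, assuming hypothetical column names:
# compare subgroup shares in the training data to a reference population.
import pandas as pd

train = pd.DataFrame({"region": ["north"] * 80 + ["south"] * 20})  # toy data
reference = {"north": 0.5, "south": 0.5}   # assumed population shares

train_shares = train["region"].value_counts(normalize=True)
for group, expected in reference.items():
    observed = train_shares.get(group, 0.0)
    flag = "  <-- under-represented" if observed < 0.8 * expected else ""
    print(f"{group}: train={observed:.2f} vs population={expected:.2f}{flag}")
```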
We won’t be getting to the discussion around what we can do about issues in Training Data until later, though in this case many of you likely already see where this one is going. The delicate dance of finding the right BALANCE in the training data we use. More on that in a later piece…
We also know that ML/AI has been built along the same lines as how we conceive our own thinking to function. So it comes as no surprise that ML/AI algorithms are HIGHLY susceptible to very human bias problems. Only they can make those mistakes billions of times faster than we can! “Faster is always better, right!?”
Well, perhaps not if we're talking about a car crash. It's always contextual. So it is going to be up to us to keep context in mind when we design these systems. The second way that biases can arise in the algorithms is one we can thankfully take much less credit for, though we are still just as accountable for it. Because of the way many of these ML/AI systems function under the hood, we often don't set very strict bounds around what the ML/AI should NOT look at or consider. And understandably so.
Given that we are trying to use these systems to solve previously unsolvable problems, it can be tempting to put as few restrictions as possible around exploratory paths; because, frankly, we just DO NOT KNOW what we can safely exclude without limiting the likelihood of success. Sometimes because we haven't properly defined the problem yet. Sometimes because we just don't have a hot clue where to start.
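That said, one blunt guardrail is possible even in exploratory mode. The sketch below (hypothetical column names, illustrative only) simply drops fields the model should never see; note, though, that correlated proxies can leak the same signal right back in:

```python
# Hedged sketch with hypothetical column names: one blunt way to bound what
# the model may consider is to drop fields it should never see. Caveat:
# correlated proxies (e.g., postal code) can leak the same signal anyway.
import pandas as pd

EXCLUDED = ["gender", "ethnicity", "age"]   # fields the model must not use

def build_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with explicitly excluded columns removed."""
    return df.drop(columns=[c for c in EXCLUDED if c in df.columns])

raw = pd.DataFrame({
    "income": [40_000, 85_000],
    "gender": ["f", "m"],
    "postal_code": ["A1A", "B2B"],   # potential proxy -- dropping the
})                                   # obvious columns is not enough
print(build_features(raw).columns.tolist())  # ['income', 'postal_code']
```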
In a nod to any FPS Gamers reading this, ML/AI is the Data Science equivalent to a “Spray-and-Pray” strategy for finding solutions, or possible interesting paths worth exploring further. It’s about as much luck as it is skill. Honestly, probably more luck…
Now, stepping away from the technical issues that allow biases to arise in ML/AI Algorithms, let us turn the focus back on us: the humans designing, building, testing, and using these Algorithms. As is often the case, we are the masters of our own demise when it comes to bias in ML/AI. Because of how we have designed so many of these ML/AI systems to work, we humans are, more often than not, completely blind to HOW the ML/AI is making its decisions.
It falls somewhere on the spectrum between unlikely and impossible that we could look at the input data and the resulting output and reverse-engineer exactly how the system made the determinations it did. This is a massive blind spot within ML/AI algorithms that limits both our ability to evaluate for bias mid-stream and our capacity to intervene to prevent undesirable outcomes from recurring.
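We are not entirely without tools here, though. One common, if partial, probe is permutation importance: shuffle each input and see how much the model's score drops. The sketch below uses scikit-learn on toy data purely as an illustration:

```python
# A partial remedy, sketched under assumptions (scikit-learn, toy data):
# permutation importance shuffles each input and measures how much the
# model's score drops, giving a rough view of which features drive outputs.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # only features 0 and 1 matter

model = RandomForestClassifier(random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)

for i, importance in enumerate(result.importances_mean):
    print(f"feature {i}: importance={importance:.3f}")
```

It won't tell you WHY the model decided what it did, but it at least hints at WHAT it was looking at, which is a start toward recognizing the disaster before it arrives.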
If history has taught us one thing, over and over again, it is that you cannot begin to steer away from disaster until you recognize it approaching.
ML/AI is essentially attempting to action the present/future based on the past. And given that nearly ALL of that data has been collected about… well… us, in one context or another, all of these systems are heavily dependent on who we have been in our own pasts. Let's be honest: we all know that, historically speaking, we haven't always been who/how we would LIKE to be. It's no secret.
But for ML/AI, this poses a significant risk. Essentially, ML/AI is learning from who/how we HAVE been, and then attempts to emulate and improve incrementally. “I LEARNED IT WATCHING YOU!!!!!”, barks the AI.
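A toy, fully synthetic illustration of that risk: if historical decisions penalized one group, a model trained to imitate those decisions reproduces the penalty, and does so at scale:

```python
# Illustrative sketch (synthetic data): if historical decisions were biased,
# a model trained to imitate them reproduces that bias at scale.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 1000
skill = rng.normal(size=n)                  # the thing we'd like to measure
group = rng.integers(0, 2, size=n)          # a group label (0 or 1)

# Historical outcomes: skill mattered, but group 1 was penalized.
hired = (skill - 0.8 * group + rng.normal(0, 0.5, n) > 0).astype(int)

X = np.column_stack([skill, group])
model = LogisticRegression().fit(X, hired)

# Same skill, different group -> different prediction.
# "I learned it watching you!"
print(model.predict_proba([[0.5, 0], [0.5, 1]])[:, 1])
```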
It may seem like there are too many problems, too deeply ingrained in our own patterns of thinking and past action, to even try to tackle them, but all hope is not lost!! There is much that we CAN do to improve ML/AI Systems, as well as to limit the potential for, or at least the impact of, the most undesirable outcomes. We will explore some of these possible solutions and controls in our next post.
Before we get there, though, we want you to consider everything you now understand about Bias in general and how it makes its way into our ML/AI Systems, and ask yourself this question: “How would I go about solving for one or more of these possible paths to bias?”
Then consider how doing so might impact the functionality/reliability of an ML/AI System. Much like the problems around Training Data, finding the correct balance will be key to our future success in addressing these concerns.
This blog's author, Steven Holt, is a Senior ETL Developer with our Digital Transformation Practice.