“When Deep Blue, IBM’s chess-playing supercomputer, beat Garry Kasparov in 1997, computers were still just computers,” noted a recent NY Times Magazine article, “We Don’t Really Know How A.I. Works. That’s a Problem,” by freelance writer Oliver Whang. Deep Blue determined the best next move by simulating and assigning values to board positions up to 12 moves ahead — amounting to billions of positions — using algorithms explicitly programmed by its designers. “There was no mystery around what was going on inside them,” wrote Whang, “even though they were, in a way, intelligent.”
Fifteen years later, everything changed. In 2012, researchers at the University of Toronto developed AlexNet, a neural-network system that identified objects in images far more accurately than previous approaches. AlexNet’s success transformed AI research and accelerated the adoption of deep neural networks across a wide range of applications.
But there was a catch. Unlike Deep Blue, neural networks operate largely as black boxes. As these models become larger and more capable, they also become increasingly difficult to understand. They represent a new kind of machine intelligence whose internal reasoning processes remain poorly understood, even by the researchers building them.
This has led to the growing field of AI interpretability, whose goal is to better understand how modern AI systems actually work internally, especially as they are increasingly deployed in high-stakes applications ranging from healthcare and finance to law enforcement and military systems.
What Is Interpretability?
Interpretability is based on the concept that “in order to narrow or even bridge the expanding knowledge gap between A.I. models and humans, we need to treat A.I. more like a natural phenomenon than a human invention,” noted the NYT article. “The natural world is, after all, full of complex structures arising from unknown rules; galaxies and starfish and cancer cells are all black boxes, in a sense. … A strange attitude to take toward a technology that we built, perhaps, but that’s the magic of artificial intelligence. It can baffle its own creators.”
“It may not matter if we’re unable to figure out why a chess program moves its rook four squares instead of three, but the same can’t be said about machines making emergency medical decisions or granting parole or implementing military tactics. … Imagine being told you need surgery, asking why, and all the doctor can say is, ‘Because a computer said so.’ What if the computer is wrong? We could tolerate such deference only if we trusted the A.I. more than the people who would otherwise make such decisions. And how could we do that if we didn’t even know how the system worked?”
A recent IBM article, “What is AI interpretability,” explained that “As highly complex models (including deep-learning algorithms and neural networks) become more common, AI interpretability becomes more important. Additionally, AI systems and machine-learning algorithms are increasingly prevalent in healthcare, finance and other industries that involve critical or life-altering decisions. With such high stakes, the public needs to be able to trust the outcomes are fair and reliable. That trust depends on understanding how AI systems arrive at their predictions and make their decisions.”
“Interpretability is about transparency, allowing users to comprehend the model’s architecture, the features it uses and how it combines them to deliver predictions.” An interpretable model is one whose decision-making processes are easily understood by humans. “Greater interpretability requires greater disclosure of its internal operations.”
Interpretability is important for several key reasons:
- Trust: Users are more likely to rely on AI systems when they understand how decisions are made, especially in applications such as medicine or finance.
- Bias and fairness: Interpretability helps identify discriminatory patterns based on characteristics such as race, age, or gender, enabling developers to mitigate bias.
- Debugging: Interpretability helps developers pinpoint the sources of incorrect predictions, which in turn improves the model’s performance and overall reliability.
- Regulatory compliance: Interpretable AI models provide clear explanations for their decisions, helping organizations meet regulatory requirements and address auditing, liability, and data privacy concerns.
- Knowledge transfer: Interpretability makes it easier to share knowledge about a model’s underpinnings and decisions among all the various stakeholders.
Nigam Shah, Stanford professor of medicine and a member of the Stanford Institute for Human-Centered Artificial Intelligence, has identified three main types of interpretability:
- Engineers’ interpretability aims to explain how the AI model works and how it reached its output. It involves understanding the model’s internal workings and is relevant to developers and researchers who need to debug or improve the model.
- Causal interpretability focuses on why the model produced its output. It involves identifying the factors that have the greatest influence on the model’s predictions and how changes in these factors affect the outcomes.
- Trust-inducing interpretability provides the explanations needed to build trust in the model’s outputs by presenting the model’s decision-making process in a way that is understandable and relatable to users, even if they do not have technical expertise.
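One way to make the causal question concrete is a what-if sweep: hold an input fixed except for one factor, vary that factor, and watch how the output moves. The sketch below is only an illustration. The `readmission_risk` function is a hypothetical stand-in for an opaque model, and its formula, feature names, and values are invented here rather than taken from any real system.

```python
import math

def readmission_risk(patient):
    """Hypothetical stand-in for an opaque model: returns a risk in (0, 1)."""
    z = 0.04 * patient["age"] + 0.8 * patient["prior_admissions"] - 4.0
    return 1.0 / (1.0 + math.exp(-z))

def what_if(model, instance, feature, values):
    """Vary one feature while holding the others fixed; return (value, output) pairs."""
    results = []
    for v in values:
        changed = dict(instance)  # copy so the original instance is untouched
        changed[feature] = v
        results.append((v, model(changed)))
    return results

# Invented example input; shoe_size is deliberately irrelevant to the model.
patient = {"age": 60, "prior_admissions": 1, "shoe_size": 9}

sweeps = [("prior_admissions", [0, 1, 2, 3]), ("shoe_size", [7, 9, 11])]
for feature, values in sweeps:
    print(feature)
    for value, risk in what_if(readmission_risk, patient, feature, values):
        print(f"  {value}: {risk:.3f}")
```

In this toy setup, the output rises as `prior_admissions` increases and stays flat as `shoe_size` changes, which is the kind of evidence a causal explanation would point to. A real model would need many such sweeps, and interactions between factors can make single-factor sweeps misleading.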
Researchers have also developed multiple dimensions for classifying interpretability techniques:
- Intrinsic vs. post-hoc: Intrinsic interpretability applies to simpler models that are inherently easy to understand, such as decision trees; post-hoc interpretability applies to models that have already been trained and is best suited to more complex models.
- Local vs. global: Local interpretability explains individual model predictions, while global interpretability aims to understand the model’s behavior across the entire dataset.
- Model-specific vs. model-agnostic: Model-specific interpretability methods use an individual model’s internal structure to provide explanations while model-agnostic methods can work with any type of model.
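These three dimensions can be seen side by side in a small sketch. It assumes a hypothetical hand-written decision tree, `loan_tree`, with invented thresholds and feature names. The tree explains a single prediction by reporting the path it took (intrinsic, local, model-specific). The same model is then treated as a black box and probed with a simplified form of permutation importance, which here counts how many predictions change when one feature is shuffled rather than measuring a drop in accuracy against labels (post-hoc, global, model-agnostic).

```python
import random

def loan_tree(applicant):
    """Hypothetical hand-written decision tree; returns (decision, path taken)."""
    path = []
    if applicant["income"] >= 50_000:
        path.append("income >= 50000")
        if applicant["debt_ratio"] < 0.4:
            path.append("debt_ratio < 0.4")
            return "approve", path
        path.append("debt_ratio >= 0.4")
        return "deny", path
    path.append("income < 50000")
    return "deny", path

def black_box(applicant):
    """The same model seen from outside: only the decision is visible."""
    return loan_tree(applicant)[0]

def permutation_importance(model, rows, feature, rng):
    """Fraction of predictions that change when one feature's column is shuffled."""
    baseline = [model(r) for r in rows]
    shuffled = [r[feature] for r in rows]
    rng.shuffle(shuffled)
    changed = 0
    for row, value, before in zip(rows, shuffled, baseline):
        perturbed = dict(row)
        perturbed[feature] = value
        if model(perturbed) != before:
            changed += 1
    return changed / len(rows)

# Synthetic applicants; zip_digit is deliberately ignored by the model.
rng = random.Random(0)
rows = [
    {
        "income": rng.randrange(20_000, 100_000),
        "debt_ratio": rng.random(),
        "zip_digit": rng.randrange(10),
    }
    for _ in range(200)
]

# Intrinsic and local: the model explains one prediction by its own structure.
decision, path = loan_tree({"income": 62_000, "debt_ratio": 0.55, "zip_digit": 3})
print(decision, "because", " and ".join(path))

# Post-hoc, global, model-agnostic: probe the model only through its outputs.
for feature in ["income", "debt_ratio", "zip_digit"]:
    print(feature, permutation_importance(black_box, rows, feature, rng))
```

Because `permutation_importance` only calls the model, it would work unchanged on a neural network, whereas the decision path is available only because the tree's structure is readable.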
Interpretability often involves trade-offs. Simpler ‘white-box’ models, that is, models with relatively straightforward internal logic, variables, and decision-making steps, are generally easier to understand but may deliver lower accuracy than more complex ‘black-box’ systems, such as deep neural networks, whose internal workings are a mystery to their users. In addition, interpretability methods themselves are not yet standardized. Different techniques can produce different explanations for the same model, and what counts as a satisfactory explanation may vary depending on the audience.
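The point that different techniques can disagree about the same model is easy to reproduce on a toy case. The sketch below assumes a hypothetical two-input `model` and compares two simple attribution ideas: occlusion, which asks how much the output drops when a feature is replaced by a baseline value, and sensitivity, which asks how steeply the output changes for a tiny nudge to the feature. The model, the input, and the zero baseline are all invented for illustration.

```python
def model(x):
    """Hypothetical model with one linear input and one nonlinear input."""
    return 10.0 * x["a"] + x["b"] ** 2

def occlusion(model, x, baseline=0.0):
    """Attribution = drop in output when a feature is replaced by the baseline."""
    full = model(x)
    return {k: full - model({**x, k: baseline}) for k in x}

def sensitivity(model, x, eps=1e-6):
    """Attribution = finite-difference slope of the output for each feature."""
    full = model(x)
    return {k: (model({**x, k: x[k] + eps}) - full) / eps for k in x}

def top_feature(scores):
    """Name of the feature with the largest absolute attribution."""
    return max(scores, key=lambda k: abs(scores[k]))

x = {"a": 1.0, "b": 4.0}
occ = occlusion(model, x)
sen = sensitivity(model, x)

print("occlusion:  ", occ, "-> top feature:", top_feature(occ))
print("sensitivity:", sen, "-> top feature:", top_feature(sen))
```

For this input, occlusion ranks `b` first while sensitivity ranks `a` first. Neither answer is wrong: the two methods answer different questions, which is one reason a satisfactory explanation depends on who is asking and why.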
Conclusions
“Interpretability research is especially difficult because it takes place within the rush of A.I. development,” said the NYT article in conclusion. “Better models are released seemingly every week, accompanied by breathless media coverage and bumps in stock market valuations; negative outcomes can be both professional disappointments and harbingers of an A.I. bubble popping.”
“Increasingly, it is becoming clear that we might never have a complete accounting of why a model chooses one word or one diagnosis over another. Wars could soon be fought by A.I. agents with intractable, alien minds and opaque motives. A scientific discovery might be locked inside the neural net of an A.I. system, never to be extracted. Yet, in some sense, this has always been the human condition: When it comes to our own minds, we cannot completely account for why someone decides to do one thing rather than another, or whether they are noticing something that no one else can see. Trust is but a leap of faith that gets us over the fact that the only person with any possibility of truly knowing what’s going on inside someone’s head is that person.”
The hope is that, over time, AI development will become less frenetic and interpretability research will mature into a more systematic scientific discipline — more like biology or psychology and less like crisis management. Science is often slow and iterative, but its methods have repeatedly proven reliable. New ideas are tested, refined, rejected, and improved over time. As with earlier scientific revolutions, understanding advanced AI systems may ultimately take decades rather than years.
