Deep Learning Doesn’t Need To Be a Black Box
The cultural perception of AI is often suspect because of the challenges in knowing why a deep neural network makes its predictions. So researchers try to crack open this black box after a network is trained to correlate results with inputs. But what if the goal of explainability could be designed into the network's architecture, before the model is trained and without reducing its predictive power? Maybe the box could stay open from the beginning.
Deep neural networks can perform wonderful feats thanks to their extremely large and complicated web of parameters. But their complexity is also their curse: The inner workings of neural networks are often a mystery, even to their creators. This challenge has troubled the artificial intelligence community since deep learning started to become popular in the early 2010s.
In tandem with the expansion of deep learning in various domains and applications, there has been a growing interest in developing techniques that try to explain neural networks by examining their results and learned parameters. But these explanations are often erroneous and misleading, and they provide little guidance in fixing possible misconceptions embedded in deep learning models during training.
In a paper published in the peer-reviewed journal Nature Machine Intelligence, scientists at Duke University propose concept whitening, a technique that can help steer neural networks toward learning specific concepts without sacrificing performance. Concept whitening bakes interpretability into deep learning models instead of searching for answers in millions of trained parameters. The technique, which can be applied to convolutional neural networks, shows promising results and can have great implications for how we perceive future research in artificial intelligence.
Features and Latent Space in Deep Learning Models
Given enough quality training examples, a deep learning model with the right architecture should be able to discriminate between different types of input. For instance, in the case of computer vision tasks, a trained neural network will be able to transform the pixel values of an image into its corresponding class. (Because concept whitening is meant for image recognition, we’ll stick to this subset of machine learning tasks. But many of the topics discussed here apply to deep learning in general.)
During training, each layer of a deep learning model learns to encode the features of the training images as a set of numerical values. The space of these encoded representations is called the latent space of the AI model. In general, the lower layers of a multilayered convolutional neural network will learn basic features such as corners and edges. The higher layers of the neural network will learn to detect more complex features such as faces, objects, or full scenes.
Each layer of the neural network encodes specific features from the input image.
Ideally, a neural network’s latent space would represent concepts that are relevant to the classes of images it is meant to detect. But we don’t know that for sure, and deep learning models are prone to learning the most discriminative features, even if they’re the wrong ones.
For instance, suppose a data set contains images of cats that all happen to have a logo in the lower right corner. A human would easily dismiss the logo as irrelevant to the task. But a deep learning model might find it to be the easiest and most efficient way to tell the difference between cats and other animals. Likewise, if all the images of sheep in your training set contain large swaths of green pastures, your neural network might learn to detect green farmlands instead of sheep.
During training, machine learning algorithms search for the most accessible pattern that correlates pixels to labels.
So, aside from how well a deep learning model performs on training and test data sets, it is important to know which concepts and features it has learned to detect. This is where classic explanation techniques come into play.
Post Hoc Explanations of Neural Networks
Many deep learning explanation techniques are post hoc, which means they try to make sense of a trained neural network by examining its output and parameter values. For instance, one popular technique to determine what a neural network sees in an image is to mask different parts of an input image and observe how these changes affect the output of the deep learning model. This technique helps create heatmaps that highlight the features of the image that are more relevant to the neural network.
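The masking idea described above can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular library's implementation: `occlusion_heatmap` and `toy_model` are hypothetical names, and the "model" is a stand-in function whose score depends only on the lower-right corner of the image, mimicking a network that has latched onto a logo.

```python
import numpy as np

def occlusion_heatmap(model, image, patch=4, fill=0.0):
    """Mask one patch at a time and record how much the model's score drops."""
    h, w = image.shape
    baseline = model(image)
    heatmap = np.zeros((h // patch, w // patch))
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            masked = image.copy()
            masked[i:i + patch, j:j + patch] = fill  # occlude this patch
            heatmap[i // patch, j // patch] = baseline - model(masked)
    return heatmap

# Hypothetical stand-in for a trained classifier (an assumption for this demo):
# its score is simply the mean brightness of the lower-right 4x4 corner.
def toy_model(image):
    return float(image[-4:, -4:].mean())

rng = np.random.default_rng(0)
image = rng.uniform(0.0, 0.2, size=(16, 16))  # dim background
image[-4:, -4:] = 1.0                         # bright "logo" in the corner

heatmap = occlusion_heatmap(toy_model, image)
hottest = np.unravel_index(heatmap.argmax(), heatmap.shape)
print("Most influential patch (row, col):", hottest)
```

In this toy setup the only patch that changes the score is the one covering the corner, so the heatmap points straight at the logo. A real network would be far less tidy, which is part of why such maps can be hard to read.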
Other post hoc techniques involve turning different artificial neurons on and off and examining how these changes affect the output of the AI model. These methods can help find hints about relations between features and the latent space.
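As a rough sketch of that neuron-switching approach, the snippet below zeroes out one hidden unit at a time in a tiny two-layer network and measures how much the output moves. Everything here is an assumption made for illustration: the `forward` and `ablation_effects` helpers, the random input weights, and the hand-picked output weights that make the prediction depend mostly on hidden unit 1.

```python
import numpy as np

def forward(x, w1, w2, ablate=None):
    """Two-layer network with ReLU; optionally switch off one hidden unit."""
    hidden = np.maximum(0.0, x @ w1)
    if ablate is not None:
        hidden[..., ablate] = 0.0  # turn this neuron "off"
    return hidden @ w2

def ablation_effects(x, w1, w2):
    """Mean absolute change in the output when each hidden unit is removed."""
    baseline = forward(x, w1, w2)
    return np.array([
        np.abs(baseline - forward(x, w1, w2, ablate=k)).mean()
        for k in range(w1.shape[1])
    ])

rng = np.random.default_rng(0)
w1 = rng.normal(size=(5, 4))          # hypothetical input-to-hidden weights
w2 = np.array([0.0, 2.0, 0.1, 0.0])   # hypothetical: output leans on unit 1
x = rng.normal(size=(32, 5))          # a batch of 32 made-up inputs

effects = ablation_effects(x, w1, w2)
print("Output change per ablated unit:", np.round(effects, 3))
```

Here the answer is clean because the weights were chosen by hand. In a trained network, a concept is usually spread across many units, so switching off any single one tends to tell you much less.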
While these methods are helpful, they still treat deep learning models like black boxes and don’t paint a definite picture of the workings of neural networks.
“'Explanation' methods are often summary statistics of performance (e.g., local approximations, general trends on node activation) rather than actual explanations of the model’s calculations,” the authors of the concept whitening paper write.
For instance, the problem with saliency maps is that they often fail to reveal the wrong things the neural network might have learned. And interpreting the role of single neurons becomes very difficult when the features of a neural network are scattered across the latent space.
Saliency-map explanations do not provide accurate representations of how black-box AI models work.
“Deep neural networks (NNs) are very powerful in image recognition, but what is learned in the hidden layers of NNs is unknown due to its complexity. Lack of interpretability makes NNs untrustworthy and hard to troubleshoot,” Zhi Chen, a PhD student in computer science at Duke University and lead author of the concept whitening paper, told TechTalks. “Many previous works attempt to explain post hoc what has been learned by the models, such as what concept is learned by each neuron. But these methods heavily rely on the assumption that these concepts are actually learned by the network (which they are not) and concentrated on one neuron (again, this is not true in practice).”
Cynthia Rudin, professor of computer science at Duke University and coauthor of the concept whitening paper, had previously warned about the dangers of trusting black-box explanation techniques and had shown how such methods could provide erroneous interpretations of neural networks. In a previous paper, also published in Nature Machine Intelligence, Rudin had encouraged the use and development of AI models that are inherently interpretable. Rudin, who is also Chen’s PhD advisor, directs Duke University’s Prediction Analysis Lab, which focuses on interpretable machine learning.
The goal of concept whitening is to develop neural networks whose latent space is aligned with the concepts that are relevant to the task they have been trained for. This approach makes the deep learning model interpretable and makes it much easier to figure out the relations between the features of an input image and the output of the neural network.
“Our work directly alters the neural network to disentangle the latent space so that the axes are aligned with known concepts,” Rudin told TechTalks.
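To give a feel for what aligning axes with a concept can mean, here is a toy NumPy sketch. It is not the authors' implementation: it whitens a batch of made-up latent vectors with standard ZCA whitening (decorrelating them and scaling each direction to unit variance), then applies an orthogonal rotation so that the first axis points along the mean direction of a hypothetical concept's samples. The mean-direction rule, the helper names, and the synthetic data are all assumptions chosen for simplicity and are not taken from the paper.

```python
import numpy as np

def zca_whiten(z, eps=1e-5):
    """Center, decorrelate, and rescale latent vectors to unit variance."""
    centered = z - z.mean(axis=0)
    cov = centered.T @ centered / len(z)
    eigvals, eigvecs = np.linalg.eigh(cov)
    w = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return centered @ w

def rotation_to_concept(direction):
    """Orthogonal matrix whose first column points along `direction`."""
    basis = np.eye(len(direction))
    basis[:, 0] = direction / np.linalg.norm(direction)
    q, _ = np.linalg.qr(basis)      # orthonormalize the basis
    if q[:, 0] @ direction < 0:     # QR may flip the sign of a column
        q[:, 0] = -q[:, 0]
    return q

rng = np.random.default_rng(0)
raw = rng.normal(size=(500, 4))
raw[:100, 2] += 3.0                 # first 100 rows: hypothetical concept samples
mixing = np.array([[2.0, 1.0, 0.0, 0.0],
                   [0.0, 1.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0, 0.5],
                   [0.0, 0.0, 0.0, 1.0]])
latent = raw @ mixing               # entangled, correlated latent features

whitened = zca_whiten(latent)
q = rotation_to_concept(whitened[:100].mean(axis=0))
aligned = whitened @ q              # axis 0 now tracks the concept

concept_mean = aligned[:100].mean(axis=0)
print("Mean activation of concept samples per axis:", np.round(concept_mean, 3))
```

Because the rotation is orthogonal, the features stay decorrelated after it is applied, and in this toy setup the concept samples activate only the first axis on average. Reading that axis then tells you how strongly an input expresses the concept, which is the kind of direct lookup the post hoc methods above cannot guarantee.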