Interpretability models

Why is interpretability so important in machine learning? Why can't we just trust the predictions of a supervised model?

There are several reasons: improving social acceptance of ML algorithms as they are integrated into our lives; correcting a model after discovering a bias in the population of the training set; understanding the cases in which the model fails; and complying with laws and regulations.

Nowadays, complex supervised models can be very accurate on specific tasks but remain largely uninterpretable; conversely, simple models are easy to interpret but are often less accurate. How can we resolve this dilemma?

This post attempts to answer this question by going through the ML literature on interpretability and by focusing on the class of additive feature attribution methods [11].

1. The main idea

The problem of interpreting a model prediction can be recast as follows: which parts of the input are particularly important in explaining the output of the model?

To illustrate this, let's consider the example given by Shrikumar at the ICML conference. Suppose you have already trained a model on DNA mutations that cause diseases. Now consider a DNA sequence given as input to this model.

The model predicts whether this sequence can be linked to any of the diseases it has learnt. If so, you would like to understand why the model gives this particular prediction, i.e. which parts of the input sequence lead the model to predict a specific disease. You would therefore like higher weights for the parts of the sequence that best explain the model's decision and lower weights for those that do not explain the prediction.

To achieve that, most approaches iterate between two steps:

  1. Apply a perturbation to some part of the input
  2. Observe the change in the output (fitted answer)

Repeat steps 1 and 2 for different perturbations of the input.
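
The two steps above can be sketched in a few lines of Python. This is only a minimal illustration, not any particular published method: the function name perturbation_importance, the toy model and the choice of replacing one input position at a time by a baseline value are all assumptions made for the example.

```python
from typing import Callable, List, Sequence


def perturbation_importance(
    model: Callable[[Sequence[float]], float],
    x: Sequence[float],
    baseline: Sequence[float],
) -> List[float]:
    """Score each input position by the output change when it is perturbed."""
    reference_output = model(x)
    scores = []
    for i in range(len(x)):
        perturbed = list(x)
        perturbed[i] = baseline[i]  # step 1: perturb one part of the input
        scores.append(reference_output - model(perturbed))  # step 2: observe the change
    return scores


# Hypothetical toy model: a weighted sum, so the scores are easy to check by hand.
def toy_model(v: Sequence[float]) -> float:
    return 2.0 * v[0] - 1.0 * v[1] + 0.0 * v[2]


scores = perturbation_importance(toy_model, x=[1.0, 1.0, 1.0], baseline=[0.0, 0.0, 0.0])
print(scores)  # [2.0, -1.0, 0.0]
```

The third position gets a score of zero because the toy model ignores it, which is exactly the kind of information we are looking for.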

2. Existing approaches

The need for tools to explain prediction models came with the development of more complex models to deal with more complex data. The recent literature in Computer Vision and Machine Learning has therefore developed a new field devoted to interpretability.

2.1. Cooperative game theory based

Back at the beginning of the 21st century, Lipovetsky et al. (2001) [1] highlighted the multicollinearity problem in the analysis of regressor importance in multiple regression: collinearity among regressors distorts the estimated coefficients, so a coefficient alone can be a misleading measure of a variable's importance. To address this, they use a tool from cooperative game theory to obtain the comparative importance of predictors: the Shapley value imputation [0], which is derived from an axiomatic approach and produces a unique solution satisfying the general requirements of a Nash equilibrium.

A decade later, Strumbelj et al. (2011) [2] generalized the use of Shapley values to black-box models such as SVMs and artificial neural networks, in order to make models more informative and easier to understand and use. They propose an approximation algorithm that assumes mutual independence of individual features in order to overcome the time complexity of the exact solution.
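
To give an idea of what such an approximation can look like, here is a rough Monte Carlo sketch in the spirit of sampling-based Shapley estimation. It is not the authors' code: the function name sampled_shapley, its parameters and the toy model are assumptions for illustration. Each iteration draws a random feature ordering and a random background instance, then measures how much the prediction changes when the feature of interest is switched from the background value to the value in x.

```python
import random
from typing import Callable, Sequence


def sampled_shapley(
    model: Callable[[Sequence[float]], float],
    x: Sequence[float],
    background: Sequence[Sequence[float]],
    feature: int,
    n_samples: int = 200,
    seed: int = 0,
) -> float:
    """Monte Carlo estimate of the Shapley value of one feature for instance x."""
    rng = random.Random(seed)
    n_features = len(x)
    total = 0.0
    for _ in range(n_samples):
        order = list(range(n_features))
        rng.shuffle(order)                  # random feature ordering
        reference = rng.choice(background)  # random instance for the "unknown" features
        preceding = set(order[: order.index(feature)])
        # Features that precede `feature` in the ordering come from x, the rest from the reference.
        without_i = [x[j] if j in preceding else reference[j] for j in range(n_features)]
        with_i = list(without_i)
        with_i[feature] = x[feature]
        total += model(with_i) - model(without_i)
    return total / n_samples


# Hypothetical linear toy model: every marginal contribution of feature 0 equals 3 * (x0 - reference0).
def toy_model(v: Sequence[float]) -> float:
    return 3.0 * v[0] + 2.0 * v[1]


estimate = sampled_shapley(toy_model, x=[1.0, 1.0], background=[[0.0, 0.0]], feature=0)
print(estimate)  # 3.0 for this linear model and single background instance
```

For a model with interactions the estimate is noisy and only converges as n_samples grows; the linear toy model is used here precisely because its result can be checked by hand.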

2.2. Architecture specific: Deep Neural Network

Since then, several model-specific methods have been proposed in the literature that take advantage of the structure or architecture of the model.

For neural networks, we can think of back-propagation-based methods such as Guided Backpropagation (Springenberg et al. 2014) [4], which use the relationship between the neurons and the output. The idea is to assign a score to neurons according to how much they affect the output. This is done in a single backward pass, which yields the scores for all parts of the input. Other approaches build a linear model that locally approximates the more complicated model, based on data that affects the output (LIME [6]). Shrikumar et al. (2017) [7] introduce DeepLIFT, which assigns contribution scores to the features based on the difference between the activation of each neuron and its ‘reference activation’. Other explanation methods specific to deep neural networks have been proposed in the literature; the reader can consult [8] for additional references on this subject.
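
To make the 'difference from reference' idea more concrete, here is a minimal sketch for a single linear unit only. It is not the full DeepLIFT algorithm (which also handles non-linearities and propagates scores through layers), and the function names and numbers are made up for the example. For a linear unit, each input receives the score w_i * (x_i - reference_i), and the scores add up to the difference between the output and the reference output.

```python
from typing import List, Sequence


def linear_unit(weights: Sequence[float], bias: float, inputs: Sequence[float]) -> float:
    """Output of a single linear unit: bias + sum_i w_i * x_i."""
    return bias + sum(w * v for w, v in zip(weights, inputs))


def difference_from_reference_scores(
    weights: Sequence[float],
    inputs: Sequence[float],
    reference: Sequence[float],
) -> List[float]:
    """Contribution of each input, measured relative to the reference input."""
    return [w * (v - r) for w, v, r in zip(weights, inputs, reference)]


# Illustrative values (assumptions for this sketch).
weights, bias = [0.5, -2.0, 1.0], 0.25
inputs, reference = [2.0, 1.0, 4.0], [0.0, 0.0, 0.0]

scores = difference_from_reference_scores(weights, inputs, reference)
delta_output = linear_unit(weights, bias, inputs) - linear_unit(weights, bias, reference)
print(scores, delta_output)  # [1.0, -2.0, 4.0] 3.0
```

The scores sum to the change in the output, so the whole difference from the reference is distributed over the inputs.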

Below is a list of methods, with their available Python code, that summarizes the most recent model-specific approaches:

Random Forest:

Deep Neural Network:

2.3. A unified approach

The most recent and general approach to model interpretability is the SHAP model of Lundberg et al. (2017) [11]. It proposes a class of methods, called additive feature attribution methods, that contains most of the approaches cited above. These methods share the same explanation model (i.e. any interpretable approximation of the original model), which we introduce in the next section.

Cooperative Game theory-based:

3. SHAP: Additive feature attribution methods

An explanation model is a simple model which describes the behavior of the complex model. The additive attribution methods introduce a linear function of binary variables to represent such an explanation model.

3.1. The SHAP model

Let f be the original prediction model to be explained and g the explanation model. Additive feature attribution methods have an explanation model that is a linear function of binary variables:

$$g(z') = \Phi_0 + \sum_{i=1}^{M} \Phi_i z'_i$$

where M is the number of features, z' ∈ {0,1}^M, each z'i indicates whether feature i is observed (z'i = 1) or unknown (z'i = 0), and the Φi ∈ ℝ are the feature attribution values. There is only one solution for the Φi that satisfies the general requirements of a Nash equilibrium and the three natural properties explained in the section that follows.
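
As a small illustration, the explanation model can be evaluated directly once the attribution values are known. The sketch below assumes the usual additive form, a base value Φ0 plus the sum of the Φi of the observed features; the function name explanation_model and the attribution values are made up for the example.

```python
from typing import Sequence


def explanation_model(phi_0: float, phi: Sequence[float], z_prime: Sequence[int]) -> float:
    """Evaluate g(z') = phi_0 + sum_i phi_i * z'_i for a binary vector z'."""
    if any(z not in (0, 1) for z in z_prime):
        raise ValueError("z' must be a binary vector")
    return phi_0 + sum(p * z for p, z in zip(phi, z_prime))


# Illustrative attribution values (made up for this sketch).
phi_0 = 0.5
phi = [0.25, -0.125, 0.0]

all_observed = explanation_model(phi_0, phi, [1, 1, 1])   # base value plus every attribution
none_observed = explanation_model(phi_0, phi, [0, 0, 0])  # only the base value remains
print(all_observed, none_observed)  # 0.625 0.5
```

When every feature is unknown, only the base value is left; each observed feature then moves the output up or down by its own Φi.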

3.2. The Natural properties

(1) Local accuracy: the output of the explanation model matches the output of the original model for the prediction being explained:

$$g(x') = f(x)$$

(2) Missingness: a feature that is missing from the input (i.e. turned off) receives no attribution:

$$x'_i = 0 \Rightarrow \Phi_i = 0$$

(3) Consistency: if turning a feature off always makes a bigger difference in one model than in another, then the importance of that feature should be higher in the first model than in the second.

Let z' \ i denote z' with z'i = 0. For any two models f^1 and f^2, if

$$f^1_x(z') - f^1_x(z' \setminus i) \geq f^2_x(z') - f^2_x(z' \setminus i)$$

for all inputs z' ∈ {0,1}^M, then

$$\Phi_i(f^1, x) \geq \Phi_i(f^2, x)$$

3.3. Computing SHAP values

3.3.1. Back to the Shapley values

The computation of feature importances (the SHAP values) comes from cooperative game theory [0], through the Shapley values.

In our context, a Shapley value can be viewed as a weighted average of all possible differences between the predictions of the model with feature i and those without feature i:

$$\Phi_i(f, x) = \sum_{z' \subseteq x' \setminus i} \frac{|z'|! \, (M - |z'| - 1)!}{M!} \left[ f_x(z' \cup i) - f_x(z') \right]$$

where |z′| is the number of non-zero entries of z′, the sum runs over all z′ vectors whose non-zero entries are a subset of the non-zero entries of x′ excluding feature i, and z′ ∪ i denotes z′ with z′i set to 1. Since the problem is combinatorial, different strategies have been proposed in the literature to approximate the solution ([0,1]).
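
For a handful of features, the weighted average can be computed exactly by enumerating every subset. The sketch below is an illustration under stated assumptions: the model restricted to a set of observed features is represented by a hypothetical function value(observed), and the names exact_shapley and toy_value are made up. Each subset of the other features gets the classical Shapley weight |S|! (M - |S| - 1)! / M!.

```python
from itertools import combinations
from math import factorial
from typing import Callable, FrozenSet, List


def exact_shapley(value: Callable[[FrozenSet[int]], float], n_features: int) -> List[float]:
    """Exact Shapley values by enumerating all subsets of the other features."""
    phi = []
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        total = 0.0
        for size in range(n_features):  # subset sizes 0 .. M-1
            weight = factorial(size) * factorial(n_features - size - 1) / factorial(n_features)
            for subset in combinations(others, size):
                observed = frozenset(subset)
                # Difference between the prediction with feature i and without it.
                total += weight * (value(observed | {i}) - value(observed))
        phi.append(total)
    return phi


# Hypothetical toy game: features 0 and 1 only matter together, feature 2 acts alone.
def toy_value(observed: FrozenSet[int]) -> float:
    interaction = 10.0 if {0, 1} <= observed else 0.0
    return interaction + (4.0 if 2 in observed else 0.0)


phi = exact_shapley(toy_value, n_features=3)
print(phi)  # approximately [5.0, 5.0, 4.0]
```

The interaction of 10 is shared equally between features 0 and 1, and the values sum to toy_value of all features minus toy_value of no feature. The number of subsets doubles with every extra feature, which is why approximations are needed in practice.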

3.3.2. The SHAP values

In the more general context, the SHAP values can be viewed as the Shapley values of a conditional expectation function of the original model:

$$f_x(z') = E[f(z) \mid z_S]$$

where S is the set of non-zero entries of z'. In practice, computing SHAP values is challenging, which is why Lundberg et al. [11] propose different approximation algorithms depending on the specificities of your model or your data (tree ensembles, independent features, deep networks, ...).
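
One simple way to approximate this conditional expectation, assuming the features are independent, is to keep the observed features at their values in x and to average the model output over a background dataset for the others. The sketch below illustrates that idea only; it is not the implementation used in the shap library, and the names expected_output, toy_model and background are assumptions.

```python
from typing import Callable, Collection, Sequence


def expected_output(
    model: Callable[[Sequence[float]], float],
    x: Sequence[float],
    background: Sequence[Sequence[float]],
    observed: Collection[int],
) -> float:
    """Average model output when only the `observed` features are fixed to their value in x."""
    total = 0.0
    for row in background:
        # Observed features come from x, unknown ones from the background row.
        mixed = [x[j] if j in observed else row[j] for j in range(len(x))]
        total += model(mixed)
    return total / len(background)


# Hypothetical toy model and background data.
def toy_model(v: Sequence[float]) -> float:
    return v[0] + 2.0 * v[1]


background = [[0.0, 0.0], [2.0, 2.0]]
x = [3.0, 4.0]

base_value = expected_output(toy_model, x, background, observed=set())  # no feature known
only_first = expected_output(toy_model, x, background, observed={0})    # feature 0 known
full = expected_output(toy_model, x, background, observed={0, 1})       # equals toy_model(x)
print(base_value, only_first, full)  # 3.0 5.0 11.0
```

With no feature observed we recover the average prediction over the background data, and with all features observed we recover the prediction for x itself.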

4. Practical example with SHAP library

Lundberg created a GitHub repository for this purpose, with very nice and fairly complete notebooks explaining different use cases for SHAP and its different approximation algorithms (Tree, Deep, Gradient, Linear or Kernel Explainers).

I strongly encourage the reader to visit the author's GitHub page.

Here, I will just introduce a very simple example to give an idea of the kind of results we can obtain when interpreting a prediction. Let's consider the heart dataset from a Kaggle competition. The dataset consists of 13 variables describing 303 patients and 1 label describing the angiographic disease status (target ∈ {0,1}). The set is fairly balanced, since 165 patients have label 1 and 138 have label 0.

The data have been lightly pre-processed so that we keep only the most informative variables, which are ['sex', 'cp', 'thalach', 'exang', 'oldpeak', 'ca', 'thal']. We then randomly split the dataset into a training set (75% of the data) and a test set (the remaining 25%), and we scale them. An SVM classifier is then trained, and we obtain a classification accuracy of around 91%.

Once the classification model is trained, we want to explain a particular prediction (a correct one ;) ) using the shap library developed by Lundberg. You need to install the shap library first.

This figure illustrates the features that push the prediction higher (in pink) and those that push it lower (in blue), starting from a base value computed as the average model output on the training dataset.

For this true positive, we can see that the features that push the probability towards 1 are mainly 'sex', 'oldpeak', 'thalach' and 'exang', whereas the 'ca' feature tends to push the prediction score down.

We can apply this explainer to all correctly predicted examples in the test set.

The figure above shows all the individual feature contributions, stacked horizontally and ordered by output value. The first 39 predictions are correctly classified in class 1 and the last 30 are correctly classified in class 0. Note that the visualisation is interactive: we can see the effect of a particular feature by changing the y-axis in the menu on the left side of the figure. Similarly, you can use the x-axis menu to order the samples by output value, similarity or SHAP value of a given feature.

It can also be very interesting to have, in one plot, an overview of the distribution of SHAP values for each feature and an idea of their overall impact (note that the example is still restricted to the correctly predicted samples).

In the first plot (subfigure a.), there are three kinds of information: the x-axis shows the SHAP values of each feature listed on the y-axis. Each row shows the set of SHAP values computed for a specific feature, and this is done for every feature of your model. The third dimension is the color of the points: it represents the feature value (pink for a high value of the feature and blue for a low value). You can therefore see the dispersion of SHAP values across features, as well as their impact on the model output. For instance, high values of the 'cp' feature correspond to high SHAP values and tend to push the prediction up, whereas high values of the 'thal' feature (pink points) tend to lower the predicted score.

The second plot (subfigure b.) shows the mean absolute SHAP value obtained for each feature. It can be seen as a summary of the first plot.
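
The summary in the second plot is easy to reproduce by hand once you have a matrix of SHAP values with one row per sample and one column per feature. The sketch below uses made-up numbers and generic feature names, not the values obtained on the heart dataset, and the function name mean_abs_shap is an assumption.

```python
from typing import List, Sequence, Tuple


def mean_abs_shap(
    shap_values: Sequence[Sequence[float]],
    feature_names: Sequence[str],
) -> List[Tuple[str, float]]:
    """Rank features by the mean absolute value of their SHAP values."""
    n_samples = len(shap_values)
    importance = {
        name: sum(abs(row[j]) for row in shap_values) / n_samples
        for j, name in enumerate(feature_names)
    }
    return sorted(importance.items(), key=lambda item: item[1], reverse=True)


# Made-up SHAP values for two samples and two hypothetical features.
example_values = [[0.25, -0.5], [-0.75, 1.5]]
ranking = mean_abs_shap(example_values, ["feature_a", "feature_b"])
print(ranking)  # [('feature_b', 1.0), ('feature_a', 0.5)]
```

Taking the absolute value matters: a feature that pushes some predictions up and others down would otherwise look unimportant once its contributions cancel out.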


Many approaches have been proposed in the literature to deal with interpretable/explainable models in the supervised context.

The main strengths of the additive feature attribution model are, on the one hand, its theoretical properties and, on the other hand, its general framework, which encompasses most of the explanation methods developed in the literature. Lundberg et al. have proposed different approximation algorithms that take advantage of the structure of the model and the type of data to reduce computation time. If you deal with deep neural networks or tree ensembles, I really encourage you to look at more examples on the author's GitHub repository.


[0] Shapley, Lloyd S. “A Value for N-Person Games.” Contributions to the Theory of Games 2 (28): 307–17, (1953).

[1] Lipovetsky, S. and Conklin, M. "Analysis of regression in game theory approach." Applied Stochastic Models in business and industry (17-4):319-330, (2001).

[2] Strumbelj et al., "A General Method for Visualizing and Explaining Black-Box Regression Models." Adaptive and Natural Computing Algorithms. ICANNGA 2011. Lecture Notes in Computer Science, vol 6594. (2011).

[3] Saabas et al., "Interpreting random forests",

[4] Springenberg et al., "Striving for simplicity: The all convolutional net", arXiv:1412.6806 (2014).

[5] Bach et al., "On Pixel-wise Explanations for Non-Linear Classifier Decisions by Layer-wise Relevance Propagation", PLOS ONE 10(7): e0130140, (2015).

[6] Ribeiro et al. "Why should I Trust You ? Explaining the predictions of any classifier", Proceedings of the 22nd ACM SIGKDD: 1135-1144 (2016).

[7] Shrikumar et al., "Learning Important Features Through Propagating Activation Differences", Proceedings of ICML (2017).

[8] "Explainable and Interpretable Models in Computer Vision and Machine Learning", Springer Verlag, The Springer Series on Challenges in Machine Learning, 9783319981307 (2018).

[9] Sundararajan et al., "Axiomatic Attribution for Deep Networks", Proceedings of ICML (2017).

[10] Montavon et al., "Explaining nonlinear classification decisions with deep Taylor decomposition", Pattern Recognition (65):211-222 (2017).

[11] Lundberg et al., ''A unified approach to interpreting model predictions'', NIPS (2017).
