While feature importance shows what variables most affect predictions, partial dependence plots show how a feature affects predictions.
This is useful to answer questions like:
- Controlling for all other house features, what impact do longitude and latitude have on home prices? To restate this, how would similarly sized houses be priced in different areas?
- Are predicted health differences between two groups due to differences in their diets, or due to some other factor?
If you are familiar with linear or logistic regression models, partial dependence plots can be interpreted similarly to the coefficients in those models. That said, partial dependence plots on sophisticated models can capture more complex patterns than coefficients from simple models. If you aren't familiar with linear or logistic regressions, don't worry about this comparison.
We will show a couple of examples, explain the interpretation of these plots, and then review the code used to create them.
How it Works
Like permutation importance, partial dependence plots are calculated after a model has been fit. The model is fit on real data that has not been artificially manipulated in any way.
In our soccer example, teams may differ in many ways. How many passes they made, shots they took, goals they scored, etc. At first glance, it seems difficult to disentangle the effect of these features.
To see how partial plots separate out the effect of each feature, we start by considering a single row of data. For example, that row of data might represent a team that had the ball 50% of the time, made 100 passes, took 10 shots and scored 1 goal.
We will use the fitted model to predict our outcome (probability their player won "Man of the Game"). But we repeatedly alter the value for one variable to make a series of predictions. We could predict the outcome if the team had the ball only 40% of the time. We then predict with them having the ball 50% of the time. Then predict again for 60%. And so on. We trace out predicted outcomes (on the vertical axis) as we move from small values of ball possession to large values (on the horizontal axis).
In this description, we used only a single row of data. Interactions between features may cause the plot for a single row to be atypical. So, we repeat that mental experiment with multiple rows from the original dataset, and we plot the average predicted outcome on the vertical axis.
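To make that procedure concrete, here is a minimal sketch of the averaging loop in plain code. This illustrates the idea only, not how any particular library implements it; `model` is assumed to be a fitted binary classifier with `predict_proba`, and `X` a pandas DataFrame:

```python
import numpy as np

def partial_dependence_curve(model, X, feature, grid_points=20):
    """Trace the average predicted probability as one feature sweeps a grid."""
    grid = np.linspace(X[feature].min(), X[feature].max(), grid_points)
    avg_preds = []
    for value in grid:
        X_mod = X.copy()
        X_mod[feature] = value                    # force every row to the same value
        probs = model.predict_proba(X_mod)[:, 1]  # predicted probability for each row
        avg_preds.append(probs.mean())            # average over all rows
    return grid, np.array(avg_preds)              # x and y values of the PDP
```

Plotting `grid` on the horizontal axis against `avg_preds` on the vertical axis gives exactly the plot described above.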
Model building isn't our focus, so we won't dwell on the data exploration or model-building code.
```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

data = pd.read_csv('../input/fifa-2018-match-statistics/FIFA 2018 Statistics.csv')
y = (data['Man of the Match'] == "Yes")  # Convert from string "Yes"/"No" to binary
feature_names = [i for i in data.columns if data[i].dtype in [np.int64]]
X = data[feature_names]
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
tree_model = DecisionTreeClassifier(random_state=0,
                                    max_depth=5,
                                    min_samples_split=5).fit(train_X, train_y)
```
For the sake of explanation, our first example uses a Decision Tree, which you can see below. In practice, you'll use more sophisticated models for real-world applications.
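If you want to reproduce the tree drawing yourself, one common approach (an assumption here, since the original snippet isn't shown) is scikit-learn's `export_graphviz` together with the `graphviz` package:

```python
from sklearn import tree
import graphviz

# Export the fitted tree to DOT format and render it
tree_graph = tree.export_graphviz(tree_model, out_file=None, feature_names=feature_names)
graphviz.Source(tree_graph)
```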
As guidance to read the tree:

- Nodes with children show their splitting criterion at the top.
- The pair of values at the bottom show the count of True values and False values, respectively, for the target among the data points in that node of the tree.
Here is the code to create the Partial Dependence Plot using the PDPBox library.
```python
from matplotlib import pyplot as plt
from pdpbox import pdp, get_dataset, info_plots

# Create the data that we will plot
pdp_goals = pdp.pdp_isolate(model=tree_model, dataset=val_X,
                            model_features=feature_names, feature='Goal Scored')

# plot it
pdp.pdp_plot(pdp_goals, 'Goal Scored')
plt.show()
```
A few items are worth pointing out as you interpret this plot:

- The y axis shows the change in the prediction from what would be predicted at the baseline (leftmost) value.
- A blue shaded area indicates the level of confidence.

From this particular graph, we see that scoring a goal substantially increases your chances of winning "Man of the Game." But extra goals beyond that appear to have little impact on predictions.

Here is another example plot, this time for a different feature:

```python
# Create and plot partial dependence for distance covered
feature_to_plot = 'Distance Covered (Kms)'
pdp_dist = pdp.pdp_isolate(model=tree_model, dataset=val_X,
                           model_features=feature_names, feature=feature_to_plot)
pdp.pdp_plot(pdp_dist, feature_to_plot)
plt.show()
```
This graph seems too simple to represent reality. But that’s because the model is so simple. You should be able to see from the decision tree above that this is representing exactly the model’s structure.
You can easily compare the structure or implications of different models. Here is the same plot with a Random Forest model.
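The comparison plot can be produced with the same PDPBox call, assuming a Random Forest fit on the same training split (`rf_model` is a name introduced here for illustration):

```python
# Fit a Random Forest on the same training data
rf_model = RandomForestClassifier(random_state=0).fit(train_X, train_y)

# Build the same partial dependence plot for the new model
feature_to_plot = 'Distance Covered (Kms)'
pdp_dist = pdp.pdp_isolate(model=rf_model, dataset=val_X,
                           model_features=feature_names, feature=feature_to_plot)
pdp.pdp_plot(pdp_dist, feature_to_plot)
plt.show()
```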
This model thinks you are more likely to win "Man of the Game" if your players run a total of 100km over the course of the game. Though running much more causes lower predictions.
In general, the smooth shape of this curve seems more plausible than the step function from the Decision Tree model. Though this dataset is small enough that we would be careful in how we interpret any model.
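As an aside, if you would rather not add a dependency, recent versions of scikit-learn (1.0 and later; this is an alternative, not part of the original example) ship a comparable one-feature partial dependence tool:

```python
from matplotlib import pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

# Built-in alternative to PDPBox for a single-feature partial dependence plot
PartialDependenceDisplay.from_estimator(tree_model, val_X, features=['Goal Scored'])
plt.show()
```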
2D Partial Dependence Plots
If you are curious about interactions between features, 2D partial dependence plots are also useful. An example will clarify this.
We will again use the Decision Tree model for this graph. It will create an extremely simple plot, but you should be able to match what you see in the plot to the tree itself.
```python
# Similar to previous PDP plots, except we use pdp_interact instead of pdp_isolate
# and pdp_interact_plot instead of pdp_plot
features_to_plot = ['Goal Scored', 'Distance Covered (Kms)']
inter1 = pdp.pdp_interact(model=tree_model, dataset=val_X,
                          model_features=feature_names, features=features_to_plot)

pdp.pdp_interact_plot(pdp_interact_out=inter1, feature_names=features_to_plot,
                      plot_type='contour')
plt.show()
```
This graph shows predictions for any combination of Goal Scored and Distance Covered.
For example, we see the highest predictions when a team scores at least 1 goal and they run a total distance close to 100km. If they score 0 goals, distance covered doesn’t matter. Can you see this by tracing through the decision tree with 0 goals?
But distance can impact predictions if they score goals. Make sure you can see this from the 2D partial dependence plot. Can you see this pattern in the decision tree too?
Permutation Importance

There are multiple ways to measure feature importance, that is, to answer the question "which features have the biggest impact on predictions?" Some approaches answer subtly different versions of this question. Others have documented shortcomings.
In this lesson, we'll focus on permutation importance. Compared to most other approaches, permutation importance is:

- Fast to calculate
- Widely used and understood
- Consistent with properties we would want a feature importance measure to have
How it Works
Permutation importance uses models differently than anything you’ve seen so far, and many people find it confusing at first. So we’ll start with an example to make it more concrete.
We want to predict a person’s height when they become 20 years old, using data that is available at age 10.
Our data includes useful features (height at age 10), features with little predictive power (socks owned), as well as some other features we won’t focus on in this explanation.
Permutation importance is calculated after a model has been fitted. So we won’t change the model or change what predictions we’d get for a given value of height, sock-count, etc.
Instead we will ask the following question: If I randomly shuffle a single column of the validation data, leaving the target and all other columns in place, how would that affect the accuracy of predictions in that now-shuffled data?
Randomly re-ordering a single column should cause less accurate predictions, since the resulting data no longer corresponds to anything observed in the real world. Model accuracy especially suffers if we shuffle a column that the model relied on heavily for predictions. In this case, shuffling height at age 10 would cause terrible predictions. If we shuffled socks owned instead, the resulting predictions wouldn’t suffer nearly as much.
With this insight, the process is as follows:

1. Get a trained model.
2. Shuffle the values in a single column, and make predictions using the resulting dataset. Use these predictions and the true target values to calculate how much the loss function suffered from shuffling. That performance deterioration measures the importance of the variable you just shuffled.
3. Return the data to the original order (undoing the shuffle from step 2). Now repeat step 2 with the next column in the dataset, until you have calculated the importance of each column.
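Those steps amount to only a few lines of code. The following is a minimal sketch, not any library's actual implementation; the function name, repeat count, and choice of accuracy as the metric are illustrative assumptions:

```python
import numpy as np
from sklearn.metrics import accuracy_score

def permutation_importances(model, val_X, val_y, n_repeats=5, seed=0):
    """Mean and std of the accuracy drop when each column is shuffled."""
    rng = np.random.RandomState(seed)
    baseline = accuracy_score(val_y, model.predict(val_X))
    results = {}
    for col in val_X.columns:
        drops = []
        for _ in range(n_repeats):  # repeat to measure shuffle-to-shuffle noise
            shuffled = val_X.copy()
            shuffled[col] = rng.permutation(shuffled[col].values)
            drops.append(baseline - accuracy_score(val_y, model.predict(shuffled)))
        results[col] = (np.mean(drops), np.std(drops))
    return results  # {column: (importance, variability)}
```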
Code Example
Our example will use a model that predicts whether a soccer/football team will have the “Man of the Game” winner based on the team’s statistics. The “Man of the Game” award is given to the best player in the game. Model-building isn’t our current focus, so the cell below loads the data and builds a rudimentary model.
```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

data = pd.read_csv('../input/fifa-2018-match-statistics/FIFA 2018 Statistics.csv')
y = (data['Man of the Match'] == "Yes")  # Convert from string "Yes"/"No" to binary
feature_names = [i for i in data.columns if data[i].dtype in [np.int64]]
X = data[feature_names]
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
my_model = RandomForestClassifier(random_state=0).fit(train_X, train_y)
```
Here is how to calculate and show importances with the eli5 library:
```python
import eli5
from eli5.sklearn import PermutationImportance

perm = PermutationImportance(my_model, random_state=1).fit(val_X, val_y)
eli5.show_weights(perm, feature_names=val_X.columns.tolist())
```
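`show_weights` renders an HTML table, which is what you see in a notebook. If you are running a plain script instead, eli5 can format the same explanation as text:

```python
# Plain-text alternative for terminals / scripts
print(eli5.format_as_text(eli5.explain_weights(perm, feature_names=val_X.columns.tolist())))
```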
The values towards the top are the most important features, and those towards the bottom matter least.
The first number in each row shows how much model performance decreased with a random shuffling (in this case, using “accuracy” as the performance metric).
Like most things in data science, there is some randomness to the exact performance change from shuffling a column. We measure the amount of randomness in our permutation importance calculation by repeating the process with multiple shuffles. The number after the ± measures how performance varied from one reshuffling to the next.
You'll occasionally see negative values for permutation importances. In those cases, the predictions on the shuffled (or noisy) data happened to be more accurate than those on the real data. This happens when the feature didn't matter (it should have had an importance close to 0), but random chance caused the predictions on shuffled data to be more accurate. This is more common with small datasets, like the one in this example, because there is more room for luck/chance.
In our example, the most important feature was Goals scored. That seems sensible. Soccer fans may have some intuition about whether the orderings of other variables are surprising or not.
Debugging

The world has a lot of unreliable, disorganized and generally dirty data. You add a potential source of errors as you write preprocessing code. Add in the potential for target leakage, and it is the norm rather than the exception to have errors at some point in a real data science project.
Given the frequency and potentially disastrous consequences of bugs, debugging is one of the most valuable skills in data science. Understanding the patterns a model is finding will help you identify when those are at odds with your knowledge of the real world, and this is typically the first step in tracking down bugs.
Informing Feature Engineering

Feature engineering is usually the most effective way to improve model accuracy. It usually involves repeatedly creating new features using transformations of your raw data or features you have previously created.
Sometimes you can go through this process using nothing but intuition about the underlying topic. But you’ll need more direction when you have 100s of raw features or when you lack background knowledge about the topic you are working on.
A Kaggle competition to predict loan defaults gives an extreme example. This competition had 100s of raw features. For privacy reasons, the features had names like f1, f2, f3 rather than common English names. This simulated a scenario where you have little intuition about the raw data.
One competitor found that the difference between two of the features, specifically f527 - f528, created a very powerful new feature. Models including that difference as a feature were far better than models without it. But how might you think of creating this variable when you start with hundreds of variables?
The techniques you’ll learn in this course would make it transparent that f527 and f528 are important features, and that their role is tightly entangled. This will direct you to consider transformations of these two variables, and likely find the “golden feature” of f527 - f528.
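Once you know the two features are entangled, actually constructing the candidate feature is a one-line transformation. In this hypothetical pandas sketch, `loans` stands in for the competition's raw DataFrame:

```python
# Hypothetical: add the difference of the two entangled features as a new column
loans['f527_minus_f528'] = loans['f527'] - loans['f528']
```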
As an increasing number of datasets start with 100s or 1000s of raw features, this approach is becoming increasingly important.
Directing Future Data Collection
You have no control over datasets you download online. But many businesses and organizations using data science have opportunities to expand what types of data they collect. Collecting new types of data can be expensive or inconvenient, so they only want to do this if they know it will be worthwhile. Model-based insights give you a good understanding of the value of features you currently have, which will help you reason about what new values may be most helpful.
Informing Human Decision-Making

Some decisions are made automatically by models. Amazon doesn't have humans (or elves) scurry to decide what to show you whenever you go to their website. But many important decisions are made by humans. For these decisions, insights can be more valuable than predictions.
Building Trust

Many people won't assume they can trust your model for important decisions without verifying some basic facts. This is a smart precaution given the frequency of data errors. In practice, showing insights that fit their general understanding of the problem will help build trust, even among people with little deep knowledge of data science.