【模型可解释笔记】 模型可解释之Shap
简介
SHAP(SHapley Additive exPlanations)是一种博弈论方法, 用于解释任何机器学习模型的输出.
Shapley value
Shapley value
起源于合作博弈论, 诺贝尔经济学奖得主
Lloyd S. Shapley
于 1953 年针对如下问题,
提出一个合理的计算方法, 每个参与者分配到的数额称为
Shapley value
.
博弈问题:
N
个人合作, 创造v(N)
的价值, 对所创造的价值进行分配的问题.
满足分配四大原则:
- 有效性(Efficiency): 所有价值均被分配
- 对称性(Symmetry): 假设两者可相互替代, 其收益应该一样
- Dummy: 未贡献则收益为0
- 可加性(Additivity): 如果同一批人完成两项任务, 那么两项任务的收益一起分配应该和分开分配结果一致
Shapley value
公式为:
\[ \phi_i(N, v) = \sum_{S \in N \setminus \{ i \}} \frac{|S|!(|N| - |s| - 1)!}{|N|!}(v(S \cup \{i\}) - v(S)) \]
计算示例参见卫宫切嗣 的知乎回答
XGBOOST在Spark的Shap计算验证
验证环境:
- spark: 3.0.1 scala: 2.12.10 xgboost4j-spark: 1.5.0
- python: 3.6 xgboost: 1.5.0 shap: 0.38.1
结论: 测试随机生成1W条数据, 每条数据3个特征, 单条条数三个, 单条数据最大绝对误差为9.99e-09(scala, python结果同时8位精度), 计算结果相同
val originData = spark.createDataFrame(
(new Array[Int](100000)).map(x => {
val f1 = math.floor(math.random() * 10) + 1
val f2 = math.floor(math.random() * 100) + 1
val f3 = math.floor(math.random() * 1000) + 1
val label = if (math.sin(f1 + f2 + f3) > 0) 1 else 0
(f1, f2, f3, label)
})
).toDF("f1", "f2", "f3", "label")
import org.apache.spark.ml.feature.VectorAssembler
val assembler = new VectorAssembler().setInputCols(Array[String]("f1", "f2", "f3")).setOutputCol("features")
val featureData = assembler.transform(originData)
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier
val xgb = new XGBoostClassifier().setLabelCol("label").setFeaturesCol("features").setObjective("binary:logistic").setNumRound(10)
val xgbModel = xgb.fit(featureData)
xgbModel.setContribPredictionCol("raw_feature_shap")
import org.apache.spark.sql.functions.{col, udf}
val formatShap = udf((x: Seq[Double]) => { x.map(x => x.formatted("%.8f")).mkString(",") })
import org.apache.spark.ml.linalg.{Vector => oldVectors}
val formatFeature = udf((x: oldVectors) => { x.toArray.map(x => x.formatted("%.2f")).mkString(",") })
val result = xgbModel.transform(featureData).withColumn(
"feature_shap", formatShap(col("raw_feature_shap"))
).withColumn(
"feature", formatFeature(col("features"))
).select("feature", "feature_shap").repartition(1).cache()
result.write.option("header", "true").option("delimiter","\t").csv("result.txt")
xgbModel.nativeBooster.saveModel("model.pkl")
// hadoop fs -get ./result.txt/part-00000-*.csv result.csv
// hadoop fs -rm -r ./result.txt
import shap
import xgboost as xgb
import numpy as np
import pandas as pd
model = xgb.Booster({'nthread': 4})
model.load_model("model.pkl")
explainer = shap.TreeExplainer(model)
scala_result = pd.read_csv('result.csv', sep='\t')
scala_result.head()
# feature feature_shap
# 0 1.00,40.00,855.00 -0.02117009,0.01038337,-0.04706362,-0.00467462
# 1 6.00,35.00,584.00 -0.00913048,-0.00514835,0.01143846,-0.00467462
# 2 9.00,83.00,366.00 -0.00465891,0.00696741,-0.00735373,-0.00467462
# 3 5.00,24.00,788.00 -0.01884035,0.02896836,-0.00291048,-0.00467462
# 4 8.00,81.00,42.00 0.01391302,-0.01803247,0.04461136,-0.00467462
feature = np.array(scala_result['feature'].str.split(',', expand = True).values, dtype='float')
python_result = explainer.shap_values(feature).tolist()
explainer.expected_value
# -0.0046746163
for index in range(5):
print([format(x, '.1f') for x in feature[index]], [format(x, '.8f') for x in python_result[index]])
# ['1.0', '40.0', '855.0'] ['-0.02117009', '0.01038337', '-0.04706362']
# ['6.0', '35.0', '584.0'] ['-0.00913048', '-0.00514835', '0.01143846']
# ['9.0', '83.0', '366.0'] ['-0.00465891', '0.00696741', '-0.00735373']
# ['5.0', '24.0', '788.0'] ['-0.01884035', '0.02896836', '-0.00291048']
# ['8.0', '81.0', '42.0'] ['0.01391302', '-0.01803247', '0.04461136']
np.max(np.sum(np.abs(np.array(scala_result['feature_shap'].str.split(',', expand = True).values[:, :3], dtype='float') - python_result), axis = 1))
# 1.4835780859888403e-08
np.max(
np.sum(
np.abs(np.array(scala_result['feature_shap'].str.split(',', expand = True).values[:, :3], dtype='float') - np.around(python_result, 8)),
axis = 1
)
)
# 9.999999994736442e-09
踩坑记录
坑一: LGB 基准值问题
问题描述: 直接调用 shap.TreeExplainer(model)
取解析模型,
其基准值基于 Train data
# 预测集
X = test_data.values
# 基于预测集的解释器
explainer = shap.TreeExplainer(model, X)
坑二: LGB 输出值与SHAP和不等
问题描述: TreeExplainer
无法直接解析 LGB
模型的概率输出
方法一: Sigmoid 还原概率结果
\[ y_{prob} = \frac{1}{1 + e^{-y_{raw}}} \rightarrow y_{raw} = \log \frac{y}{1 - y_{prob}} \]
方法二: 直接解析概率结果
explainer = shap.TreeExplainer(
model, data = X,
model_output = "probability",
feature_dependence = "independent"
)
参考资料
【模型可解释笔记】 模型可解释之Shap
https://www.windism.cn/1116592221.html