【Model Interpretability Notes】Model Interpretability with SHAP

Introduction

SHAP (SHapley Additive exPlanations) is a game-theoretic approach for explaining the output of any machine learning model.

Shapley value

The Shapley value originates in cooperative game theory: in 1953, Nobel laureate in economics Lloyd S. Shapley proposed a principled allocation scheme for the problem below; the amount assigned to each player is called the Shapley value.

The game: N players cooperate to create a value v(N); the question is how to divide that value among them.

The allocation satisfies four axioms:

  1. Efficiency: the entire value is distributed
  2. Symmetry: two players who are interchangeable should receive the same payoff
  3. Dummy: a player who contributes nothing receives zero
  4. Additivity: if the same group completes two tasks, distributing the two payoffs together gives the same result as distributing them separately

The Shapley value is given by:

\[ \phi_i(N, v) = \sum_{S \subseteq N \setminus \{ i \}} \frac{|S|!\,(|N| - |S| - 1)!}{|N|!} \bigl(v(S \cup \{i\}) - v(S)\bigr) \]

For a worked numeric example, see 卫宫切嗣's answer on Zhihu.
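To make the formula concrete, here is a minimal brute-force sketch in Python for a hypothetical three-player game (the characteristic function v below is made up purely for illustration):

from itertools import combinations
from math import factorial

# Hypothetical characteristic function v: the value each coalition creates.
v = {(): 0, (1,): 10, (2,): 20, (3,): 30,
     (1, 2): 40, (1, 3): 50, (2, 3): 60, (1, 2, 3): 90}

def shapley(players, v, i):
    # Weighted sum of player i's marginal contributions over all coalitions S.
    others = [p for p in players if p != i]
    n = len(players)
    total = 0.0
    for r in range(len(others) + 1):
        for S in combinations(others, r):
            weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            total += weight * (v[tuple(sorted(S + (i,)))] - v[S])
    return total

print([shapley((1, 2, 3), v, i) for i in (1, 2, 3)])
# roughly [20.0, 30.0, 40.0]; the three values sum to v(N) = 90 (Efficiency)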

Verifying XGBoost SHAP computation on Spark

Test environment:

  • spark: 3.0.1, scala: 2.12.10, xgboost4j-spark: 1.5.0
  • python: 3.6, xgboost: 1.5.0, shap: 0.38.1

Conclusion: on randomly generated test data (100,000 rows, 3 features each), the maximum per-row absolute error between the Scala and Python SHAP values is 9.99e-09 (with both results rounded to 8 decimal places), i.e. the two computations agree.
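The check below trains a model with xgboost4j-spark, exports each row's feature-contribution vector alongside its features, saves the native booster, then reloads that booster in Python and compares the contributions with those from shap.TreeExplainer.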

// Generate 100,000 random rows: 3 numeric features and a pseudo-random binary label.
val originData = spark.createDataFrame(
    (new Array[Int](100000)).map(_ => {
    val f1 = math.floor(math.random * 10) + 1
    val f2 = math.floor(math.random * 100) + 1
    val f3 = math.floor(math.random * 1000) + 1
    val label = if (math.sin(f1 + f2 + f3) > 0) 1 else 0
    (f1, f2, f3, label)
    })
).toDF("f1", "f2", "f3", "label")
 
import org.apache.spark.ml.feature.VectorAssembler
val assembler = new VectorAssembler().setInputCols(Array[String]("f1", "f2", "f3")).setOutputCol("features")
val featureData = assembler.transform(originData)
 
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier
val xgb = new XGBoostClassifier().setLabelCol("label").setFeaturesCol("features").setObjective("binary:logistic").setNumRound(10)
 
val xgbModel = xgb.fit(featureData)
// Request per-row feature contributions (SHAP values); the output vector has
// one extra trailing element: the bias / base value.
xgbModel.setContribPredictionCol("raw_feature_shap")
 
import org.apache.spark.sql.functions.{col, udf}
val formatShap = udf((x: Seq[Double]) => { x.map(x => x.formatted("%.8f")).mkString(",") })
import org.apache.spark.ml.linalg.{Vector => MLVector}
val formatFeature = udf((x: MLVector) => { x.toArray.map(x => x.formatted("%.2f")).mkString(",") })
val result = xgbModel.transform(featureData).withColumn(
    "feature_shap", formatShap(col("raw_feature_shap"))
).withColumn(
    "feature", formatFeature(col("features"))
).select("feature", "feature_shap").repartition(1).cache()
 
// Export the per-row features and SHAP strings, plus the booster for Python.
result.write.option("header", "true").option("delimiter","\t").csv("result.txt")
// Note: despite the .pkl suffix, this is XGBoost's native binary model format.
xgbModel.nativeBooster.saveModel("model.pkl")
 
 
# hadoop fs -get ./result.txt/part-00000-*.csv result.csv
# hadoop fs -rm -r ./result.txt
import shap
import xgboost as xgb
import numpy as np
import pandas as pd
 
# Load the native booster exported from the Spark job.
model = xgb.Booster({'nthread': 4})
model.load_model("model.pkl")
explainer = shap.TreeExplainer(model)
 
scala_result = pd.read_csv('result.csv', sep='\t')
scala_result.head()
 
#              feature                                    feature_shap
# 0  1.00,40.00,855.00  -0.02117009,0.01038337,-0.04706362,-0.00467462
# 1  6.00,35.00,584.00  -0.00913048,-0.00514835,0.01143846,-0.00467462
# 2  9.00,83.00,366.00  -0.00465891,0.00696741,-0.00735373,-0.00467462
# 3  5.00,24.00,788.00  -0.01884035,0.02896836,-0.00291048,-0.00467462
# 4   8.00,81.00,42.00   0.01391302,-0.01803247,0.04461136,-0.00467462
 
# Parse the features back from the exported CSV and recompute SHAP values in Python.
feature = np.array(scala_result['feature'].str.split(',', expand = True).values, dtype='float')
python_result = explainer.shap_values(feature).tolist()
explainer.expected_value
# -0.0046746163
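# Note: the Spark contribution vector has one extra trailing element, the
# bias / base value, which matches explainer.expected_value above; the
# comparisons below therefore keep only the first 3 columns ([:, :3]).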
 
for index in range(5):
  print([format(x, '.1f') for x in feature[index]], [format(x, '.8f') for x in python_result[index]])
 
# ['1.0', '40.0', '855.0'] ['-0.02117009', '0.01038337', '-0.04706362']
# ['6.0', '35.0', '584.0'] ['-0.00913048', '-0.00514835', '0.01143846']
# ['9.0', '83.0', '366.0'] ['-0.00465891', '0.00696741', '-0.00735373']
# ['5.0', '24.0', '788.0'] ['-0.01884035', '0.02896836', '-0.00291048']
# ['8.0', '81.0', '42.0'] ['0.01391302', '-0.01803247', '0.04461136']
 
# Max per-row L1 difference: Spark values were exported with 8 decimals, Python values are at full precision.
np.max(np.sum(np.abs(np.array(scala_result['feature_shap'].str.split(',', expand = True).values[:, :3], dtype='float') - python_result), axis = 1))
# 1.4835780859888403e-08
# Same comparison after also rounding the Python values to 8 decimals:
np.max(
    np.sum(
        np.abs(np.array(scala_result['feature_shap'].str.split(',', expand = True).values[:, :3], dtype='float') - np.around(python_result, 8)),
        axis = 1
    )
)
# 9.999999994736442e-09

Pitfalls

Pitfall 1: LightGBM base value

Problem: calling shap.TreeExplainer(model) directly to explain the model yields a base value derived from the training data. To compute the base value over the prediction set instead, pass it as background data:

# prediction set
X = test_data.values
# explainer whose base value is computed over the prediction set
explainer = shap.TreeExplainer(model, X)
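As a quick sanity check (a sketch; it assumes model is a lightgbm Booster for a binary task, explained in the default raw-score space), the base value should now be close to the mean raw prediction over X:

import numpy as np

# With background data, the explainer's base value is (approximately) the mean
# model output over that background rather than over the training data.
raw_pred = model.predict(X, raw_score=True)  # raw margin of a lightgbm Booster
print(explainer.expected_value, np.mean(raw_pred))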

Pitfall 2: LightGBM output does not equal the sum of its SHAP values

Problem: TreeExplainer cannot directly explain a LightGBM model's probability output; by default, the base value plus the SHAP values sums to the raw log-odds score rather than the probability.

Option 1: convert between raw score and probability with the sigmoid

\[ y_{prob} = \frac{1}{1 + e^{-y_{raw}}} \rightarrow y_{raw} = \log \frac{y_{prob}}{1 - y_{prob}} \]
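A minimal sketch of this option, reusing model, X and explainer from above (it assumes a binary-task lightgbm Booster, for which shap_values returns a single 2-D array; the sklearn LGBMClassifier wrapper returns a list of per-class arrays instead):

import numpy as np

shap_values = explainer.shap_values(X)  # SHAP values in raw (log-odds) space
# Base value plus the per-feature SHAP values reconstructs each row's raw score ...
y_raw = explainer.expected_value + np.sum(shap_values, axis=1)
# ... and the sigmoid maps it back to the probability the model reports.
y_prob = 1.0 / (1.0 + np.exp(-y_raw))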

Option 2: explain the probability output directly

explainer = shap.TreeExplainer(
    model, data = X,
    model_output = "probability",
    feature_perturbation = "interventional"  # renamed from feature_dependence = "independent"
)
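Note that model_output = "probability" requires background data and the interventional perturbation mode; the resulting SHAP values, together with the base value, then sum to the predicted probability itself, so no sigmoid correction is needed.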
