|
11 | 11 | "cell_type": "markdown", |
12 | 12 | "metadata": {}, |
13 | 13 | "source": [ |
14 | | - "In this example we develop a small fraud detection model for credit card transactions based on XGBoost, export it to TorchScript using Hummingbird (https://github.com/microsoft/hummingbird) and run Shapley Value Sampling explanations (see https://captum.ai/api/shapley_value_sampling.html for reference) on it, also exported to TorchScript.\n", |
| 14 | + "In this example we develop a small fraud detection model for credit card transactions based on XGBoost, export it to TorchScript using Hummingbird (https://github.com/microsoft/hummingbird) and run Shapley Value Sampling explanations (see https://captum.ai/api/shapley_value_sampling.html for reference) on it, via torch script.\n", |
15 | 15 | "\n", |
16 | | - "We load both the original model and the explainability model in RedisAI and trigger them in a DAG." |
| 16 | + "We load both the original model and the explainability script in RedisAI and trigger them in a DAG." |
17 | 17 | ] |
18 | 18 | }, |
19 | 19 | { |
|
39 | 39 | }, |
40 | 40 | { |
41 | 41 | "cell_type": "code", |
42 | | - "execution_count": 45, |
| 42 | + "execution_count": 1, |
43 | 43 | "metadata": {}, |
44 | 44 | "outputs": [], |
45 | 45 | "source": [ |
|
102 | 102 | "name": "stderr", |
103 | 103 | "output_type": "stream", |
104 | 104 | "text": [ |
105 | | - "/home/dvirdukhan/.local/lib/python3.8/site-packages/xgboost/sklearn.py:1146: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].\n", |
| 105 | + "/home/dvirdukhan/Code/redisai-examples/venv/lib/python3.8/site-packages/xgboost/sklearn.py:1146: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].\n", |
106 | 106 | " warnings.warn(label_encoder_deprecation_msg, UserWarning)\n" |
107 | 107 | ] |
108 | 108 | }, |
109 | 109 | { |
110 | 110 | "name": "stdout", |
111 | 111 | "output_type": "stream", |
112 | 112 | "text": [ |
113 | | - "[10:32:38] WARNING: ../src/learner.cc:573: \n", |
| 113 | + "[05:49:05] WARNING: ../src/learner.cc:573: \n", |
114 | 114 | "Parameters: { \"label_encoder\" } might not be used.\n", |
115 | 115 | "\n", |
116 | 116 | " This may not be accurate due to some parameters are only used in language bindings but\n", |
117 | 117 | " passed down to XGBoost core. Or some parameters are not used but slip through this\n", |
118 | 118 | " verification. Please open an issue if you find above cases.\n", |
119 | 119 | "\n", |
120 | 120 | "\n", |
121 | | - "[10:32:38] WARNING: ../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n" |
| 121 | + "[05:49:05] WARNING: ../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n" |
122 | 122 | ] |
123 | 123 | }, |
124 | 124 | { |
|
223 | 223 | "cell_type": "markdown", |
224 | 224 | "metadata": {}, |
225 | 225 | "source": [ |
226 | | - "We are interested to explore are casesof fraud, so we extract them from the test set." |
| 226 | + "We are interested to explore are cases of fraud, so we extract them from the test set." |
227 | 227 | ] |
228 | 228 | }, |
229 | 229 | { |
|
302 | 302 | }, |
303 | 303 | { |
304 | 304 | "cell_type": "code", |
305 | | - "execution_count": 12, |
| 305 | + "execution_count": 10, |
306 | 306 | "metadata": {}, |
307 | 307 | "outputs": [], |
308 | 308 | "source": [ |
|
311 | 311 | }, |
312 | 312 | { |
313 | 313 | "cell_type": "code", |
314 | | - "execution_count": 13, |
| 314 | + "execution_count": 11, |
315 | 315 | "metadata": {}, |
316 | 316 | "outputs": [], |
317 | 317 | "source": [ |
|
331 | 331 | }, |
332 | 332 | { |
333 | 333 | "cell_type": "code", |
334 | | - "execution_count": 14, |
| 334 | + "execution_count": 12, |
335 | 335 | "metadata": {}, |
336 | 336 | "outputs": [], |
337 | 337 | "source": [ |
|
349 | 349 | }, |
350 | 350 | { |
351 | 351 | "cell_type": "code", |
352 | | - "execution_count": 15, |
| 352 | + "execution_count": 13, |
353 | 353 | "metadata": {}, |
354 | 354 | "outputs": [], |
355 | 355 | "source": [ |
|
369 | 369 | }, |
370 | 370 | { |
371 | 371 | "cell_type": "code", |
372 | | - "execution_count": 16, |
| 372 | + "execution_count": 14, |
373 | 373 | "metadata": {}, |
374 | 374 | "outputs": [ |
375 | 375 | { |
|
378 | 378 | "True" |
379 | 379 | ] |
380 | 380 | }, |
381 | | - "execution_count": 16, |
| 381 | + "execution_count": 14, |
382 | 382 | "metadata": {}, |
383 | 383 | "output_type": "execute_result" |
384 | 384 | } |
|
389 | 389 | "torch.equal(loaded_output_classes, xgboost_output_classes)" |
390 | 390 | ] |
391 | 391 | }, |
| 392 | + { |
| 393 | + "cell_type": "markdown", |
| 394 | + "metadata": {}, |
| 395 | + "source": [ |
| 396 | + "## Explainer Script" |
| 397 | + ] |
| 398 | + }, |
| 399 | + { |
| 400 | + "cell_type": "markdown", |
| 401 | + "metadata": {}, |
| 402 | + "source": [ |
| 403 | + "The script `torch_shapely.py` is a torch script degined specificly running on RedisAI, and utilizes RedisAI extension for torch script, that allows to run any model stored in RedisAI from within the script. Let's go over the details:\n", |
| 404 | + "\n", |
| 405 | + "In RedisAI, each entry point (function in script) should have the signature:\n", |
| 406 | + "`function_name(tensors: List[Tensor], keys: List[str], args: List[str]):`\n", |
| 407 | + "In our case our entry point is `shapely_sample(tensors: List[Tensor], keys: List[str], args: List[str]):` and the parameters are:\n", |
| 408 | + "```\n", |
| 409 | + "Tensors:\n", |
| 410 | + " tensors[0] - x : Input tensor to the model\n", |
| 411 | + " tensors[1] - baselines : Optional - reference values which replace each feature when\n", |
| 412 | + " ablated; if no baselines are provided, baselines are set\n", |
| 413 | + " to all zeros\n", |
| 414 | + "\n", |
| 415 | + "Keys:\n", |
| 416 | + " keys[0] - model_key: Redis key name where the model is stored as RedisAI model.\n", |
| 417 | + " \n", |
| 418 | + "Args:\n", |
| 419 | + " args[0] - n_samples: number of random feature permutations performed\n", |
| 420 | + " args[1] - number_of_outputs - number of model outputs\n", |
| 421 | + " args[2] - output_tensor_index - index of the tested output tensor\n", |
| 422 | + " args[3] - Optional - target: output indices for which Shapley Value Sampling is\n", |
| 423 | + " computed; if model returns a single scalar, target can be\n", |
| 424 | + " None\n", |
| 425 | + "```\n", |
| 426 | + "\n", |
| 427 | + "The script will create `n_samples` amount of permutations of the input features. For each permutation it will check for each feature what was its contribution to the result by running the model repeatedly on a new subset of input features.\n" |
| 428 | + ] |
| 429 | + }, |
392 | 430 | { |
393 | 431 | "cell_type": "markdown", |
394 | 432 | "metadata": {}, |
|
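The sampling procedure described in the explainer cell above can be sketched in plain, server-free Python. This is a toy illustration of the Shapley Value Sampling algorithm only, not the notebook's actual `torch_shapely.py` script: the function name, the toy linear model, and the list-based "tensors" here are all hypothetical stand-ins.

```python
import random

def shapley_value_sampling(model, x, baseline=None, n_samples=20, seed=0):
    """Estimate Shapley values by sampling random feature permutations."""
    n = len(x)
    # If no baselines are provided, use all zeros (as in the script's docs).
    baseline = baseline if baseline is not None else [0.0] * n
    rng = random.Random(seed)
    attributions = [0.0] * n
    for _ in range(n_samples):
        perm = list(range(n))
        rng.shuffle(perm)
        current = list(baseline)              # start every pass at the baseline
        prev_out = model(current)
        for i in perm:
            current[i] = x[i]                 # switch feature i to its input value
            out = model(current)
            attributions[i] += out - prev_out  # credit the marginal change to i
            prev_out = out
    return [a / n_samples for a in attributions]

# Toy linear model: for linear models the sampled estimates are exact,
# namely w_i * (x_i - baseline_i) for each feature.
weights = [0.5, -1.0, 2.0]
linear = lambda v: sum(w * vi for w, vi in zip(weights, v))
print(shapley_value_sampling(linear, [1.0, 1.0, 1.0]))  # -> [0.5, -1.0, 2.0]
```

On a non-linear model (like the XGBoost classifier here) the estimates are approximate and improve with larger `n_samples`, which is exactly the `args[0]` knob exposed by the script.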
400 | 438 | "cell_type": "markdown", |
401 | 439 | "metadata": {}, |
402 | 440 | "source": [ |
403 | | - "At this point we can load the models we exported into RedisAI and serve them from there. After making sure RedisAI is running, we initialize the client." |
| 441 | + "At this point we can load the model we exported into RedisAI and serve it from there. We will also load the `torch_shapely.py` script, that allows calculating the Shapely value of a model, from within RedisAI. After making sure RedisAI is running, we initialize the client." |
404 | 442 | ] |
405 | 443 | }, |
406 | 444 | { |
407 | 445 | "cell_type": "code", |
408 | | - "execution_count": 34, |
| 446 | + "execution_count": 15, |
409 | 447 | "metadata": {}, |
410 | 448 | "outputs": [], |
411 | 449 | "source": [ |
|
418 | 456 | "cell_type": "markdown", |
419 | 457 | "metadata": {}, |
420 | 458 | "source": [ |
421 | | - "We read the model and the explainer from the saved TorchScript." |
| 459 | + "We read the model and the script." |
422 | 460 | ] |
423 | 461 | }, |
424 | 462 | { |
425 | 463 | "cell_type": "code", |
426 | | - "execution_count": 64, |
| 464 | + "execution_count": 16, |
427 | 465 | "metadata": {}, |
428 | 466 | "outputs": [], |
429 | 467 | "source": [ |
|
438 | 476 | "cell_type": "markdown", |
439 | 477 | "metadata": {}, |
440 | 478 | "source": [ |
441 | | - "We load both models into RedisAI." |
| 479 | + "We load both movel and script into RedisAI." |
442 | 480 | ] |
443 | 481 | }, |
444 | 482 | { |
445 | 483 | "cell_type": "code", |
446 | | - "execution_count": 65, |
| 484 | + "execution_count": 17, |
447 | 485 | "metadata": {}, |
448 | 486 | "outputs": [ |
449 | 487 | { |
|
452 | 490 | "'OK'" |
453 | 491 | ] |
454 | 492 | }, |
455 | | - "execution_count": 65, |
| 493 | + "execution_count": 17, |
456 | 494 | "metadata": {}, |
457 | 495 | "output_type": "execute_result" |
458 | 496 | } |
|
466 | 504 | "cell_type": "markdown", |
467 | 505 | "metadata": {}, |
468 | 506 | "source": [ |
469 | | - "All, set, it's now test time. We reuse our `X_test_fraud` NumPy array we created previously. We set it, run both models, and get predictions and explanations as arrays." |
| 507 | + "All set, it's now test time. We reuse our `X_test_fraud` NumPy array we created previously. We set it, and run the Shapley script and get explanations as arrays." |
470 | 508 | ] |
471 | 509 | }, |
472 | 510 | { |
473 | 511 | "cell_type": "code", |
474 | | - "execution_count": 66, |
475 | | - "metadata": {}, |
476 | | - "outputs": [], |
477 | | - "source": [ |
478 | | - "rai.tensorset(\"fraud_input\", X_test_fraud, dtype=\"float\")\n", |
479 | | - "\n", |
480 | | - "rai.scriptexecute(\"shapely_script\", \"shapely_sample\", inputs = [\"fraud_input\"], keys = [\"fraud_detection_model\"], args = [\"20\", \"2\", \"0\"], outputs=[\"fraud_explanations\"])\n", |
481 | | - "\n", |
482 | | - "rai_expl = rai.tensorget(\"fraud_explanations\")" |
483 | | - ] |
484 | | - }, |
485 | | - { |
486 | | - "cell_type": "markdown", |
487 | | - "metadata": {}, |
488 | | - "source": [ |
489 | | - "We check whether the winning feature is consistent to what we found earlier." |
490 | | - ] |
491 | | - }, |
492 | | - { |
493 | | - "cell_type": "code", |
494 | | - "execution_count": 67, |
| 512 | + "execution_count": 18, |
495 | 513 | "metadata": {}, |
496 | 514 | "outputs": [ |
497 | 515 | { |
|
503 | 521 | } |
504 | 522 | ], |
505 | 523 | "source": [ |
| 524 | + "rai.tensorset(\"fraud_input\", X_test_fraud, dtype=\"float\")\n", |
| 525 | + "\n", |
| 526 | + "rai.scriptexecute(\"shapely_script\", \"shapely_sample\", inputs = [\"fraud_input\"], keys = [\"fraud_detection_model\"], args = [\"20\", \"2\", \"0\"], outputs=[\"fraud_explanations\"])\n", |
| 527 | + "\n", |
| 528 | + "rai_expl = rai.tensorget(\"fraud_explanations\")\n", |
| 529 | + "\n", |
506 | 530 | "winning_feature_redisai = np.argmax(rai_expl[0], axis=0)\n", |
507 | 531 | "\n", |
508 | | - "print(\"Winning feature: %d\" % winning_feature_redisai)\n", |
509 | | - "\n" |
510 | | - ] |
511 | | - }, |
512 | | - { |
513 | | - "cell_type": "code", |
514 | | - "execution_count": 71, |
515 | | - "metadata": {}, |
516 | | - "outputs": [ |
517 | | - { |
518 | | - "data": { |
519 | | - "text/plain": [ |
520 | | - "array([ 0. , -0.05, 0. , 0.05, 0.2 , 0. , 0.1 , 0. , 0. ,\n", |
521 | | - " -0.05, 0.15, 0. , 0.1 , 0. , 0.5 , 0. , 0.05, -0.1 ,\n", |
522 | | - " 0. , 0. , 0.05, 0. , 0. , 0. , 0. , 0. , 0. ,\n", |
523 | | - " 0. , 0. , 0. ])" |
524 | | - ] |
525 | | - }, |
526 | | - "execution_count": 71, |
527 | | - "metadata": {}, |
528 | | - "output_type": "execute_result" |
529 | | - } |
530 | | - ], |
531 | | - "source": [ |
532 | | - "rai_expl[0]" |
| 532 | + "print(\"Winning feature: %d\" % winning_feature_redisai)" |
533 | 533 | ] |
534 | 534 | }, |
535 | 535 | { |
|
541 | 541 | }, |
542 | 542 | { |
543 | 543 | "cell_type": "code", |
544 | | - "execution_count": 74, |
| 544 | + "execution_count": 19, |
545 | 545 | "metadata": {}, |
546 | 546 | "outputs": [ |
547 | 547 | { |
548 | 548 | "data": { |
549 | 549 | "text/plain": [ |
550 | | - "<redisai.dag.Dag at 0x7fb118524d30>" |
| 550 | + "<redisai.dag.Dag at 0x7f8a941a52e0>" |
551 | 551 | ] |
552 | 552 | }, |
553 | | - "execution_count": 74, |
| 553 | + "execution_count": 19, |
554 | 554 | "metadata": {}, |
555 | 555 | "output_type": "execute_result" |
556 | 556 | } |
|
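The key idea behind the DAG used above is that operations are queued against a shared key space and run as one atomic chain, with the output key of one step feeding the next. The following toy, in-process sketch illustrates that wiring; the class and method names loosely mirror the redisai client but are hypothetical, not the real API.

```python
class ToyDag:
    """Toy stand-in for a RedisAI DAG: queue ops, then execute them in order."""

    def __init__(self):
        self.store = {}   # stands in for the Redis key space
        self.ops = []     # queued operations

    def tensorset(self, key, value):
        self.ops.append(lambda: self.store.__setitem__(key, value))
        return self

    def apply(self, fn, inputs, outputs):
        # Stands in for modelexecute/scriptexecute: read input keys,
        # run the function, write the result to the output key.
        def run():
            self.store[outputs[0]] = fn(*(self.store[k] for k in inputs))
        self.ops.append(run)
        return self

    def tensorget(self, key):
        self.ops.append(lambda: self.store[key])
        return self

    def execute(self):
        result = None
        for op in self.ops:   # run the whole chain; return the last op's value
            result = op()
        return result

dag = ToyDag()
dag.tensorset("x", 3).apply(lambda v: v * 2, ["x"], ["y"]).tensorget("y")
print(dag.execute())  # -> 6
```

In the notebook the real chain is analogous: set the fraud input tensor, execute the model and the Shapley script against it, and read back the prediction and explanation tensors in a single round trip.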
573 | 573 | }, |
574 | 574 | { |
575 | 575 | "cell_type": "code", |
576 | | - "execution_count": 75, |
| 576 | + "execution_count": 20, |
577 | 577 | "metadata": {}, |
578 | 578 | "outputs": [], |
579 | 579 | "source": [ |
|
584 | 584 | }, |
585 | 585 | { |
586 | 586 | "cell_type": "code", |
587 | | - "execution_count": 76, |
| 587 | + "execution_count": 21, |
588 | 588 | "metadata": {}, |
589 | 589 | "outputs": [ |
590 | 590 | { |
|
600 | 600 | " 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1])" |
601 | 601 | ] |
602 | 602 | }, |
603 | | - "execution_count": 76, |
| 603 | + "execution_count": 21, |
604 | 604 | "metadata": {}, |
605 | 605 | "output_type": "execute_result" |
606 | 606 | } |
|
618 | 618 | }, |
619 | 619 | { |
620 | 620 | "cell_type": "code", |
621 | | - "execution_count": 77, |
| 621 | + "execution_count": 22, |
622 | 622 | "metadata": {}, |
623 | 623 | "outputs": [ |
624 | 624 | { |
|
638 | 638 | ], |
639 | 639 | "metadata": { |
640 | 640 | "kernelspec": { |
641 | | - "display_name": "Python 3", |
| 641 | + "display_name": "Python 3 (ipykernel)", |
642 | 642 | "language": "python", |
643 | 643 | "name": "python3" |
644 | 644 | }, |
|
652 | 652 | "name": "python", |
653 | 653 | "nbconvert_exporter": "python", |
654 | 654 | "pygments_lexer": "ipython3", |
655 | | - "version": "3.8.5" |
| 655 | + "version": "3.8.10" |
656 | 656 | } |
657 | 657 | }, |
658 | 658 | "nbformat": 4, |
|