PnPXAI: A Universal Framework for the Automated Generation of Objective Aligned and Reliable Explanations
This repository contains the official implementation and experimental code for the paper "PnPXAI: A Universal Framework for the Automated Generation of Objective Aligned and Reliable Explanations".
PnPXAI is a framework designed to overcome the challenges in generating reliable explanations for complex AI models through an end-to-end automated pipeline. This pipeline includes model architecture detection, applicable explainer recommendation, objective-driven hyperparameter optimization (HPO), and evaluation. This repository provides the necessary code to set up the environment and reproduce the key experiments presented in the paper, demonstrating PnPXAI's effectiveness across various tasks and modalities.
We provide two ways to set up the environment: using Docker (recommended for exact reproducibility) or manual installation.
**Docker setup (recommended):**

- **Clone this repository:**

  ```bash
  git clone https://github.com/OpenXAIProject/pnpxai-experiments.git
  cd pnpxai-experiments
  ```
- **Build the Docker image:** The provided `Dockerfile` includes all necessary system dependencies and Python packages.

  ```bash
  docker build -t seongun/ubuntu22.04-cuda12.2.2-cudnn8-pytorch2.1:base .
  ```

  Alternatively, you can pull the pre-built image from Docker Hub:

  ```bash
  docker pull seongun/ubuntu22.04-cuda12.2.2-cudnn8-pytorch2.1:base
  ```
- **Run the Docker container:** This command starts an interactive container, mounts the project code, and assigns GPUs (the example below assigns GPU 0). Adjust the mount paths and `--gpus` devices as needed. The ImageNet mounts are recommended for running Experiment 2.

  ```bash
  docker run -it \
    -v "$(pwd)":/root/pnpxai-experiments \
    -v /PATH_TO_IMAGENET/ImageNet1k:/root/pnpxai-experiments/data/ImageNet/ImageNet1k:ro \
    -v /PATH_TO_IMAGENET/ImageNet1k_info:/root/pnpxai-experiments/data/ImageNet/ImageNet1k_info:ro \
    --gpus '"device=0"' \
    --name pnpxai_exp \
    seongun/ubuntu22.04-cuda12.2.2-cudnn8-pytorch2.1:base
  ```
- **Install the local package inside the container:** Once inside the container, navigate to the project directory and install the `pnpxai-experiments` package in editable mode.

  ```bash
  cd /root/pnpxai-experiments
  pip install -e .
  ```
You are now ready to run the experiments.
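If you want to confirm that PyTorch inside the container can see the GPU you assigned, a minimal sanity-check snippet (not part of the repository scripts) is:

```python
# Minimal sanity check, assuming the container's Python environment;
# not part of the repository scripts.
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```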
**Manual installation:**

- **Clone this repository:**

  ```bash
  git clone https://github.com/OpenXAIProject/pnpxai-experiments.git
  cd pnpxai-experiments
  ```
- **Create an environment:** Using Conda or venv is recommended.

  ```bash
  # Using Conda (example)
  conda create -n pnpxai_env python=3.10
  conda activate pnpxai_env
  ```
- **Install dependencies:** Install PyTorch matching your CUDA version (see the PyTorch website), then install the required packages.

  ```bash
  # Example: Install PyTorch for CUDA 12.1 (adjust if needed)
  # pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121

  # Install other requirements (assuming requirements.txt exists)
  pip install -r requirements.txt
  ```
- **Install the local package:** Install the `pnpxai-experiments` code as an editable package.

  ```bash
  pip install -e .
  ```
This repository contains code for various experiments presented in the PnPXAI paper. Each experiment can typically be run using scripts located in `experiments/scripts/`.
This experiment qualitatively analyzes the effect of HPO (optimizing for AbPC) on explanations produced by `LRPUniformEpsilon`, `IntegratedGradients`, and `KernelShap` for ImageNet1k samples, evaluating the change in faithfulness metrics (MoRF, LeRF, AbPC).
- **Data (ImageNet):** The subset of ImageNet1k used in this experiment (one sample per label, 1,000 samples in total) is hosted on Hugging Face Hub: ➡️ geonhyeongkim/imagenet-samples-for-pnpxai-experiments. The script automatically downloads the necessary files when first executed. For more details on the data loading process, refer to the `get_imagenet_samples_from_hf` function within `experiments/utils/datasets.py`.
- **Model (ResNet-18):** This script uses a standard ResNet-18 model pre-trained on ImageNet, loaded directly from `torchvision.models`.
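For orientation, the sketch below shows how the model and samples could be obtained outside the script. The `torchvision` call matches the standard pre-trained ResNet-18; the `datasets.load_dataset` call and its split name are assumptions, since the actual loading logic lives in `get_imagenet_samples_from_hf`.

```python
# Hedged sketch only: the experiment script performs this loading itself via
# get_imagenet_samples_from_hf; the dataset split name here is an assumption.
from datasets import load_dataset
from torchvision.models import resnet18, ResNet18_Weights

samples = load_dataset("geonhyeongkim/imagenet-samples-for-pnpxai-experiments", split="train")

weights = ResNet18_Weights.IMAGENET1K_V1
model = resnet18(weights=weights).eval()   # standard ImageNet-pre-trained ResNet-18
preprocess = weights.transforms()          # matching input preprocessing
```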
```bash
python -m experiments.scripts.analyze_imagenet_hpo \
    --data_id 72 \
    --save_dir results/analyze_imagenet_hpo/ \
    --seed 42 \
    --n_trials 100 \
    --analyze \
    --visualize
```

- `--data_id <INSTANCE_ID>`: The specific index (0-999) of the data instance from the Hugging Face dataset to analyze.
- `--save_dir <SAVE_DIR>`: Directory where experiment results are saved.
- `--n_trials <NUM_TRIALS>`: Number of trials for hyperparameter optimization (HPO). Defaults to `100`.
- `--analyze`: Runs the HPO process and saves the raw results to `<SAVE_DIR>/raw/<INSTANCE_ID>.pkl`.
- `--visualize`: Loads the previously saved results for the specified `--data_id` and generates a visualization PDF comparing default vs. optimized attributions and metrics. Saves the figure to `<SAVE_DIR>/figures/<INSTANCE_ID>.pdf`. Requires results to be saved first (using `--analyze`).
Results will be saved under the `<SAVE_DIR>` directory, organized by data instance ID.
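To inspect a saved raw result afterwards, a minimal sketch (assuming the `--save_dir` and `--data_id` values from the command above; the pickle's internal structure is defined by the script and not documented here) is:

```python
# Hedged sketch: inspect a raw HPO result saved by --analyze.
# The path follows <SAVE_DIR>/raw/<INSTANCE_ID>.pkl from the option list above.
import pickle
from pathlib import Path

result_path = Path("results/analyze_imagenet_hpo/raw/72.pkl")
with result_path.open("rb") as f:
    raw_result = pickle.load(f)

print(type(raw_result))  # structure depends on the experiment script
```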
This experiment evaluates a grid of hyperparameter combinations for various explainers on a subset of the ImageNet validation set and generates plots comparing the impact on evaluation metrics.
- **Data (ImageNet):** This script requires the ImageNet 1k dataset. You must download it from the official site (requires registration). The script assumes the validation set is organized in the `data/ImageNet/` directory as follows; the `docker run` command in the Setup section already includes the recommended mounts for these paths. A small sanity-check sketch follows this list.

  ```
  data/
  └── ImageNet/
      ├── ImageNet1k/
      │   └── val/
      │       └── val/
      │           └── ILSVRC2012_val_IDX.JPEG
      └── ImageNet1k_info/
          ├── ImageNet_class_index.json
          └── ImageNet_val_label.txt
  ```
- **Model (ResNet-18):** This script uses a standard ResNet-18 model pre-trained on ImageNet, loaded directly from `torchvision.models`.
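Before running the grid search, it can help to confirm that the mounted ImageNet paths match the expected layout. A minimal check, assuming the directory structure above (a hypothetical helper, not part of the repository):

```python
# Hedged sketch: verify the expected ImageNet layout before running the script.
from pathlib import Path

data_dir = Path("data/ImageNet")
expected = [
    data_dir / "ImageNet1k" / "val" / "val",
    data_dir / "ImageNet1k_info" / "ImageNet_class_index.json",
    data_dir / "ImageNet1k_info" / "ImageNet_val_label.txt",
]
for path in expected:
    print(f"{path}: {'found' if path.exists() else 'MISSING'}")
```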
```bash
python -m experiments.scripts.analyze_imagenet_hpo_impact \
    --data_dir data/ImageNet \
    --batch_size 4 \
    --analyze \
    --visualize \
    --eval_explainer smooth_grad lrp_epsilon_gamma_box guided_grad_cam integrated_gradients
```

- `--data_dir <PATH>`: Path to the root `ImageNet` directory (which contains `ImageNet1k` and `ImageNet1k_info`).
- `--analyze`: Runs the full grid search evaluation and saves the raw metric results.
- `--visualize`: Loads the raw results and generates the final plots.
- `--eval_explainer <LIST>`: A space-separated list of explainers to analyze (e.g., `smooth_grad`, `guided_grad_cam`).
- This experiment runs on a 128-image subset (the default for `--data_to` is `128`).
- Most explainers (e.g., `smooth_grad`, `guided_grad_cam`, `lrp_epsilon_gamma_box`) can run with a large batch size, e.g., `--batch_size 128`.
- `integrated_gradients` is memory-intensive and may require a smaller batch size (e.g., `--batch_size 4`), depending on your GPU.
Raw results (`.pkl`, `.csv`) will be saved under the `results/hpo_impact_imagenet/raw/resnet18/` directory, and the generated figures (`.pdf`) will be saved in `results/hpo_impact_imagenet/figures/resnet18/`.
This experiment analyzes the effect of HPO (optimizing for AbPC) on explanations for a liver tumor CT slice, evaluating the change in ground truth agreement (Relevance Mass/Rank Accuracy).
- **Data (Liver Tumor):** The Liver Tumor Classification dataset used in this experiment is hosted on Hugging Face Hub: ➡️ seongun/liver-tumor-classification. This dataset contains individual 2D CT scan slices derived from the original LiTS dataset. The script automatically downloads the necessary files when first executed. For more details on the data loading process, refer to the `get_livertumor_dataset_from_hf` function within `experiments/utils/datasets.py`.
- **Model (ResNet-50 Liver Tumor):** The pre-trained ResNet-50 model adapted for this task is hosted on Hugging Face Hub: ➡️ seongun/resnet50-livertumor. Similar to the dataset, the script automatically downloads the model weights. The model architecture is defined in `experiments/models/liver_tumor.py`. For more details on model loading, refer to the `get_livertumor_model_from_hf` function within `experiments/utils/models.py`.
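If you want to fetch the model files manually (the script does this for you via `get_livertumor_model_from_hf`), a minimal sketch using the `huggingface_hub` client is shown below; the exact file names inside the model repository are not assumed.

```python
# Hedged sketch: manually cache the liver-tumor model files from the Hub.
# The experiment script normally handles this via get_livertumor_model_from_hf.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="seongun/resnet50-livertumor")
print("Model files cached at:", local_dir)
```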
```bash
python -m experiments.scripts.analyze_livertumor_hpo \
    --data_id 2280 \
    --n_trials 100 \
    --analyze \
    --visualize
```

- `--data_id <INSTANCE_ID>`: The specific index (e.g., `2280`) of the data instance from the Hugging Face dataset to analyze.
- `--n_trials <NUM_TRIALS>`: Number of trials for hyperparameter optimization (HPO). Defaults to `100`.
- `--analyze`: Runs the HPO process and saves the raw results (`.pkl` files for the default run and the optimized run) to `results/hpo_analysis_livertumor/raw/<INSTANCE_ID>/`.
- `--visualize`: Loads the previously saved results for the specified `--data_id` and generates a visualization PDF comparing default vs. optimized attributions and metrics. Saves the figure to `results/hpo_analysis_livertumor/figures/<INSTANCE_ID>.pdf`. Requires results to be saved first (using `--analyze`).
Results will be saved under the `results/hpo_analysis_livertumor/` directory, organized by data instance ID.
This experiment analyzes the effect of HPO (optimizing for AbPC) on explanations for acute kidney injury (AKI) detection from medical data, evaluating the change in ground truth agreement (Relevance Mass/Rank Accuracy).
- **Data (MIMIC-III):** The MIMIC-III dataset used in this experiment is hosted on PhysioNet: ➡️ MIMIC-III Clinical Database. This work uses the latest version of MIMIC-III. To use the analysis script, the dataset must be downloaded, built, and formatted: after downloading it from the official source, build the PostgreSQL version of the dataset with the official GitHub code, then format the built dataset with the scripts in the `/data/mimiciii` directory. Thorough instructions on the data transformation are provided in its `README.md`. Once the formatted data has been generated, the analysis script loads the necessary files when first executed. For more details on the data loading process, refer to the `get_aki_dataset` function within `experiments/utils/datasets.py`.
- **Model (AKI Classifier):** The pre-trained linear model adapted for this task is hosted on Hugging Face Hub: ➡️ enver1323/aki-classifier. Similar to the dataset, the script automatically downloads the model weights. The model architecture is defined in `experiments/models/aki.py`. For more details on model loading, refer to the `get_aki_model_from_hf` function within `experiments/utils/models.py`.
```bash
python -m experiments.scripts.analyze_aki_hpo \
    --n_trials 100 \
    --analyze \
    --visualize
```

- `--n_trials <NUM_TRIALS>`: Number of trials for hyperparameter optimization (HPO). Defaults to `20`.
- `--analyze`: Runs the HPO process and saves the top-K columns as well as the attributions (`.json` and `.npy` files, respectively, for the default run and the optimized run) to `results/hpo_analysis_aki/topk/<EXPLAINER>/`.
- `--visualize`: Loads the previously saved results and generates a visualization PDF comparing default vs. optimized attributions and metrics. Saves the figure to `results/hpo_analysis_aki/explanation_summary.pdf`. Requires results to be saved first (using `--analyze`).
Results will be saved under the `results/hpo_analysis_aki/` directory, organized by explainer name.
This experiment analyzes the effect of HPO (optimizing for multiple metrics) on explanations for an ECG time series dataset, evaluating the change in metric values.
- **Data (ECG):** The ECG dataset used in this experiment is hosted on Hugging Face Hub: ➡️ enver1323/ucr-twoleadecg. This dataset contains time series ECG segments derived from the original UCR archive. The script automatically downloads the necessary files when first executed. For more details on the data loading process, refer to the `get_ecg_dataset_from_hf` function within `experiments/utils/datasets.py`.
- **Model (ResNetPlus):** The pre-trained ResNetPlus model adapted for this task is hosted on Hugging Face Hub: ➡️ enver1323/resnetplus-classification-ecg. Similar to the dataset, the script automatically downloads the model weights. The model architecture is defined in `experiments/models/ecg/resnet_plus.py`. For more details on model loading, refer to the `get_ecg_resnet_from_hf` function within `experiments/utils/models.py`.
- **Model (PatchTST):** The pre-trained PatchTST model adapted for this task is hosted on Hugging Face Hub: ➡️ enver1323/patchtst-classification-ecg. Similar to the dataset, the script automatically downloads the model weights. The model architecture is defined in `experiments/models/ecg/patchtst.py`. For more details on model loading, refer to the `get_ecg_patchtst_from_hf` function within `experiments/utils/models.py`.
```bash
python -m experiments.scripts.analyze_ecg_hpo \
    --model resnet_plus \
    --out_file results/hpo_analysis_ecg/explanations_summary.csv
```

- `--model <MODEL>`: The name of the model to analyze (`resnet_plus` or `patchtst`).
- `--out_file <FILENAME>`: The output file in which to store the explanation summary. Defaults to `results/hpo_analysis_ecg/explanations_summary.csv`.
Results will be saved to the file path specified by the `--out_file` argument.
This experiment compares multiple XAI frameworks (PnPXAI, Captum, OmniXAI, OpenXAI, AutoXAI) on the Wine Quality dataset using various models and explainer methods. It evaluates explanations using Faithfulness, Complexity, and their Composite score.
- **Data (Wine Quality):** The Wine Quality dataset contains ~6,497 samples (white and red wine combined) and is used for binary classification (good vs. bad quality).
- **Models (XGBoost & TabResNet):**
  - **XGBoost:** A gradient boosting classifier trained on the tabular features.
  - **TabResNet:** A ResNet-like architecture adapted for tabular data.

  Pre-trained weights for both models are included in the `data/wine_quality/` directory.
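For context only, the sketch below shows one common way to assemble the combined Wine Quality dataset and a binary label; the download URLs and the quality threshold are illustrative assumptions, and the experiment itself uses the prepared copy under `data/wine_quality/`.

```python
# Hedged illustration only: the experiment reads its prepared data from
# data/wine_quality/. The URLs and the "good wine" threshold below are
# assumptions, not taken from the repository.
import pandas as pd

base = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/"
red = pd.read_csv(base + "winequality-red.csv", sep=";")
white = pd.read_csv(base + "winequality-white.csv", sep=";")
wine = pd.concat([red, white], ignore_index=True)  # ~6,497 samples in total

X = wine.drop(columns="quality")
y = (wine["quality"] >= 6).astype(int)  # example binarization threshold
print(X.shape, y.value_counts())
```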
Due to dependency conflicts between frameworks, this experiment requires a dedicated Docker environment separate from the main setup.
Please build the dedicated Docker image using the provided `Dockerfile.wine_quality`:

```bash
# Build the Wine Quality Docker image
docker build -t pnpxai_wine_quality:latest -f Dockerfile.wine_quality .
```
- **Run the container:** Start the interactive container with GPU support and volume mounting.

  ```bash
  docker run --rm -it \
    --runtime=nvidia \
    --gpus all \
    --shm-size=8g \
    -v "$(pwd)":/root/pnpxai-experiments \
    pnpxai_wine_quality:latest
  ```

- **Run the experiment:** Inside the container, execute the analysis script:

  ```bash
  python -m experiments.scripts.analyze_wine_quality \
    --n_samples 25 \
    --seed 42 \
    --verbose \
    --data_dir data/wine_quality \
    --config_dir experiments/configs/tabular \
    --results_dir results/wine_quality
  ```
- `--n_samples <N>`: Number of samples for sampling-based explainers (LIME/SHAP). Defaults to `25`.
- `--seed <SEED>`: Random seed for reproducibility. Defaults to `42`.
- `--verbose`: Enable detailed logging.
- `--data_dir <PATH>`: Path to the data directory. Defaults to `data/wine_quality`.
- `--config_dir <PATH>`: Path to the config directory. Defaults to `experiments/configs/tabular`.
- `--results_dir <PATH>`: Path to the results directory. Defaults to `results/wine_quality`.
The experiment will generate the following in the `results/wine_quality/` directory:

- **Individual explanations:** Saved in `results/wine_quality/{model}/{framework}/{explainer}/` as `.npy` files (explanations and metric scores).
- **Summary table:** `experiment_result.md`, containing a LaTeX table comparing Faithfulness, Complexity, and Composite scores across all frameworks.
- **Execution log:** `experiment.log` (if verbose logging is enabled or configured).
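To inspect individual explanations afterwards, a minimal sketch is shown below; the `{model}/{framework}/{explainer}` path segments are placeholders, and the concrete names used here (`xgboost`, `pnpxai`, `lime`) are illustrative assumptions.

```python
# Hedged sketch: list and load saved explanation arrays for one
# (model, framework, explainer) combination. The concrete folder names
# below are illustrative assumptions.
import numpy as np
from pathlib import Path

result_dir = Path("results/wine_quality") / "xgboost" / "pnpxai" / "lime"
for npy_file in sorted(result_dir.glob("*.npy")):
    arr = np.load(npy_file, allow_pickle=True)
    print(npy_file.name, getattr(arr, "shape", None))
```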
Will be updated later.
This project's code is licensed under the MIT License.
The datasets used in the experiments are derived from existing benchmarks and are subject to their original licenses:
- ImageNet: Subject to the ImageNet Terms of Access. The dataset is restricted to non-commercial research and educational purposes only. Users must obtain access via the official website and agree to the Terms of Access.
- LiTS (Liver Tumor): CC-BY-NC-SA-4.0 License
- MIMIC-III (AKI): Subject to the PhysioNet Credentialed Health Data License 1.5.0. Due to license restrictions, we do not distribute the data. Users must obtain access via PhysioNet and agree to the data use agreement.
- ECG: Derived from the UCR Time Series Classification Archive. Free for research and educational use. (UCR Archive)
- Wine Quality: CC BY 4.0 License