This repository accompanies the paper “NEPAL: Climbing Toward the Peak of Re-Identification in Privacy-Preserving Record Linkage”, which introduces the Neural Pattern Learning (NEPAL) Attack.
It provides documentation and resources for reproducing the experiments and analyses presented in the paper.
The NEPAL Attack models a machine learning–based adversary that performs re-identification in Privacy-Preserving Record Linkage (PPRL) systems based on known plaintext–encoding pairs.
Unlike traditional Pattern Mining Attacks (PMAs) that rely on scheme-specific heuristics, NEPAL formulates pattern mining as a general learning problem. It uses neural networks to learn correlations between encoded records and their underlying plaintext structures, enabling large-scale, scheme-agnostic plaintext reconstruction.
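As background, the sketch below shows how one of the attacked schemes, the Bloom Filter (BF), maps a record's q-grams into a bit array. The filter length `m`, hash count `k`, and the double-hashing construction are common defaults from the PPRL literature, not necessarily the parameters used in the paper:

```python
# Hedged background sketch: a Bloom Filter (BF) encoding hashes each of a
# record's q-grams into a fixed-length bit array. m, k, and the
# double-hashing construction are illustrative defaults only.
import hashlib

def qgrams(s, q=2):
    s = s.lower().replace(" ", "")
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def bloom_encode(name, m=1024, k=20):
    bits = [0] * m
    for g in qgrams(name):
        h1 = int(hashlib.md5(g.encode()).hexdigest(), 16)
        h2 = int(hashlib.sha1(g.encode()).hexdigest(), 16)
        for i in range(k):               # double hashing: h1 + i * h2
            bits[(h1 + i * h2) % m] = 1
    return bits

print(sum(bloom_encode("peter")))  # number of set bits
```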
The attack consists of two major stages:
- Pattern Mining – A neural model learns mappings between encodings and their constituent q-grams (substrings of the original identifiers). This is framed as a multi-label classification task.
- Plaintext Reconstruction – The predicted q-grams are assembled into complete identifiers using a graph-based reconstruction algorithm.
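To make the multi-label framing of the first stage concrete, here is a toy sketch (the names and the q-gram vocabulary are made-up examples): each record is paired with a binary label vector marking which vocabulary q-grams occur in its plaintext, and a network (e.g., an MLP with sigmoid outputs and binary cross-entropy loss) would be trained to predict this vector from the encoding.

```python
# Toy sketch of the stage-1 framing: binary multi-label targets over a
# fixed q-gram vocabulary. Names and vocabulary are illustrative only.
def qgrams(s, q=2):
    return {s[i:i + q] for i in range(len(s) - q + 1)}

names = ["peter", "petra", "maria"]
vocab = sorted(set().union(*(qgrams(n) for n in names)))  # label space

def label_vector(name):
    present = qgrams(name)
    return [int(g in present) for g in vocab]

for n in names:
    print(n, label_vector(n))
```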
For a detailed description of the attacker model, theoretical background, and evaluation results, see the paper.
The simplest way to reproduce the NEPAL pipeline is to run the implementation repository inside Docker.
The following setup reproduces the environment used during paper preparation:
```bash
git clone <nepal-repository>
cd <nepal-repository>
git submodule update --init --recursive --remote
docker build -t nepal .
docker run --gpus all -it -v $(pwd):/usr/app nepal bash
```

Note: GPU access is optional but strongly recommended for hyperparameter optimization. The repository will be mounted inside the container at `/usr/app`.
A default configuration is provided for the NEPAL attack. Once inside the container, execute:
```bash
python3 main.py --config nepal_config.json
```

This command launches the complete NEPAL pipeline, including:
- data preprocessing,
- neural model training, and
- plaintext reconstruction.
Results are written to the experiment_results directory. See docs/parameters.md for a detailed explanation of configuration options and schema.
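The snippet below is purely illustrative: the keys shown are assumptions, not the repository's actual schema (see docs/parameters.md for that). It only demonstrates how one might write a custom configuration file.

```python
# Hypothetical configuration sketch: these keys are NOT the repository's
# actual schema; consult docs/parameters.md for the real options.
import json

config = {
    "dataset": "data/example.tsv",      # hypothetical path
    "encoding": "bloom_filter",         # hypothetical key/value
    "q": 2,                             # q-gram length (illustrative)
    "epochs": 50,                       # illustrative training budget
    "output_dir": "experiment_results",
}
with open("nepal_config.json", "w") as f:
    json.dump(config, f, indent=2)
```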
The code expects a tab-separated file with one record per row. The first row must be a header specifying the column names. Internally, the column values are concatenated in column order and normalized (converted to lowercase; whitespace and missing values removed). The last column must contain a unique ID.
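A minimal sketch of a conforming input file and of the normalization described above (column names and values are made up):

```python
# Write a tiny tab-separated example file: header row first, unique ID
# in the last column.
rows = [
    ("first_name", "last_name", "id"),  # header row
    ("Peter", "Miller", "0"),
    ("Petra", "Maier", "1"),
]
with open("example.tsv", "w") as f:
    for r in rows:
        f.write("\t".join(r) + "\n")

# All columns except the trailing unique ID are concatenated in column
# order, lowercased, and stripped of whitespace:
record = "".join(rows[1][:-1]).lower().replace(" ", "")
print(record)  # petermiller
```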
If you have data in .csv, .xls or .xlsx format, you may run `python preprocessing.py` for convenient conversion.
The script will guide you through the process.
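As a hedged, non-interactive alternative to the guided script, a one-off conversion could look like the following, assuming pandas is installed (file names are illustrative):

```python
# Convert .csv/.xlsx data to the expected tab-separated layout.
import pandas as pd

df = pd.read_csv("mydata.csv")     # or pd.read_excel("mydata.xlsx")
df["id"] = range(len(df))          # the unique ID must be the last column
df.to_csv("mydata.tsv", sep="\t", index=False)
```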
The data directory of this repository already provides datasets that can be used directly.
To run multiple experiments or reproduce the experiments from the paper, use the experiment script:
```bash
python3 experiment_setup.py
```

This script automatically runs multiple configurations to reproduce the results reported in the paper.
The analysis notebook analysis.ipynb reproduces the figures reported in the paper. First run extract_nepal_results.py, which consumes the results produced in experiment_results; then open the notebook and ensure that the output file from extract_nepal_results.py has been generated correctly.
NEPAL reframes cryptanalysis of similarity-preserving encodings as a supervised learning task, enabling the model to learn directly from encoding–plaintext pairs and generalize across multiple encoding schemes.
The attack proceeds in two stages: (1) Pattern Mining, which uses neural networks to predict constituent q-grams from encoded data, and (2) Plaintext Reconstruction, which assembles the predicted fragments into complete identifiers.
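To illustrate the second stage, here is a minimal greedy sketch under the assumption of perfectly predicted, non-repeating q-grams; the paper's actual algorithm operates on a graph and is more robust than this simplification.

```python
# Simplified stand-in for the graph-based reconstruction: chain q-grams
# whose (q-1)-character suffix and prefix overlap, starting from a
# q-gram that no other q-gram leads into.
def reconstruct(grams, q=2):
    grams = set(grams)
    suffixes = {g[1:] for g in grams}                 # (q-1)-char suffixes
    starts = sorted(g for g in grams if g[:q - 1] not in suffixes)
    word = starts[0] if starts else sorted(grams)[0]
    grams.discard(word)
    while True:
        tail = word[-(q - 1):]
        nxt = next((g for g in sorted(grams) if g[:q - 1] == tail), None)
        if nxt is None:
            return word
        word += nxt[q - 1:]                           # append new character
        grams.discard(nxt)

print(reconstruct({"ma", "ar", "ri", "ia"}))  # maria
```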
Experiments were conducted on eight datasets (including FakeName, Euro Person, and Titanic) across three encoding schemes: Bloom Filters (BF), Two-Step Hashing (TSH), and Tabulation MinHash (TMH).
- Achieved Dice coefficients up to 0.997, indicating near-perfect q-gram reconstruction (the metric is sketched below).
- Re-identified up to 33.05% of encoded records exactly.
- Demonstrated that TSH and BF are the most vulnerable encoding schemes, while TMH is more resilient.
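The Dice coefficient reported above compares the predicted q-gram set with the true one, Dice = 2|A ∩ B| / (|A| + |B|); the sets in this sketch are illustrative, not results from the paper.

```python
# Dice coefficient between predicted and true q-gram sets.
def dice(a, b):
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b))

true_qgrams = {"pe", "et", "te", "er"}   # q-grams of "peter"
pred_qgrams = {"pe", "et", "te", "ra"}   # one wrong prediction
print(round(dice(true_qgrams, pred_qgrams), 3))  # 0.75
```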
Additional information about noisy datasets, parameters, and reproduction can be found in the docs directory.
If you use this repository or reproduce results from the NEPAL paper, please cite:
(TBD)
For questions or clarifications regarding the implementation or replication of experiments, please refer to the code repository or contact the paper authors.
This code is licensed under GPLv3.