
Conversation

@itrajanovska
Contributor

Support an ML model with an implemented predict() in inference_evaluator.py, and make the attack sampling (n_attacks) optional

Biggest changes:
In _run_attack we now pass an ml_model and sample_attacks.
If an ml_model is passed, we don't need to sample, since the process is not as exhaustive as with a kNN, so we can use the whole dataset (this is configured by setting sample_attacks to True or False).
Consequently, we keep track of n_attacks separately for the main and control attacks, since the two datasets may have a different number of rows when they are not subsampled.
In InferenceEvaluator.__init__ we do the following:

        if not self._sample_attacks:
            self._n_attacks_ori = self._ori.shape[0]
            self._n_attacks_baseline = min(self._syn.shape[0], self._n_attacks_ori)
            self._n_attacks_control = self._control.shape[0]
        else:
            self._n_attacks_ori = self._n_attacks
            self._n_attacks_baseline = self._n_attacks
            self._n_attacks_control = self._n_attacks

Thus the n_attacks bookkeeping stays meaningful even when we do not sample the attacks, and otherwise (when sampling) the behavior of the evaluator is unchanged.

Expectation: with the default values of these parameters, _run_attack keeps the original anonymeter behavior and the existing tests still pass.
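For illustration, a minimal usage sketch under the assumptions of this PR: ml_model and sample_attacks are the new parameters proposed here, and MajorityModel is a hypothetical stand-in for any model exposing predict():

import pandas as pd
from anonymeter.evaluators import InferenceEvaluator

# Toy data standing in for real original / synthetic / control splits.
ori = pd.DataFrame({"age": [25, 40, 31, 58], "income": ["<=50K", ">50K", "<=50K", ">50K"]})
syn = pd.DataFrame({"age": [27, 44, 33, 60], "income": ["<=50K", ">50K", "<=50K", ">50K"]})
control = pd.DataFrame({"age": [22, 50], "income": ["<=50K", ">50K"]})

class MajorityModel:
    """Hypothetical attack model exposing the predict() interface assumed by this PR."""
    def predict(self, X: pd.DataFrame) -> pd.Series:
        # Always guess the most common class; a real model would be trained on syn.
        return pd.Series(["<=50K"] * len(X), index=X.index)

evaluator = InferenceEvaluator(
    ori=ori,
    syn=syn,
    control=control,
    aux_cols=["age"],
    secret="income",
    ml_model=MajorityModel(),  # new parameter proposed in this PR
    sample_attacks=False,      # new parameter proposed in this PR: attack all rows
)
evaluator.evaluate()
print(evaluator.risk())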

Compute risks for every group within the target/secret column

In risk_for_groups(self, confidence_level: float = 0.95) we iterate over the unique groups within the secret column and compute the risks from the filtered dataframes (target and guesses, or target_control and guesses_control), which we keep track of.
E.g.:

for group in self._data_groups:
    # Get the targets for the current group
    target = self.target[self.target[self._secret] == group]

    # Get the guesses for the current group
    guess = self.guesses_success.loc[target.index]

Expectation: risk_for_groups does not change the original behavior and is only called/used externally.
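To make the grouping step concrete, here is a minimal, self-contained sketch; group_success_rates is a hypothetical helper, and it returns plain success fractions where the real risk_for_groups would return risks with confidence intervals:

import pandas as pd

def group_success_rates(target: pd.DataFrame, guesses_success: pd.Series, secret: str) -> dict:
    """Per-group attack success fractions (sketch of the grouping step only)."""
    rates = {}
    for group in target[secret].unique():
        # Rows whose secret value belongs to the current group.
        idx = target.index[target[secret] == group]
        # Fraction of successful guesses for those rows.
        rates[group] = guesses_success.loc[idx].mean()
    return rates

# Toy example: three targets in two groups, with per-target success flags.
target = pd.DataFrame({"income": ["<=50K", ">50K", "<=50K"]})
guesses_success = pd.Series([True, False, True], index=target.index)
print(group_success_rates(target, guesses_success, "income"))  # {'<=50K': 1.0, '>50K': 0.0}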

Add an example notebook.

notebooks/inference_custom_model_example.ipynb shows an example usage scenario of the new features.

@CLAassistant

CLAassistant commented Nov 12, 2025

CLA assistant check
All committers have signed the CLA.

@MatteoGiomi
Member

MatteoGiomi commented Nov 18, 2025

Hi @itrajanovska, thanks a lot for opening the PR! I think that both the group-by risk and the possibility to pass an arbitrary ML model are good additions to the package.

I have some high-level comments about the implementation, especially on how to pass a model to the InferenceEvaluator.

The code makes quite some assumptions on what ml_model should do, so it would be better to enforce them by using a protocol, something like:

from typing import Protocol

import pandas as pd

class InferencePredictor(Protocol):
    def predict(self, X: pd.DataFrame) -> pd.Series:  # or more appropriate typing
        ...

Then, wrap the kNN attack into a class conforming to this protocol. Another advantage of this is that the logic of whether to sample the targets or not can also be formalized in the protocol, e.g. by adding a boolean property that tells the code if sampling is needed:

from typing import Protocol

import pandas as pd

class InferencePredictor(Protocol):
    def predict(self, X: pd.DataFrame) -> pd.Series:  # or more appropriate typing
        ...

    @property
    def sample_targets(self) -> bool:
        ...

If you like this idea I can help you with the implementation. I won't comment for now on the code related to this change.
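For illustration, a minimal sketch of a concrete predictor conforming to this protocol; the wrapper class, its fit method, and the choice of RandomForestClassifier are hypothetical, not part of the PR:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

class SklearnInferencePredictor:
    """Hypothetical wrapper making a scikit-learn model satisfy InferencePredictor."""

    def __init__(self, aux_cols: list[str], secret: str):
        self._aux_cols = aux_cols
        self._secret = secret
        self._model = RandomForestClassifier()

    def fit(self, syn: pd.DataFrame) -> "SklearnInferencePredictor":
        # Train the attack model on the synthetic data.
        self._model.fit(syn[self._aux_cols], syn[self._secret])
        return self

    def predict(self, X: pd.DataFrame) -> pd.Series:
        preds = self._model.predict(X[self._aux_cols])
        return pd.Series(preds, index=X.index)

    @property
    def sample_targets(self) -> bool:
        # A trained model can score all targets cheaply, so no subsampling is needed.
        return False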

self._data_groups = self._ori[self._secret].unique().tolist()

def _attack(self, target: pd.DataFrame, naive: bool, n_jobs: int) -> int:
def _attack(self, target: pd.DataFrame, naive: bool, n_jobs: int, n_attacks: int) -> tuple[
Member

Why do you need to pass the n_attacks parameter if it is already defined in the __init__?

Contributor Author

@itrajanovska itrajanovska Nov 21, 2025

This is because when we use an ML model for the attacks and set sample_targets=False,
the code uses the entire original and control dataframes to perform the attack. Since these can differ in size, I am explicitly passing the number of attacks here for each individual attack. I felt that this way we keep track of the different number of attacks per attack type and clearly differentiate between them.
This could also be handled implicitly given the new model property sample_targets. Let me know what you think and I can change the logic behind the number of attacks.
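For reference, a sketch of the implicit alternative mentioned above, assuming the evaluator holds the predictor and reusing the attribute names from the PR description:

if self._predictor.sample_targets:
    # Subsample: every attack type uses the user-provided n_attacks.
    self._n_attacks_ori = self._n_attacks
    self._n_attacks_baseline = self._n_attacks
    self._n_attacks_control = self._n_attacks
else:
    # Attack all rows: each attack type gets the size of its own dataframe.
    self._n_attacks_ori = self._ori.shape[0]
    self._n_attacks_baseline = min(self._syn.shape[0], self._n_attacks_ori)
    self._n_attacks_control = self._control.shape[0]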

Comment on lines 338 to 339
assert len(self.guesses_success) == len(self.target)
assert (self.guesses_success.index == self.target.index).all()
Member

I am not a big fan of assertions in the code. If you really need to verify these conditions you can do so, but then raise a more informative exception.
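For example, a possible replacement along those lines (a sketch, not the PR's code):

if len(self.guesses_success) != len(self.target) or not (
    self.guesses_success.index == self.target.index
).all():
    raise RuntimeError(
        "Guesses and targets are not aligned: "
        f"{len(self.guesses_success)} guesses vs {len(self.target)} targets."
    )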

Contributor Author

Thanks for raising this. I have removed it now; it was only there as a sanity check during development.

@MatteoGiomi
Member

I also like the idea of breaking down the risk for different groups in the data, but doing it complicates the implementation a bit. Do you think it would be possible to do this "upstream", i.e. by defining this sort of analysis function?

from typing import Optional

import pandas as pd

# Import paths assumed from the anonymeter package layout.
from anonymeter.evaluators import InferenceEvaluator
from anonymeter.stats.confidence import PrivacyRisk


def grouped_inference_risk(
    ori: pd.DataFrame,
    syn: pd.DataFrame,
    aux_cols: list[str],
    secret: str,
    regression: Optional[bool] = None,
    n_attacks: int = 500,
    control: Optional[pd.DataFrame] = None,
) -> dict[str, PrivacyRisk]:
    out = {}
    for value, target_group in ori.groupby(secret):
        evaluator = InferenceEvaluator(
            ori=target_group,
            syn=syn,
            aux_cols=aux_cols,
            regression=regression,
            n_attacks=n_attacks,
            control=control,
            secret=secret,
        )
        evaluator.evaluate()

        out[value] = evaluator.risk()
    return out

Again, I will refrain from commenting too much on the code until these high-level points are cleared.

Create class wrappers for a KNN and the ml model;
Update example inference_custom_model_example.ipynb.
@itrajanovska
Contributor Author

> I also like the idea of breaking down the risk for different groups in the data, but doing it complicates the implementation a bit. Do you think it would be possible to do this "upstream", i.e. by defining this sort of analysis function? [...]

Thank you for this suggestion.
I didn't consider this approach initially, to avoid the overhead of calling attack and evaluate in a loop.
I tried your suggested implementation, but in the adult dataset the feature fnlwgt has 24633 unique values; I measured the execution and we would need 547 minutes for this feature alone.
With the current approach, by caching the results, the code runs in 0.85 minutes for this example column.

@itrajanovska
Contributor Author

> Hi @itrajanovska, thanks a lot for opening the PR! [...] The code makes quite some assumptions on what ml_model should do, so it would be better to enforce them by using a protocol [...] If you like this idea I can help you with the implementation.

Thank you for this suggestion.
I followed the example you proposed and I agree, the code is safer and cleaner this way.

@MatteoGiomi
Member

Hi @itrajanovska, thanks for incorporating some of my suggestions in the PR. I think it's going in the right direction; however, there is still some work to be done. If you want, we could set up a call and go over the PR together; it might be faster and more productive. Otherwise, just let me know and we will continue async, no problem.

What I would like is to:

  • Figure out a way to abstract the sample/not sample logic to the predictor so that the Evaluator does not need to worry about this at all.
  • Find a cleaner way to cache analysis results for the grouped-risk analysis, one that is not buried so deep in the evaluator code (to be potentially extended to other evaluators as well).
  • Add tests :-)

@itrajanovska
Contributor Author

itrajanovska commented Nov 24, 2025

> Hi @itrajanovska, thanks for incorporating some of my suggestions in the PR [...] If you want, we could set up a call and go over the PR together [...]

Hi @MatteoGiomi, sounds good.
Let's arrange a call whenever it suits you best and go over the changes together.

How should we connect?

@MatteoGiomi
Member

> Hi @MatteoGiomi, sounds good. Let's arrange a call whenever it suits you best and go over the changes together.
> How should we connect?

Cool! You can find me on LinkedIn and we can take it from there.

@MatteoGiomi
Member

As discussed offline, this PR will be closed and split into two separate ones, one for each of the two features.
