
Conversation

@itrajanovska
Contributor

Support an ML model with an implemented predict() in inference_evaluator.py, and make the attack sampling (n_attacks) optional

Biggest changes:
In _run_attack we now pass an ml_model and sample_attacks.
If an ml_model is passed, we don't need to sample, since the process is not as exhaustive as with a kNN, so we can use the whole dataset (this is configured by setting sample_attacks to True or False).
Consequently, we keep track of n_attacks separately for the main and control attacks, since the two datasets may have a different number of rows when they are not subsampled.
In InferenceEvaluator.__init__ we do the following:

        if not self._sample_attacks:
            self._n_attacks_ori = self._ori.shape[0]
            self._n_attacks_baseline = min(self._syn.shape[0], self._n_attacks_ori)
            self._n_attacks_control = self._control.shape[0]
        else:
            self._n_attacks_ori = self._n_attacks
            self._n_attacks_baseline = self._n_attacks
            self._n_attacks_control = self._n_attacks

Thus the n_attacks bookkeeping stays meaningful even when we do not sample the attacks, and otherwise (when sampling) the behavior of the evaluator is unchanged.

Expectation: with the default values of these parameters, _run_attack keeps the original anonymeter behavior and the existing tests still pass.
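For illustration, a minimal usage sketch under the assumptions of this PR: ml_model and sample_attacks are the new parameters proposed here, and MajorityModel is a hypothetical stand-in for any model exposing predict():

import pandas as pd
from anonymeter.evaluators import InferenceEvaluator

# Toy data standing in for real original / synthetic / control splits.
ori = pd.DataFrame({"age": [25, 40, 31, 58], "income": ["<=50K", ">50K", "<=50K", ">50K"]})
syn = pd.DataFrame({"age": [27, 44, 33, 60], "income": ["<=50K", ">50K", "<=50K", ">50K"]})
control = pd.DataFrame({"age": [22, 50], "income": ["<=50K", ">50K"]})

class MajorityModel:
    """Hypothetical attack model exposing the predict() interface assumed by this PR."""
    def predict(self, X: pd.DataFrame) -> pd.Series:
        # Always guess the most common class; a real model would be trained on syn.
        return pd.Series(["<=50K"] * len(X), index=X.index)

evaluator = InferenceEvaluator(
    ori=ori,
    syn=syn,
    control=control,
    aux_cols=["age"],
    secret="income",
    ml_model=MajorityModel(),  # new parameter proposed in this PR
    sample_attacks=False,      # new parameter proposed in this PR: attack all rows
)
evaluator.evaluate()
print(evaluator.risk())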

Compute risks for every group within the target/secret column

In risk_for_groups(self, confidence_level: float = 0.95) we iterate over the unique groups within the secret column and compute the risks from the filtered dataframes (target and guesses, or target_control and guesses_control), which we keep track of.
E.g.:

for group in self._data_groups:
    # Get the targets for the current group
    target = self.target[self.target[self._secret] == group]

    # Get the guesses for the current group
    guess = self.guesses_success.loc[target.index]

Expectation: risk_for_groups does not change the original behavior and is only called/used externally.
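To make the grouping step concrete, here is a minimal, self-contained sketch; group_success_rates is a hypothetical helper, and it returns plain success fractions where the real risk_for_groups would return risks with confidence intervals:

import pandas as pd

def group_success_rates(target: pd.DataFrame, guesses_success: pd.Series, secret: str) -> dict:
    """Per-group attack success fractions (sketch of the grouping step only)."""
    rates = {}
    for group in target[secret].unique():
        # Rows whose secret value belongs to the current group.
        idx = target.index[target[secret] == group]
        # Fraction of successful guesses for those rows.
        rates[group] = guesses_success.loc[idx].mean()
    return rates

# Toy example: three targets in two groups, with per-target success flags.
target = pd.DataFrame({"income": ["<=50K", ">50K", "<=50K"]})
guesses_success = pd.Series([True, False, True], index=target.index)
print(group_success_rates(target, guesses_success, "income"))  # {'<=50K': 1.0, '>50K': 0.0}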

Add an example notebook.

notebooks/inference_custom_model_example.ipynb shows an example usage scenario of the new features.

@CLAassistant

CLAassistant commented Nov 12, 2025

CLA assistant check
All committers have signed the CLA.

@MatteoGiomi
Member

MatteoGiomi commented Nov 18, 2025

Hi @itrajanovska, thanks a lot for opening the PR! I think that both the group-by risk and the possibility to pass an arbitrary ML model are good additions to the package.

I have some high-level comments about the implementation, especially on how to pass a model to the InferenceEvaluator.

The code makes quite some assumptions on what ml_model should do, so it would be better to enforce them by using a protocol, something like:

from typing import Protocol

import pandas as pd

class InferencePredictor(Protocol):
    def predict(self, X: pd.DataFrame) -> pd.Series:  # or more appropriate typing
        ...

Then, wrap the kNN attack into a class conforming to this protocol. Another advantage of this is that the logic of whether to sample the targets or not can also be formalized in the protocol, e.g. by adding a boolean property that tells the code if sampling is needed:

from typing import Protocol

import pandas as pd

class InferencePredictor(Protocol):
    def predict(self, X: pd.DataFrame) -> pd.Series:  # or more appropriate typing
        ...

    @property
    def sample_targets(self) -> bool:
        ...

If you like this idea I can help you with the implementation. I won't comment for now on the code related to this change.
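For illustration, a minimal sketch of a concrete predictor conforming to this protocol; the wrapper class, its fit method, and the choice of RandomForestClassifier are hypothetical, not part of the PR:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

class SklearnInferencePredictor:
    """Hypothetical wrapper making a scikit-learn model satisfy InferencePredictor."""

    def __init__(self, aux_cols: list[str], secret: str):
        self._aux_cols = aux_cols
        self._secret = secret
        self._model = RandomForestClassifier()

    def fit(self, syn: pd.DataFrame) -> "SklearnInferencePredictor":
        # Train the attack model on the synthetic data.
        self._model.fit(syn[self._aux_cols], syn[self._secret])
        return self

    def predict(self, X: pd.DataFrame) -> pd.Series:
        preds = self._model.predict(X[self._aux_cols])
        return pd.Series(preds, index=X.index)

    @property
    def sample_targets(self) -> bool:
        # A trained model can score all targets cheaply, so no subsampling is needed.
        return False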

self._data_groups = self._ori[self._secret].unique().tolist()

def _attack(self, target: pd.DataFrame, naive: bool, n_jobs: int) -> int:
def _attack(self, target: pd.DataFrame, naive: bool, n_jobs: int, n_attacks: int) -> tuple[
Member

Why do you need to pass the n_attacks parameter if it is already defined in the __init__?

Contributor Author

@itrajanovska itrajanovska Nov 21, 2025

This is because when we use an ML model for the attacks and set sample_targets=False,
the code uses the entire original and control dataframes to perform the attack. Since these can differ in size, I am explicitly passing the number of attacks here for each individual attack. I felt that this way we keep track of the different number of attacks per attack type and clearly differentiate between them.
This could also be handled implicitly given the new model property sample_targets. Let me know what you think and I can change the logic behind the number of attacks.
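For reference, a sketch of the implicit alternative mentioned above, assuming the evaluator holds the predictor and reusing the attribute names from the PR description:

if self._predictor.sample_targets:
    # Subsample: every attack type uses the user-provided n_attacks.
    self._n_attacks_ori = self._n_attacks
    self._n_attacks_baseline = self._n_attacks
    self._n_attacks_control = self._n_attacks
else:
    # Attack all rows: each attack type gets the size of its own dataframe.
    self._n_attacks_ori = self._ori.shape[0]
    self._n_attacks_baseline = min(self._syn.shape[0], self._n_attacks_ori)
    self._n_attacks_control = self._control.shape[0]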

Comment on lines 338 to 339
assert len(self.guesses_success) == len(self.target)
assert (self.guesses_success.index == self.target.index).all()
Member

I am not a big fan of assertions in the code. If you really need to verify these conditions you can do so, but then raise a more informative exception.
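For example, a possible replacement along those lines (a sketch, not the PR's code):

if len(self.guesses_success) != len(self.target) or not (
    self.guesses_success.index == self.target.index
).all():
    raise RuntimeError(
        "Guesses and targets are not aligned: "
        f"{len(self.guesses_success)} guesses vs {len(self.target)} targets."
    )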

Contributor Author

Thanks for raising this. I have removed it now; it was only there as a sanity check during development.

@MatteoGiomi
Member

I also like the idea of breaking down the risk for different groups in the data, but doing it complicates the implementation a bit. Do you think it would be possible to do this "upstream", i.e. by defining this sort of analysis function?

from typing import Optional

import pandas as pd

# Import paths assumed from the anonymeter package layout.
from anonymeter.evaluators import InferenceEvaluator
from anonymeter.stats.confidence import PrivacyRisk


def grouped_inference_risk(
    ori: pd.DataFrame,
    syn: pd.DataFrame,
    aux_cols: list[str],
    secret: str,
    regression: Optional[bool] = None,
    n_attacks: int = 500,
    control: Optional[pd.DataFrame] = None,
) -> dict[str, PrivacyRisk]:
    out = {}
    for value, target_group in ori.groupby(secret):
        evaluator = InferenceEvaluator(
            ori=target_group,
            syn=syn,
            aux_cols=aux_cols,
            regression=regression,
            n_attacks=n_attacks,
            control=control,
            secret=secret,
        )
        evaluator.evaluate()

        out[value] = evaluator.risk()
    return out

Again, I will refrain from commenting too much on the code until these high-level points are cleared.

Create class wrappers for a KNN and the ml model;
Update example inference_custom_model_example.ipynb.
@itrajanovska
Contributor Author

> I also like the idea of breaking down the risk for different groups in the data, but doing it complicates the implementation a bit. Do you think it would be possible to do this "upstream", i.e. by defining this sort of analysis function? [...]

Thank you for this suggestion.
I didn't consider this approach initially, to avoid the overhead of calling attack and evaluate in a loop.
I tried your suggested implementation, but in the adult dataset the feature fnlwgt has 24633 unique values; I measured the execution and we would need 547 minutes for this feature alone.
With the current approach, by caching the results, the code runs in 0.85 minutes for this example column.

@itrajanovska
Contributor Author

> Hi @itrajanovska, thanks a lot for opening the PR! [...] The code makes quite some assumptions on what ml_model should do, so it would be better to enforce them by using a protocol [...] If you like this idea I can help you with the implementation.

Thank you for this suggestion.
I followed the example you proposed and I agree, the code is safer and cleaner this way.

@MatteoGiomi
Member

Hi @itrajanovska, thanks for incorporating some of my suggestions in the PR. I think it's going in the right direction; however, there is still some work to be done. If you want, we could set up a call and go over the PR together; it might be faster and more productive. Otherwise, just let me know and we will continue async, no problem.

What I would like is to:

  • Figure out a way to abstract the sample/not sample logic to the predictor so that the Evaluator does not need to worry about this at all.
  • Find a cleaner way to cache analysis results for the grouped-risk analysis, one that is not buried so deep in the evaluator code (to be potentially extended to other evaluators as well).
  • Add tests :-)

@itrajanovska
Contributor Author

itrajanovska commented Nov 24, 2025

> Hi @itrajanovska, thanks for incorporating some of my suggestions in the PR [...] If you want, we could set up a call and go over the PR together [...]

Hi @MatteoGiomi, sounds good.
Let's arrange a call whenever it suits you best and go over the changes together.

How should we connect?

@MatteoGiomi
Member

> Hi @MatteoGiomi, sounds good. Let's arrange a call whenever it suits you best and go over the changes together.
> How should we connect?

Cool! You can find me on LinkedIn and we can take it from there.

@MatteoGiomi
Member

As discussed offline, this PR will be closed and split into two separate ones, one for each of the two features.
