Add a support for an ML model and group-wise risks in the inference attack #48
Conversation
…ator.py; Compute risks for every group within the target column; Add an example notebook.
Hi @itrajanovska, thanks a lot for opening the PR! I think that both the group-by risk and the possibility to pass an arbitrary ML model are good additions to the package. I have some high-level comments about the implementation, especially on how to pass a model to the evaluator. The code currently makes quite some assumptions on what the model should look like. A cleaner approach would be to define the expected interface explicitly, e.g.:

```python
from typing import Protocol

class InferencePredictor(Protocol):
    def predict(self, X: pd.DataFrame) -> pd.Series:  # or more appropriate typing
        ...
```

Then, wrap the kNN attack into a class conforming to this protocol. Another advantage of this is that the logic of whether to sample the targets or not can also be formalized in the protocol, e.g. by adding a boolean property that tells the code if sampling is needed:

```python
from typing import Protocol

class InferencePredictor(Protocol):
    def predict(self, X: pd.DataFrame) -> pd.Series:  # or more appropriate typing
        ...

    @property
    def sample_targets(self) -> bool:
        ...
```

If you like this idea I can help you with the implementation. I won't comment for now on the code related to this change.
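A minimal sketch of what a class conforming to this protocol could look like. Everything here is illustrative, not part of anonymeter's API: `MajorityClassModel` is a hypothetical stand-in for any fitted estimator exposing `predict()` (e.g. a scikit-learn classifier), and `PredictorAdapter` is a hypothetical wrapper name.

```python
from typing import Protocol, runtime_checkable

import pandas as pd


@runtime_checkable
class InferencePredictor(Protocol):
    def predict(self, X: pd.DataFrame) -> pd.Series: ...

    @property
    def sample_targets(self) -> bool: ...


class MajorityClassModel:
    """Hypothetical stand-in for any fitted estimator with a predict() method."""

    def __init__(self, majority_label: str):
        self.majority_label = majority_label

    def predict(self, X):
        # Always guess the same label, one guess per input row.
        return [self.majority_label] * len(X)


class PredictorAdapter:
    """Wrap an arbitrary model so it conforms to InferencePredictor."""

    def __init__(self, model, sample_targets: bool = False):
        self._model = model
        self._sample_targets = sample_targets

    def predict(self, X: pd.DataFrame) -> pd.Series:
        # Keep the original index so guesses stay aligned with the targets.
        return pd.Series(self._model.predict(X), index=X.index)

    @property
    def sample_targets(self) -> bool:
        return self._sample_targets
```

Because the protocol is `runtime_checkable`, the evaluator could verify conformance with a plain `isinstance(model, InferencePredictor)` check before running the attack.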
```python
self._data_groups = self._ori[self._secret].unique().tolist()
```

```diff
-    def _attack(self, target: pd.DataFrame, naive: bool, n_jobs: int) -> int:
+    def _attack(self, target: pd.DataFrame, naive: bool, n_jobs: int, n_attacks: int) -> tuple[
```
Why do you need to pass the n_attacks parameter if it is already defined in __init__?
This is because, in case we use an ML model for the attacks and set sample_targets=False, the code will use the entire original and control dataframes to perform the attack. Since these can differ in size, I am explicitly passing the number of attacks for each individual attack. I felt that this way we keep track of the different number of attacks per attack type and clearly differentiate between them.
This could also be handled implicitly via the new model property sample_targets. Let me know what you think and I can change the logic behind the number of attacks.
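The implicit handling mentioned here could be as small as a helper that derives the per-attack count from `sample_targets`. A hypothetical sketch; the function name and parameters are illustrative, not part of anonymeter's API:

```python
# Derive how many attacks to run on a given dataframe, instead of passing
# n_attacks explicitly to each attack. Purely illustrative helper.
def effective_n_attacks(n_rows: int, configured_n_attacks: int, sample_targets: bool) -> int:
    if sample_targets:
        # Sampling: never request more targets than there are rows.
        return min(configured_n_attacks, n_rows)
    # No sampling: attack every row, so the count follows the dataframe size.
    return n_rows
```

With this, the main and control attacks would each call the helper with their own row count, and the explicit `n_attacks` parameter on `_attack` would no longer be needed.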
```python
assert len(self.guesses_success) == len(self.target)
assert (self.guesses_success.index == self.target.index).all()
```
I am not a big fan of assertions in the code. If you really need to verify these conditions you can do so, but then raise a more informative exception.
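A sketch of what the suggested replacement could look like. The attribute names mirror the assertions in the diff, but the helper itself (`check_guesses_alignment`) is hypothetical:

```python
import pandas as pd


def check_guesses_alignment(guesses_success: pd.Series, target: pd.DataFrame) -> None:
    """Raise a descriptive error instead of a bare AssertionError.

    The length check short-circuits first, so the index comparison only
    runs when the two objects have the same length.
    """
    if len(guesses_success) != len(target) or not (guesses_success.index == target.index).all():
        raise ValueError(
            f"guesses_success (length {len(guesses_success)}) is not aligned with "
            f"target (length {len(target)}): the two indices must match exactly."
        )
```

Unlike `assert`, this check also survives when Python is run with the `-O` flag, which strips assertions.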
Thanks for raising this. I have removed it now; it was used for sanity checks during development.
I also like the idea of breaking down the risk for different groups in the data, but doing it this way complicates the implementation a bit. Do you think it would be possible to do this "upstream", i.e. by defining this sort of analysis function?

```python
def grouped_inference_risk(
    ori: pd.DataFrame,
    syn: pd.DataFrame,
    aux_cols: list[str],
    secret: str,
    regression: Optional[bool] = None,
    n_attacks: int = 500,
    control: Optional[pd.DataFrame] = None,
) -> dict[str, PrivacyRisk]:
    out = {}
    for value, target_group in ori.groupby(secret):
        evaluator = InferenceEvaluator(
            ori=target_group,
            syn=syn,
            aux_cols=aux_cols,
            regression=regression,
            n_attacks=n_attacks,
            control=control,
            secret=secret,
        )
        evaluator.evaluate()
        out[value] = evaluator.risk()
    return out
```

Again, I will refrain from commenting too much on the code until these high-level points are cleared.
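To make the "upstream" pattern concrete, here is a minimal, self-contained illustration of the group-by loop using plain pandas. The risk computation is stubbed out; in the real function each group would go through an `InferenceEvaluator` instead.

```python
import pandas as pd

ori = pd.DataFrame({"age": [25, 30, 45, 50], "sex": ["F", "F", "M", "M"]})


def stub_risk(group: pd.DataFrame) -> float:
    # Stand-in for evaluator.evaluate() followed by evaluator.risk();
    # here it just reports the group's share of the data.
    return len(group) / len(ori)


# One risk value per unique value of the secret column.
out = {value: stub_risk(group) for value, group in ori.groupby("sex")}
# out == {"F": 0.5, "M": 0.5}
```

The key point of the pattern is that no group-wise logic leaks into the evaluator itself: each group is just a smaller `ori` dataframe.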
Create class wrappers for the kNN and the ML model; update the example notebook inference_custom_model_example.ipynb.
Thank you for this suggestion.
Hi @itrajanovska, thanks for incorporating some of my suggestions in the PR. I think it's going in the right direction; however, there is still some work to be done. If you want, we could set up a call and go over the PR together, which might be faster and more productive. Otherwise, just let me know and we will continue async, no problem. What I would like is to:
Hi @MatteoGiomi, sounds good. How should we connect?
Cool! You can find me on LinkedIn and we can take it from there.
As discussed afk, this PR will be closed and split into two separate ones for the two features. |
Support an ML model with an implemented predict() for inference_evaluator.py and have optional sampling (n_attacks)

Biggest changes:

In `_run_attack` we now pass an `ml_model` and `sample_attacks`. If an `ml_model` is passed, we don't need to sample, as the process won't be as exhaustive as using a kNN, so we can use the whole dataset (this is configured by setting `sample_attacks` to True or False).
Subsequently, we keep track of `n_attacks` for the main and control attacks, as the two datasets may have a different number of rows if they are not subsampled.
In `InferenceEvaluator.__init__` we do the following:
Thus `n_attacks` is still useful when we are not sampling the attacks, and otherwise we do not change the behavior of the evaluator.
Expectation: the default anonymeter configuration of these parameters in `_run_attack` should not change the original behavior, and the tests should not break.

Compute risks for every group within the target/secret column
In `risk_for_groups(self, confidence_level: float = 0.95)` we iterate over the unique groups within the secret column and compute the risks based on the filtered dataframes (`target`, `guesses`, or `target_control` and `guesses_control`), which we keep track of. E.g.
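A hedged sketch of the per-group computation described above. The names follow the description in this PR, but the function is a standalone simplification: the per-group "risk" here is a plain success rate, a stand-in for anonymeter's confidence-interval-based risk estimator.

```python
import pandas as pd


def risk_for_groups(target: pd.DataFrame, guesses: pd.Series, secret: str) -> dict:
    """Filter target/guesses per unique secret value and score each group."""
    out = {}
    for value in target[secret].unique():
        mask = target[secret] == value
        group_guesses = guesses[mask]
        group_target = target.loc[mask, secret]
        # Stand-in risk: fraction of correct guesses within the group.
        out[value] = float((group_guesses == group_target).mean())
    return out
```

The same filtering would be applied to `target_control` and `guesses_control` to get the control-attack baseline per group.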
Expectation: `risk_for_groups` should not change the original behavior, and is called/used only externally.

Add an example notebook
`notebooks/inference_custom_model_example.ipynb` is an example usage scenario of the new features.