Statistical imbalance impact: "univariate_singling_out_queries" function can be dominated by high-cardinality columns, impacting attack diversity and evaluation. #44

@tcjordao

Problem Description
When using the univariate_singling_out_queries function, we've observed that the generated queries can be overwhelmingly dominated by a single numerical column when that column contains far more rare values (i.e., values appearing only once) than the other columns in the synthetic dataset.

This behavior is particularly evident with the capital-gain column in the Adult dataset after it has been anonymized using Gaussian differential privacy. Because of its distribution (many unique or rare non-zero values), capital-gain has a large number of values whose count in value_counts() is 1.
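To check whether a dataset is affected, it can help to count, per column, how many values occur exactly once. The sketch below uses only the standard library and mimics a `value_counts() == 1` check; the helper name `singleton_counts` and the toy rows are made up for illustration and are not taken from the library or from the Adult dataset.

```python
from collections import Counter


def singleton_counts(rows):
    """Return, per column, how many distinct values occur exactly once.

    `rows` is a list of dicts (one dict per record). This is a hypothetical
    helper for diagnosis, not part of the library.
    """
    columns = list(rows[0].keys()) if rows else []
    result = {}
    for col in columns:
        counts = Counter(row[col] for row in rows)
        result[col] = sum(1 for n in counts.values() if n == 1)
    return result


# Toy records, invented for this example.
rows = [
    {"capital-gain": 0, "education-num": 9},
    {"capital-gain": 0, "education-num": 9},
    {"capital-gain": 2174, "education-num": 13},
    {"capital-gain": 5178, "education-num": 13},
    {"capital-gain": 7298, "education-num": 10},
    {"capital-gain": 15024, "education-num": 10},
]

per_column = singleton_counts(rows)
print(per_column)
```

A column whose count is much larger than all the others is a likely candidate for dominating the generated queries.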

Even though rng.shuffle(queries) is applied before UniqueSinglingOutQueries().check_and_append is used, the probability that check_and_append collects queries exclusively or predominantly from this single high-cardinality column remains very high. The shuffle is uniform over the pooled candidate queries, not over columns, so each column's expected share of the shuffled list is proportional to the number of candidates it contributes. In addition, the UniqueSinglingOutQueries class only adds queries that successfully single out a record. If one column provides a disproportionately large pool of such effective queries, it will quickly fill the n_queries quota.
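The effect can be reproduced with a small simulation. This is a simplified stand-in for the selection step, not the library's code: it assumes every candidate query succeeds at singling out a record, and the pool sizes (950, 20, 30) are hypothetical numbers chosen to resemble the situation described above.

```python
import random
from collections import Counter


def first_n_after_shuffle(candidates, n_queries, seed=0):
    """Shuffle the pooled candidates and keep the first n_queries.

    Assumption: every candidate is a successful singling-out query, so the
    first n_queries after the shuffle are exactly the ones that get kept.
    """
    rng = random.Random(seed)
    pool = list(candidates)
    rng.shuffle(pool)
    return pool[:n_queries]


# Hypothetical candidate pool: (column, value index) pairs.
candidates = [("capital-gain", i) for i in range(950)]
candidates += [("age", i) for i in range(20)]
candidates += [("hours-per-week", i) for i in range(30)]

selected = first_n_after_shuffle(candidates, n_queries=100)
share = Counter(column for column, _ in selected)
print(share.most_common())
```

Since capital-gain makes up 95% of the pool in this toy setup, roughly 95 of the 100 selected queries are expected to come from it, regardless of the seed.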

Impact
Because the outcome depends on how rare values are distributed across individual numerical columns, the selected queries are heavily skewed even though the selection procedure itself is randomized. In my specific case, this has led to the "main" singling-out attack evaluating only capital-gain queries, making it difficult to properly compare its effectiveness against the "baseline attack" (random attack), as the diversity of potential attack vectors from other columns is not explored. The attack results might not accurately reflect the overall privacy risk across the entire dataset's attributes.

Suggestion
To ensure a more representative and diverse set of queries for the univariate_singling_out_queries attack, it would be beneficial to implement a control mechanism that guarantees that queries from other columns are also included.

Possible approaches could include:

- Per-Column Query Limit: Implement a soft or hard limit on the maximum number of successful singling-out queries that can be collected from any single column.
- Weighted Column Selection: Introduce a mechanism to prioritize or increase the probability of selecting queries from columns that have contributed fewer successful singling-out queries so far.
- Stratified Sampling of Columns: Ensure that a diverse set of columns are sampled (or iterated through) when generating candidate queries, rather than relying solely on the natural abundance of rare values.
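As a rough illustration of the first option, a hard per-column limit could look like the sketch below. The function `select_with_column_cap`, its `max_per_column` parameter, and the placeholder query strings are hypothetical and not part of the existing API; the sketch also assumes that every candidate is already a successful singling-out query, which the real code would have to verify.

```python
import random
from collections import Counter


def select_with_column_cap(candidates, n_queries, max_per_column, seed=0):
    """Pick up to n_queries candidates, at most max_per_column per column.

    `candidates` is an iterable of (column, query) pairs. Hypothetical
    helper; assumes each candidate already singles out a record.
    """
    rng = random.Random(seed)
    pool = list(candidates)
    rng.shuffle(pool)

    taken = Counter()  # accepted queries per column so far
    selected = []
    for column, query in pool:
        if taken[column] >= max_per_column:
            continue  # this column has reached its cap
        selected.append((column, query))
        taken[column] += 1
        if len(selected) == n_queries:
            break
    return selected


# Hypothetical candidate pool with placeholder query strings.
sizes = {"capital-gain": 950, "age": 20, "hours-per-week": 30}
candidates = [
    (col, f"{col} == {v}") for col, size in sizes.items() for v in range(size)
]

selected = select_with_column_cap(candidates, n_queries=60, max_per_column=25)
per_column = Counter(column for column, _ in selected)
print(per_column)
```

One trade-off of a hard cap: if the remaining columns do not have enough candidates, fewer than n_queries queries are returned. A soft limit could relax the cap in a second pass to fill the quota.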

Implementing such a control would ensure that the evaluation of singling-out risk provides a broader assessment across various attributes, offering a more comprehensive understanding of the dataset's privacy posture.
