
Validating quality of synthesized multi-column correlated data #293

@sebastian

Description


Following our meeting I thought I'd document a few ways one could validate data quality:

For single- or multi-column correlated generated datasets

Compare with the unanonymized equivalent of the Aircloaked data source. This can be done by subdividing both datasets and comparing what fraction of the values falls into each subdivision.

Example:

  • The dataset is latitudes, longitudes, salary, and house colour
  • You subdivide the datasets (both the one generated with explorer and the real data) into latitude buckets of width X, longitude buckets of width Y, salary buckets of width Z, and colour, and check that each such subdivision holds the same fraction of the overall values (within a delta) in both datasets (a sketch of this check follows below)
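A minimal sketch of that check in Python (the function names, bucket width, and the delta of 0.02 are placeholders I've picked for illustration, not anything from the project; a categorical column such as colour would group on the raw value instead of a numerical bucket, and the correlated case would bucket on column combinations jointly):

```python
import numpy as np

def bucket_fractions(values, width):
    """Map each value to a fixed-width bucket and return the fraction
    of all values that falls into each bucket."""
    buckets = np.floor(np.asarray(values, dtype=float) / width).astype(int)
    ids, counts = np.unique(buckets, return_counts=True)
    return dict(zip(ids, counts / len(buckets)))

def fractions_match(real, synthetic, width, delta=0.02):
    """Check that every bucket holds the same fraction of the overall
    values (within `delta`) in the real and synthesized datasets."""
    real_frac = bucket_fractions(real, width)
    synth_frac = bucket_fractions(synthetic, width)
    return all(
        abs(real_frac.get(b, 0.0) - synth_frac.get(b, 0.0)) <= delta
        for b in set(real_frac) | set(synth_frac)
    )
```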

Visual inspection of generated geo data

It turned out to be very useful to visually inspect the data quality of generated datasets. You can easily fool yourself into believing the data quality is appropriate if you only use some arbitrary abstract numerical metric. Fooling your eyes is altogether more difficult.

Generating a high-quality two-dimensional latitude/longitude dataset is quite trivial and you are likely to get very good results. Three-dimensional ones aren't too bad either, but in my experience, once you add more dimensions, correlations quickly start to suffer. This will be immediately obvious when visually inspecting geo-location data. I therefore encourage you to set up a geo rendering pipeline that renders locations as dots on a map. It will make correlation artifacts hard to ignore. The NYC taxi database is a good candidate for this (for example, you wouldn't expect a vertical line of dots in the middle of the water...)
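Even a plain scatter plot exposes most artifacts; here is a minimal sketch, assuming the datasets expose latitude/longitude columns, and using matplotlib rather than any particular mapping stack:

```python
import matplotlib.pyplot as plt

def plot_locations(real, synthetic):
    """Render real and synthesized locations side by side as dots.

    `real` and `synthetic` are assumed to be (latitudes, longitudes)
    tuples of equal-length sequences. Artifacts such as grid patterns
    or points over water tend to jump out immediately.
    """
    fig, axes = plt.subplots(1, 2, figsize=(12, 6), sharex=True, sharey=True)
    datasets = [("Real", real), ("Synthesized", synthetic)]
    for ax, (title, (lats, lons)) in zip(axes, datasets):
        ax.scatter(lons, lats, s=1, alpha=0.3)
        ax.set_title(title)
        ax.set_xlabel("longitude")
        ax.set_ylabel("latitude")
    plt.show()
```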

Comparing distribution characteristics

The Accord library you have in place for determining characteristics of the numerical distributions could be used to generate characteristics of both the dataset generated using explorer and the raw data. These parameters could then be compared, allowing for a certain delta. This does require a pretty good understanding of what these parameters mean and how much deviation can be allowed.
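The Accord specifics aren't shown here, so as a stand-in, here is a sketch of the same idea using scipy.stats in Python: compute moment-based characteristics of both columns and compare them within per-parameter deltas (the tolerance values below are placeholders; choosing them sensibly is exactly the "understanding what these parameters mean" part):

```python
import numpy as np
from scipy import stats

# Per-parameter tolerances are placeholders, not validated thresholds.
DELTAS = {"mean": 0.1, "std": 0.1, "skewness": 0.25, "kurtosis": 0.5}

def characteristics(values):
    """Summarize a numerical column by its first four moments."""
    values = np.asarray(values, dtype=float)
    return {
        "mean": values.mean(),
        "std": values.std(),
        "skewness": stats.skew(values),
        "kurtosis": stats.kurtosis(values),
    }

def distributions_match(raw, generated, deltas=DELTAS):
    """Compare the characteristics of the raw and generated columns,
    allowing a per-parameter delta."""
    a, b = characteristics(raw), characteristics(generated)
    return all(abs(a[k] - b[k]) <= deltas[k] for k in deltas)
```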
