
Validating quality of synthesized multi-column correlated data #293

@sebastian

Description


Following our meeting I thought I'd document a few ways one could validate data quality:

For single- or multi-column correlated generated datasets

Compare with the unanonymized equivalent of the Aircloaked data source. This can be done by subdividing both datasets and comparing what fraction of the values falls into each subdivision.

Example:

  • The dataset is latitudes, longitudes, salary, and house colour
  • You subdivide the datasets (both the one generated with explorer and the real data) into latitude buckets of width X, longitude buckets of width Y, salary buckets of width Z, and colour, and check that each such subdivision holds the same fraction of the overall values (within a delta) in both datasets (a sketch of this check follows below)
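A minimal sketch of that check in Python (the function names, bucket width, and the delta of 0.02 are placeholders I've picked for illustration, not anything from the project; a categorical column such as colour would group on the raw value instead of a numerical bucket, and the correlated case would bucket on column combinations jointly):

```python
import numpy as np

def bucket_fractions(values, width):
    """Map each value to a fixed-width bucket and return the fraction
    of all values that falls into each bucket."""
    buckets = np.floor(np.asarray(values, dtype=float) / width).astype(int)
    ids, counts = np.unique(buckets, return_counts=True)
    return dict(zip(ids, counts / len(buckets)))

def fractions_match(real, synthetic, width, delta=0.02):
    """Check that every bucket holds the same fraction of the overall
    values (within `delta`) in the real and synthesized datasets."""
    real_frac = bucket_fractions(real, width)
    synth_frac = bucket_fractions(synthetic, width)
    return all(
        abs(real_frac.get(b, 0.0) - synth_frac.get(b, 0.0)) <= delta
        for b in set(real_frac) | set(synth_frac)
    )
```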

Visual inspection of generated geo data

It turned out to be very useful to visually inspect the data quality of generated datasets. You can easily fool yourself into believing the data quality is appropriate if you only use some arbitrary abstract numerical metric. Fooling your eyes is altogether more difficult.

Generating a high-quality two-dimensional latitude/longitude dataset is quite trivial and you are likely to get very good results. Three-dimensional ones aren't too bad either, but in my experience, once you add more dimensions, correlations quickly start to suffer. This will be immediately obvious when visually inspecting geo-location data. I therefore encourage you to set up a geo rendering pipeline that renders locations as dots on a map. It will make correlation artifacts hard to ignore. The NYC taxi database is a good candidate for this (for example, you wouldn't expect a vertical line of dots in the middle of the water...)
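Even a plain scatter plot exposes most artifacts; here is a minimal sketch, assuming the datasets expose latitude/longitude columns, and using matplotlib rather than any particular mapping stack:

```python
import matplotlib.pyplot as plt

def plot_locations(real, synthetic):
    """Render real and synthesized locations side by side as dots.

    `real` and `synthetic` are assumed to be (latitudes, longitudes)
    tuples of equal-length sequences. Artifacts such as grid patterns
    or points over water tend to jump out immediately.
    """
    fig, axes = plt.subplots(1, 2, figsize=(12, 6), sharex=True, sharey=True)
    datasets = [("Real", real), ("Synthesized", synthetic)]
    for ax, (title, (lats, lons)) in zip(axes, datasets):
        ax.scatter(lons, lats, s=1, alpha=0.3)
        ax.set_title(title)
        ax.set_xlabel("longitude")
        ax.set_ylabel("latitude")
    plt.show()
```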

Comparing distribution characteristics

The Accord library you have in place for determining characteristics of the numerical distributions could be used to generate characteristics of both the dataset generated using explorer and the raw data. These parameters could then be compared, allowing for a certain delta. This does require a pretty good understanding of what these parameters mean and how much deviation can be allowed.
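The Accord specifics aren't shown here, so as a stand-in, here is a sketch of the same idea using scipy.stats in Python: compute moment-based characteristics of both columns and compare them within per-parameter deltas (the tolerance values below are placeholders; choosing them sensibly is exactly the "understanding what these parameters mean" part):

```python
import numpy as np
from scipy import stats

# Per-parameter tolerances are placeholders, not validated thresholds.
DELTAS = {"mean": 0.1, "std": 0.1, "skewness": 0.25, "kurtosis": 0.5}

def characteristics(values):
    """Summarize a numerical column by its first four moments."""
    values = np.asarray(values, dtype=float)
    return {
        "mean": values.mean(),
        "std": values.std(),
        "skewness": stats.skew(values),
        "kurtosis": stats.kurtosis(values),
    }

def distributions_match(raw, generated, deltas=DELTAS):
    """Compare the characteristics of the raw and generated columns,
    allowing a per-parameter delta."""
    a, b = characteristics(raw), characteristics(generated)
    return all(abs(a[k] - b[k]) <= deltas[k] for k in deltas)
```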
