Description
Following our meeting I thought I'd document a few ways one could validate data quality:
For single or multi-column correlated generated datasets
Compare the generated data with the raw, unanonymized equivalent of the Aircloaked data source. One way to do this is to subdivide both datasets in the same way and compare what fraction of the values falls into each subdivision.
Example:
- The dataset is latitudes, longitudes, salary, and house colour
- You subdivide both datasets (the one generated with explorer and the real data) into latitude buckets of width X, longitude buckets of width Y, salary buckets of width Z, and colour, and check that each such subdivision holds the same fraction of the overall values (within a delta) in the two datasets
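The bucket comparison described above could be sketched roughly as follows. This is only an illustration under a few assumptions: rows are plain dicts, the function names `bucket_fractions` and `compare_fractions` are made up for this example, and the bucket widths and delta are whatever you decide is reasonable for the dataset.

```python
from collections import Counter
import math


def bucket_fractions(rows, widths):
    """Return the fraction of rows that falls into each combined bucket.

    rows   -- list of dicts, e.g. {"lat": 40.7, "lon": -74.0, "salary": 52000, "colour": "red"}
    widths -- bucket width per numeric column; columns not listed (e.g. colour) are used as-is
    """
    counts = Counter()
    for row in rows:
        # One key per row: the bucket index for numeric columns, the raw value otherwise.
        key = tuple(
            math.floor(row[col] / widths[col]) if col in widths else row[col]
            for col in sorted(row)
        )
        counts[key] += 1
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}


def compare_fractions(real, generated, widths, delta):
    """Return the buckets whose fractions differ by more than delta between the datasets."""
    real_f = bucket_fractions(real, widths)
    gen_f = bucket_fractions(generated, widths)
    mismatches = {}
    for key in real_f.keys() | gen_f.keys():
        diff = abs(real_f.get(key, 0.0) - gen_f.get(key, 0.0))
        if diff > delta:
            mismatches[key] = diff
    return mismatches
```

An empty result from `compare_fractions(real_rows, generated_rows, {"lat": 0.01, "lon": 0.01, "salary": 5000}, delta=0.02)` would mean every subdivision matches within the delta. Buckets that exist in only one dataset count as a fraction of zero in the other, so missing or spurious combinations show up as mismatches too.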
Visual inspection of generated geo data
It turned out to be very useful to visually inspect the data quality of generated datasets. You can easily fool yourself into believing the data quality is appropriate if you only use some arbitrary abstract numerical metric. Fooling your eyes is altogether more difficult.
Generating a high-quality two-dimensional latitude/longitude dataset is fairly straightforward and you are likely to get very good results. Three-dimensional ones aren't too bad either, but in my experience correlations quickly start to suffer once you add more dimensions. This becomes immediately obvious when visually inspecting geo-location data. I therefore encourage you to set up a geo rendering pipeline that renders locations as dots on a map. It will make correlation artifacts hard to ignore. The NYC taxi database is a good candidate for this (for example, you wouldn't expect a vertical line of dots in the middle of the water...).
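As a starting point for such a rendering step, here is a minimal sketch that draws points as dots into an SVG file using only the Python standard library. It is not a full map pipeline: there is no base map, the function name `render_points_svg` is hypothetical, and the projection is a plain equirectangular mapping, which I'm assuming is good enough for a city-sized area.

```python
def render_points_svg(points, bounds, width=800, height=800, radius=1.5):
    """Return an SVG document (as a string) with one dot per (lat, lon) point.

    points -- iterable of (lat, lon) tuples
    bounds -- (min_lat, max_lat, min_lon, max_lon) of the area to draw
    """
    min_lat, max_lat, min_lon, max_lon = bounds
    circles = []
    for lat, lon in points:
        if not (min_lat <= lat <= max_lat and min_lon <= lon <= max_lon):
            continue  # skip points outside the drawn area
        x = (lon - min_lon) / (max_lon - min_lon) * width
        y = (max_lat - lat) / (max_lat - min_lat) * height  # flip so north is up
        circles.append(
            f'<circle cx="{x:.1f}" cy="{y:.1f}" r="{radius}" '
            f'fill="black" fill-opacity="0.3"/>'
        )
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
        + "".join(circles)
        + "</svg>"
    )
```

Rendering the real and the generated points into two files with the same `bounds` (for NYC, a rough box such as `(40.5, 40.95, -74.3, -73.65)`) and opening them side by side in a browser should be enough to spot artifacts like straight lines of dots. The semi-transparent fill makes dense areas appear darker.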
Comparing distribution characteristics
The Accord library you have in place for determining the characteristics of numerical distributions could be used to compute those characteristics for both the dataset generated using explorer and the raw data. The resulting parameters could then be compared, allowing for a certain delta. This does require a pretty good understanding of what these parameters mean and how much deviation can be allowed.
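To illustrate the idea of comparing characteristics within a tolerance, here is a small sketch using Python's `statistics` module as a stand-in. It does not use Accord, and the chosen characteristics (mean, standard deviation, quartiles), the helper names, and the relative tolerance are all assumptions made for the example rather than what explorer actually computes.

```python
import statistics


def summarize(values):
    """Compute a few summary characteristics of a numeric column (needs >= 2 values)."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.fmean(values),
        "stdev": statistics.stdev(values),
        "q1": q1,
        "median": median,
        "q3": q3,
    }


def compare_summaries(real, generated, rel_tol):
    """Return the characteristics whose relative deviation exceeds rel_tol."""
    real_s, gen_s = summarize(real), summarize(generated)
    failures = {}
    for name, expected in real_s.items():
        scale = abs(expected) or 1.0  # fall back to absolute deviation when expected is 0
        deviation = abs(gen_s[name] - expected) / scale
        if deviation > rel_tol:
            failures[name] = deviation
    return failures
```

A single relative tolerance for every characteristic is probably too crude in practice. A shift in the mean and a shift in the standard deviation mean different things, so per-parameter tolerances are likely the more useful design.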