Using reference IDs as stable identifiers in historical time series data #23

@schnerd

Hey folks – wanted to get some clarification on the reliability of using SharedStreets Reference IDs as a stable identifier through time.

Each week we want to process GPS traces into speed profiles matched to the SharedStreets referencing system. To support complex filtering and aggregation across multiple years of data, these results need to be stored in a database (rather than static pbf files) with a schema like the following:

| ss_ref_id | datetime | speed_p85 |
| --- | --- | --- |
| 45f4b95b62f28464caca1f76e48efcb3 | 2018-01-05 07:00:00 | 34 |
| 45f4b95b62f28464caca1f76e48efcb3 | 2018-01-05 08:00:00 | 32 |
| 45f4b95b62f28464caca1f76e48efcb3 | 2018-01-05 09:00:00 | 37 |
| ... | ... | ... |

*this schema may also contain Location Reference column(s) in practice

With each new week, we map-match data against the latest version of OSM. Since SharedStreets Reference IDs are essentially hashes of the underlying geospatial data with a tolerance of +/- 1.1m (as pointed out in sharedstreets/sharedstreets-js#16), any OSM update that moves an intersection by more than about 1m will lead to new Reference IDs:

| ss_ref_id | datetime | speed_p85 |
| --- | --- | --- |
| ... | ... | ... |
| 45f4b95b62f28464caca1f76e48efcb3 | 2018-01-07 23:00:00 | 42 |
| New week (OSM updated) | | |
| 763c212d53f8b4ba4fce92e884988c9e | 2018-01-08 00:00:00 | 43 |
| 763c212d53f8b4ba4fce92e884988c9e | 2018-01-08 01:00:00 | 42 |
| ... | ... | ... |
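
To make the concern concrete, here is a minimal sketch of why a small geometry edit produces a completely different ID. It is not the SharedStreets hashing algorithm: `toy_ref_id`, the message layout, and the coordinates are all made up for illustration, and I'm assuming the tolerance comes from rounding coordinates to 5 decimal places (0.00001 degrees of latitude is roughly 1.1m).

```python
import hashlib


def toy_ref_id(points, decimals=5):
    """Hypothetical ID: MD5 over coordinates rounded to `decimals` places.

    This only imitates the general idea of hashing rounded geometry;
    it is NOT the actual SharedStreets reference hash.
    """
    message = "Reference " + " ".join(
        f"{lon:.{decimals}f} {lat:.{decimals}f}" for lon, lat in points
    )
    return hashlib.md5(message.encode("utf-8")).hexdigest()


# Made-up intersection coordinates (lon, lat) for one street segment.
original = [(-74.004820, 40.741140), (-74.003900, 40.742400)]

# Sub-tolerance jitter: first intersection nudged by 0.000002 degrees (~0.2 m).
jittered = [(-74.004820, 40.741142), (-74.003900, 40.742400)]

# Real edit: first intersection moved by 0.00003 degrees of latitude (~3.3 m).
moved = [(-74.004820, 40.741170), (-74.003900, 40.742400)]

same_after_jitter = toy_ref_id(original) == toy_ref_id(jittered)
same_after_move = toy_ref_id(original) == toy_ref_id(moved)

print("jitter keeps ID:", same_after_jitter)
print("3 m move keeps ID:", same_after_move)
```

Under these assumptions the jittered geometry hashes to the same ID while the 3m move does not, and nothing in the two hex strings indicates that they describe the same street.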

These changing IDs would prevent us from being able to easily run aggregate queries over long periods of time – for example, to create a histogram of speeds on a segment over all of 2017. It's also ambiguous which version of tiles we would load in this scenario – if we pick tiles from the end of 2017, many of our reference IDs from earlier in the year will not match up.
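
As a rough illustration of the aggregation problem, the sketch below loads the example rows from this issue into an in-memory SQLite table and groups by ID. The table name `speeds` and the choice of SQLite are assumptions made for the example; our real store would differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE speeds (ss_ref_id TEXT, datetime TEXT, speed_p85 INTEGER)"
)

# Rows copied from the example tables in this issue: the same physical
# street, observed before and after the weekly OSM refresh changed its ID.
observations = [
    ("45f4b95b62f28464caca1f76e48efcb3", "2018-01-05 07:00:00", 34),
    ("45f4b95b62f28464caca1f76e48efcb3", "2018-01-05 08:00:00", 32),
    ("45f4b95b62f28464caca1f76e48efcb3", "2018-01-05 09:00:00", 37),
    ("45f4b95b62f28464caca1f76e48efcb3", "2018-01-07 23:00:00", 42),
    ("763c212d53f8b4ba4fce92e884988c9e", "2018-01-08 00:00:00", 43),
    ("763c212d53f8b4ba4fce92e884988c9e", "2018-01-08 01:00:00", 42),
]
conn.executemany("INSERT INTO speeds VALUES (?, ?, ?)", observations)

# A per-segment aggregate keyed on the reference ID.
per_id = conn.execute(
    "SELECT ss_ref_id, COUNT(*), AVG(speed_p85) "
    "FROM speeds GROUP BY ss_ref_id ORDER BY ss_ref_id"
).fetchall()

for ref_id, n_obs, avg_speed in per_id:
    print(ref_id, n_obs, avg_speed)
```

The one street comes back as two unrelated groups (4 rows and 2 rows), so any histogram keyed on `ss_ref_id` is split at the week boundary unless the IDs are reconciled first.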

From sharedstreets/sharedstreets-js#16, it sounds like these hash IDs were never meant to match up across datasets / basemap versions, and that fuzzy matching on the underlying geospatial data is instead the way to reconcile them. However, this requires a non-trivial amount of work and seems like a significant divergence from OSMLR, which had tolerance levels of ~20m to make these identifiers more stable.

From #22, it sounds like there may be ways to subscribe to changing SS References in the future, but it could be cumbersome to continuously apply these migrations to historical datasets with billions of observations.

While it's not a panacea, it seems like generating IDs using a higher tolerance level for underlying geospatial changes would increase stability and the likelihood that datasets/tiles continue matching. Is there any reason the referencing system isn't designed this way?
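
Here is a small sketch of what I mean by a higher tolerance, and of why it is not a panacea. It assumes the tolerance is implemented by rounding coordinates before hashing; `coarse_id` and the coordinates are hypothetical and not part of any SharedStreets library. Rounding to 4 decimal places corresponds to roughly 11m of latitude instead of roughly 1.1m.

```python
import hashlib


def coarse_id(points, decimals):
    """Hypothetical ID: MD5 over coordinates rounded to `decimals` places."""
    text = " ".join(
        f"{lon:.{decimals}f} {lat:.{decimals}f}" for lon, lat in points
    )
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# A made-up intersection and the same intersection moved ~3.3 m north.
old = [(-74.00482, 40.74121)]
moved_3m = [(-74.00482, 40.74124)]

# Fine rounding (5 decimals, ~1.1 m): the move changes the ID.
fine_stable = coarse_id(old, 5) == coarse_id(moved_3m, 5)

# Coarse rounding (4 decimals, ~11 m): the move is absorbed.
coarse_stable = coarse_id(old, 4) == coarse_id(moved_3m, 4)

# Caveat: rounding has cell boundaries. These two points are only ~2.2 m
# apart but sit on opposite sides of a 4-decimal rounding boundary.
near_edge = [(-74.00482, 40.74124)]
across_edge = [(-74.00482, 40.74126)]
edge_stable = coarse_id(near_edge, 4) == coarse_id(across_edge, 4)

print("5 decimals, 3 m move stable:", fine_stable)
print("4 decimals, 3 m move stable:", coarse_stable)
print("4 decimals, 2 m move across a boundary stable:", edge_stable)
```

With these inputs the coarser rounding absorbs the 3m move, but a smaller move that happens to cross a rounding boundary still changes the ID, so a higher tolerance reduces churn without eliminating it.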
