Using reference IDs as stable identifiers in historical time series data #23

Hey folks – wanted to get some clarification on the reliability of using SharedStreets Reference IDs as a stable identifier through time.

Each week we want to process GPS traces into speed profiles matched to the SharedStreets referencing system. To support complex filtering and aggregation across multiple years of data, these results need to be stored in a database (rather than static pbf files) with a schema like the following:

ss_ref_id | datetime | speed_p85
:--- | :--- | :---
45f4b95b62f28464caca1f76e48efcb3 | 2018-01-05 07:00:00 | 34
45f4b95b62f28464caca1f76e48efcb3 | 2018-01-05 08:00:00 | 32
45f4b95b62f28464caca1f76e48efcb3 | 2018-01-05 09:00:00 | 37
... | ... | ...
##### _*this schema may also contain Location Reference column(s) in practice_

With each new week, we map-match data against the latest version of OSM. Since SharedStreets Reference IDs are essentially hashes of the underlying geospatial data with a tolerance of +/- 1.1m (as pointed out in https://github.com/sharedstreets/sharedstreets-js/issues/16), any OSM update that moves an intersection by more than about 1m will lead to new Reference IDs:

ss_ref_id | datetime | speed_p85
:--- | :--- | :---
... | ... | ...
45f4b95b62f28464caca1f76e48efcb3 | 2018-01-07 23:00:00 | 42
_New week (OSM updated)_ |  | 
763c212d53f8b4ba4fce92e884988c9e | 2018-01-08 00:00:00 | 43
763c212d53f8b4ba4fce92e884988c9e | 2018-01-08 01:00:00 | 42
... | ... | ...
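
To make the sensitivity concrete, here is a toy sketch of a geometry-derived ID. It is not the SharedStreets hashing algorithm: `toy_ref_id`, the message format, and the coordinates are all made up for illustration. The only thing it borrows from the discussion above is the idea of hashing coordinates rounded to 5 decimal places (roughly 1.1m).

```python
import hashlib


def toy_ref_id(points, decimals=5):
    """Hash a list of (lon, lat) pairs after rounding each coordinate.

    Hypothetical stand-in for a geometry-derived ID, not the real
    SharedStreets algorithm.
    """
    message = " ".join(
        f"{lon:.{decimals}f} {lat:.{decimals}f}" for lon, lat in points
    )
    return hashlib.md5(message.encode("utf-8")).hexdigest()


# A made-up segment between two intersections.
original = [(-74.004820, 40.741250), (-74.003900, 40.741800)]

# 0.000002 degrees of latitude is roughly 0.2 m: below the rounding precision.
jittered = [(-74.004820, 40.741252), (-74.003900, 40.741800)]

# 0.00002 degrees of latitude is roughly 2 m: the fifth decimal place changes.
moved = [(-74.004820, 40.741270), (-74.003900, 40.741800)]

print(toy_ref_id(original) == toy_ref_id(jittered))  # True
print(toy_ref_id(original) == toy_ref_id(moved))     # False
```

Under these assumptions a ~2m edit to a single intersection is enough to produce an unrelated-looking ID for the whole segment.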

These changing IDs would prevent us from being able to easily run aggregate queries over long periods of time – for example, to create a histogram of speeds on a segment over all of 2017. It's also ambiguous which version of tiles we would load in this scenario – if we pick tiles from the end of 2017, many of our reference IDs from earlier in the year will not match up.
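
A minimal sketch of the problem, using an in-memory SQLite database and the example rows from the tables above. The table name `speeds` and the choice of SQLite are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE speeds (ss_ref_id TEXT, datetime TEXT, speed_p85 INTEGER)"
)

# Rows copied from the example tables; both IDs describe the same street.
rows = [
    ("45f4b95b62f28464caca1f76e48efcb3", "2018-01-05 07:00:00", 34),
    ("45f4b95b62f28464caca1f76e48efcb3", "2018-01-05 08:00:00", 32),
    ("45f4b95b62f28464caca1f76e48efcb3", "2018-01-05 09:00:00", 37),
    ("45f4b95b62f28464caca1f76e48efcb3", "2018-01-07 23:00:00", 42),
    ("763c212d53f8b4ba4fce92e884988c9e", "2018-01-08 00:00:00", 43),
    ("763c212d53f8b4ba4fce92e884988c9e", "2018-01-08 01:00:00", 42),
]
conn.executemany("INSERT INTO speeds VALUES (?, ?, ?)", rows)

# Grouping by ID splits one physical segment into two groups.
per_id = conn.execute(
    "SELECT ss_ref_id, COUNT(*), AVG(speed_p85) "
    "FROM speeds GROUP BY ss_ref_id ORDER BY ss_ref_id"
).fetchall()
print(per_id)

# Filtering by the ID found in the latest tiles silently drops older rows.
latest_only = conn.execute(
    "SELECT COUNT(*) FROM speeds WHERE ss_ref_id = ?",
    ("763c212d53f8b4ba4fce92e884988c9e",),
).fetchone()[0]
print(latest_only)  # 2 of the 6 observations
```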

From https://github.com/sharedstreets/sharedstreets-js/issues/16, it sounds like these hash IDs were never meant to match up across datasets / basemap versions, and that fuzzy matching on the underlying geospatial data is instead the way to reconcile them. However, this requires a non-trivial amount of work and seems like a significant divergence from OSMLR, which had [tolerance levels of ~20m to make these identifiers more stable](https://github.com/opentraffic/osmlr/blob/master/docs/osmlr_updates.md).
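
For a sense of what that reconciliation involves, here is a rough sketch of endpoint-based fuzzy matching between two basemap versions. Everything in it is hypothetical: the reference records are reduced to a start and end point, the IDs and coordinates are invented, and the 20m default simply mirrors the OSMLR figure above. A real matcher would also need to compare bearings, lengths, and road classes, and would need a spatial index rather than this brute-force loop.

```python
import math


def haversine_m(a, b):
    """Great-circle distance in metres between two (lon, lat) points."""
    lon1, lat1, lon2, lat2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * 6371000 * math.asin(math.sqrt(h))


def match_references(old_refs, new_refs, tolerance_m=20.0):
    """Map each old ID to the closest new ID whose start and end points
    both lie within tolerance_m of the old ones."""
    mapping = {}
    for old_id, (old_start, old_end) in old_refs.items():
        best = None
        for new_id, (new_start, new_end) in new_refs.items():
            d = max(
                haversine_m(old_start, new_start),
                haversine_m(old_end, new_end),
            )
            if d <= tolerance_m and (best is None or d < best[0]):
                best = (d, new_id)
        if best is not None:
            mapping[old_id] = best[1]
    return mapping


# Made-up references: id -> (start, end), each point as (lon, lat).
old_refs = {"old-ref": ((-74.00482, 40.74125), (-74.00390, 40.74180))}
new_refs = {
    "new-ref-moved": ((-74.00482, 40.74127), (-74.00390, 40.74180)),
    "new-ref-elsewhere": ((-74.01000, 40.75000), (-74.00900, 40.75100)),
}

print(match_references(old_refs, new_refs))
# {'old-ref': 'new-ref-moved'}
```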

From #22, it sounds like there may be ways to subscribe to changing SS References in the future, but it could be cumbersome to continuously apply these migrations to historical datasets with billions of observations.
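
One way to picture those migrations: if each basemap update produced a list of old ID to new ID pairs, the chains would have to be collapsed before historical rows could be compared. The sketch below assumes such a mapping exists, which is not something the project provides today as far as I know. The third ID is a placeholder, not a real reference.

```python
def resolve_latest(migrations):
    """Collapse chained old_id -> new_id migrations so that every
    historical ID points directly at its most recent successor."""
    latest = {}
    for old_id in migrations:
        current = old_id
        seen = {current}
        while current in migrations:
            current = migrations[current]
            if current in seen:
                raise ValueError(f"cycle in migrations at {current}")
            seen.add(current)
        latest[old_id] = current
    return latest


# Hypothetical weekly migrations for one street segment.
migrations = {
    "45f4b95b62f28464caca1f76e48efcb3": "763c212d53f8b4ba4fce92e884988c9e",
    "763c212d53f8b4ba4fce92e884988c9e": "placeholder-id-after-next-update",
}

latest = resolve_latest(migrations)
print(latest["45f4b95b62f28464caca1f76e48efcb3"])
# placeholder-id-after-next-update
```

Keeping this as a small lookup table joined at query time might avoid rewriting the observations themselves, but it still has to be rebuilt and validated after every basemap update.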

While it's not a panacea, it seems like generating IDs using a higher tolerance level for underlying geospatial changes would increase stability and the likelihood that datasets/tiles continue matching. Is there any reason the referencing system isn't designed this way?
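
To illustrate both the benefit and the "not a panacea" part, here is a small sketch that keys a point by its rounded coordinates at two precisions. `grid_key` and the coordinates are invented for illustration, and simple rounding is only one possible way to build in tolerance; it is not a description of how SharedStreets or OSMLR generate IDs.

```python
def grid_key(lon, lat, decimals):
    """Key a point by its coordinates rounded to a fixed number of decimals.

    5 decimals is roughly a 1.1 m grid in latitude, 4 decimals roughly 11 m.
    """
    return (f"{lon:.{decimals}f}", f"{lat:.{decimals}f}")


# Two made-up intersections, each shifted north by 0.00002 degrees (about 2 m).
mid_cell_before, mid_cell_after = (-74.00482, 40.74122), (-74.00482, 40.74124)
near_edge_before, near_edge_after = (-74.00482, 40.74144), (-74.00482, 40.74146)

# At 5 decimals both shifts change the key.
print(grid_key(*mid_cell_before, 5) == grid_key(*mid_cell_after, 5))    # False
print(grid_key(*near_edge_before, 5) == grid_key(*near_edge_after, 5))  # False

# At 4 decimals the first shift is absorbed...
print(grid_key(*mid_cell_before, 4) == grid_key(*mid_cell_after, 4))    # True

# ...but a point sitting near a rounding boundary still flips.
print(grid_key(*near_edge_before, 4) == grid_key(*near_edge_after, 4))  # False
```

So a coarser grid makes most small edits invisible to the ID, but any fixed rounding scheme leaves some points close enough to a boundary that a tiny move still changes the key.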
