Hey folks – wanted to get some clarification on the reliability of using SharedStreets Reference IDs as stable identifiers over time.
Each week we want to process GPS traces into speed profiles matched to the SharedStreets referencing system. To support complex filtering and aggregation across multiple years of data, these results need to be stored in a database (rather than static pbf files) with a schema like the following:
| ss_ref_id | datetime | speed_p85 |
|---|---|---|
| 45f4b95b62f28464caca1f76e48efcb3 | 2018-01-05 07:00:00 | 34 |
| 45f4b95b62f28464caca1f76e48efcb3 | 2018-01-05 08:00:00 | 32 |
| 45f4b95b62f28464caca1f76e48efcb3 | 2018-01-05 09:00:00 | 37 |
| ... | ... | ... |
*This schema may also contain Location Reference column(s) in practice.*
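For concreteness, a table like the one above can be sketched with Python's standard-library `sqlite3` module (an in-memory database for illustration only; the table and column names are ours, not part of SharedStreets):

```python
import sqlite3

# In-memory database for illustration; production would use Postgres etc.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE speed_profiles (
        ss_ref_id TEXT NOT NULL,      -- SharedStreets Reference ID (hex hash)
        datetime  TEXT NOT NULL,      -- hour bucket, ISO-8601
        speed_p85 INTEGER NOT NULL,   -- 85th-percentile speed for that hour
        PRIMARY KEY (ss_ref_id, datetime)
    )
""")
conn.execute(
    "INSERT INTO speed_profiles VALUES (?, ?, ?)",
    ("45f4b95b62f28464caca1f76e48efcb3", "2018-01-05 07:00:00", 34),
)
row = conn.execute("SELECT speed_p85 FROM speed_profiles").fetchone()
print(row[0])  # → 34
```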
With each new week, we mapmatch data against the latest version of OSM. Since SharedStreets Reference IDs are essentially hashes of the underlying geospatial data with a tolerance of +/- 1.1m (as pointed out in sharedstreets/sharedstreets-js#16), any OSM update that moves an intersection by more than ~1m will produce new Reference IDs:
| ss_ref_id | datetime | speed_p85 |
|---|---|---|
| ... | ... | ... |
| 45f4b95b62f28464caca1f76e48efcb3 | 2018-01-07 23:00:00 | 42 |
| *New week (OSM updated)* | | |
| 763c212d53f8b4ba4fce92e884988c9e | 2018-01-08 00:00:00 | 43 |
| 763c212d53f8b4ba4fce92e884988c9e | 2018-01-08 01:00:00 | 42 |
| ... | ... | ... |
These changing IDs would prevent us from easily running aggregate queries over long periods of time – for example, to create a histogram of speeds on a segment over all of 2017. It's also ambiguous which version of tiles we would load in this scenario – if we pick tiles from the end of 2017, many of our Reference IDs from earlier in the year will not match up.
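A small sketch of how a naive yearly aggregate silently splits when the ID changes mid-year (table and column names as in the illustrative schema above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE speed_profiles (ss_ref_id TEXT, datetime TEXT, speed_p85 INTEGER)"
)
conn.executemany("INSERT INTO speed_profiles VALUES (?, ?, ?)", [
    # Same physical street, but the ID changed mid-year after an OSM edit.
    ("45f4b95b62f28464caca1f76e48efcb3", "2017-03-01 08:00:00", 34),
    ("763c212d53f8b4ba4fce92e884988c9e", "2017-09-01 08:00:00", 36),
])
# Per-segment aggregation over 2017: the two rows land in separate groups
# even though they describe the same street.
rows = conn.execute("""
    SELECT ss_ref_id, COUNT(*)
    FROM speed_profiles
    WHERE datetime LIKE '2017-%'
    GROUP BY ss_ref_id
""").fetchall()
print(len(rows))  # → 2 groups, not the 1 segment we'd expect
```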
From sharedstreets/sharedstreets-js#16, it sounds like these hash IDs were never meant to match up across datasets / basemap versions, and that fuzzy matching on the underlying geospatial data is instead the way to reconcile them. However, this requires a non-trivial amount of work and seems like a significant divergence from OSMLR, which used tolerance levels of ~20m to make its identifiers more stable.
From #22, it sounds like there may be ways to subscribe to changing SS References in the future, but it could be cumbersome to continuously apply these migrations to historical datasets with billions of observations.
While it's not a panacea, generating IDs with a higher tolerance for underlying geospatial changes would seem to increase stability and the likelihood that datasets/tiles continue to match. Is there a reason the referencing system isn't designed this way?
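The tolerance argument can be illustrated with a toy model – this is *not* the actual SharedStreets algorithm (which hashes a full location-reference string), just coordinate-rounding-before-hashing to show how the grid size controls ID stability:

```python
import hashlib

def toy_ref_id(start, end, precision):
    # Hypothetical stand-in for a SharedStreets-style ID: snap the segment
    # endpoints to a lon/lat grid of the given decimal precision, then hash.
    coords = [round(c, precision) for c in (*start, *end)]
    return hashlib.md5(repr(coords).encode()).hexdigest()

before = ((-122.40000, 37.78000), (-122.39900, 37.78100))
after  = ((-122.40000, 37.78000), (-122.39898, 37.78100))  # end moved ~2m

# 5 decimal places ≈ 1.1m tolerance: the ~2m nudge yields a brand-new ID.
stable_at_1m = toy_ref_id(*before, 5) == toy_ref_id(*after, 5)
# 4 decimal places ≈ 11m tolerance: the nudge rounds away and the ID survives.
stable_at_11m = toy_ref_id(*before, 4) == toy_ref_id(*after, 4)
print(stable_at_1m, stable_at_11m)  # → False True
```

The trade-off, of course, is that a coarser grid also merges genuinely distinct nearby geometries, which is presumably why the tolerance question isn't trivial.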