From 17936650988fcf5557bd9efec7464593dec58f7a Mon Sep 17 00:00:00 2001
From: Sarah Hoffmann
Date: Fri, 10 Oct 2025 14:00:12 +0200
Subject: [PATCH 1/6] typo fix

---
 docs/user_manual/07-Input-Formats-And-Other-Sources.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/user_manual/07-Input-Formats-And-Other-Sources.md b/docs/user_manual/07-Input-Formats-And-Other-Sources.md
index bfcce0f..0e96e6f 100644
--- a/docs/user_manual/07-Input-Formats-And-Other-Sources.md
+++ b/docs/user_manual/07-Input-Formats-And-Other-Sources.md
@@ -55,7 +55,7 @@ The special file name `-` can be used to read from standard input or write to standard output. When reading data, use a `File` object to specify the file format. With
-the SimpleReader, you need to use the parameter `filetype`.
+the SimpleWriter, you need to use the parameter `filetype`.
 !!! example
     This code snipped dumps all ids of your input file to the console.
From fabc07765fde8f4819dd97cd4ca4ace39e45f775 Mon Sep 17 00:00:00 2001
From: Sarah Hoffmann
Date: Mon, 20 Oct 2025 16:50:05 +0200
Subject: [PATCH 2/6] docs: change file chapter

---
 .../08-Working-With-Change-Files.md | 202 +++++++++++++++++-
 1 file changed, 201 insertions(+), 1 deletion(-)

diff --git a/docs/user_manual/08-Working-With-Change-Files.md b/docs/user_manual/08-Working-With-Change-Files.md
index e63e4cc..baf973d 100644
--- a/docs/user_manual/08-Working-With-Change-Files.md
+++ b/docs/user_manual/08-Working-With-Change-Files.md
@@ -1,3 +1,203 @@
 # Working With Change Files
-OpenStreetMap produces two kinds of data, full data files and diff files with updates. This chapter explains how to handle diff files.
+OpenStreetMap is a database that is constantly changing. Data is added,
+improved and refined by the thousands of contributors around the world.
+A standard OpenStreetMap data file, like the "planet file", only represents
+a _snapshot_ of the OpenStreetMap database: the state of the world at a given
+point in time.
For many applications it is completely sufficient to work
+with such a snapshot. To get the latest version of the data, they can
+simply download another snapshot and process the data from scratch.
+For some applications, however, it is necessary to keep their view of
+OpenStreetMap always up-to-date with the latest version of the data.
+This is where _OSM change files_ (also known as _OSM diff files_)
+come into play.
+
+## OSM change files
+
+OpenStreetMap is a database with immutable data: objects are never directly edited.
+Whenever you change an object, a new version of the object is created. In a similar
+vein, it is not really possible to delete an object. You can only create
+a new version of the object that is marked as being invisible.
+
+So, technically speaking, an OSM database knows only one operation: adding a
+new version of an object. Change files are a collection of these additions to
+the main database over a given time period.
+
+When it comes to the file format and content, an OSM change file does not
+differ greatly from a normal OSM file: a change file is a collection of nodes,
+ways and relations. There are just some meta attributes of the objects that
+are more relevant than before: the **version** number and the **deleted** flag.
+
+A single modification may describe one of three operations:
+
+* **Creation.** A new node, way or relation is added. It always has the
+  version number 1.
+* **Modification.** An existing object was directly modified: tags of the object,
+  the location of a node, the node list of a way or the member list of a
+  relation have been changed. A modified object has a version number larger
+  than 1 and the deleted flag is not set.
+* **Deletion.** The object has been marked as deleted. The deleted flag is
+  set to true. Deleted objects are usually not included in normal snapshot files.
+  This is an important difference when working with change files.
+
+There is no explicit _undelete_ operation.
To undo a deletion, one can simply
+_modify_ the object again and create a new version of the object with the
+deleted flag set to false.
+
+!!! danger
+    Undoing a deletion is an action that regularly happens in the
+    OpenStreetMap database. It can, for example, happen when the edit of a
+    user gets reverted because it was bogus.
+    Always be prepared for an object to reappear.
+
+## About replication services
+
+There are various ways to produce an OSM change file but by far the most
+important source of change files is _replication services_. These services
+publish at regular intervals all changes to the data that have happened in
+the area of interest.
+
+Change files from replication services are
+consecutively numbered. Each change file is guaranteed to start exactly where
+the previous one ended, so that by applying one change file after another to
+an existing planet (or extract), you can get a new complete snapshot
+that corresponds to a newer version of the OSM database.
+
+Each change file is accompanied by a state file which has information about
+the time of the change file. This can be used to find the right synchronisation
+point when working with OSM files from different sources.
+
+## Full vs. simplified change files
+
+A change file that comes from a replication service usually contains the
+full set of changes for a given time span. This means in particular that
+there may be multiple versions of the same object when it was changed
+multiple times in a very short time. In OSM, these kinds of change files
+are referred to as being _full_.
+
+There are many use cases where these intermediate versions
+are not really of interest. When updating a planet file, only the latest
+version of an object will remain. For these use cases, change files may be
+_simplified_. A simplified change file only keeps the latest version of
+each object. Programs like `pyosmium-get-changes` can produce either version
+of a change file.
Usually, you want to work with simplified change files
+unless you are interested in the exact history of changes to OSM.
+
+
+## Referential integrity of change files
+
+We have previously discussed that the OSM data format is a topological format:
+ways contain references to nodes and relations can contain references to any
+kind of OSM object. This has two important implications for change files:
+
+* __Referential incompleteness__
+  When a way or relation is changed, then only the way or relation
+  itself will be included in a normal change file. This means, for example, that
+  you will usually not be able to reconstruct the geometry of a changed way
+  by looking at a change file alone. The necessary information about the
+  location of the way is saved in its nodes and these nodes will not be
+  present in the change file if they haven't been changed. You cannot
+  even rely on them being present when a way or relation has been newly created: the
+  nodes or members the new object references may have been created
+  many months ago.
+
+* __Indirect modifications__
+  When a node that is part of a way is moved to a different location, then the
+  geometry of the way is changed. The OSM change file, however, will only contain
+  the new version of the node with the new location. The way itself has not
+  been changed: it still refers to the same list of nodes. Therefore the way
+  does not appear in the change file even though it might need to be updated
+  in your data.
+
+The remainder of this section discusses how references can be resolved when
+working with change files.
+
+## Strategies for resolving forward and backward references for change files
+
+A change file can only ever be fully interpreted in conjunction with a snapshot
+of the planet that corresponds to the time of the first change in the
+change file. Keeping such a planet snapshot is unfortunately not an easy
+task because of the sheer size of the full planet data.
However, there are
+some shortcuts available depending on what kind of data you are interested in.
+
+### Following changes on nodes
+
+If you are only working with OSM nodes, no special provisions are necessary.
+Every change in a node will make it appear in the change file.
+
+### Following changes on ways
+
+Ways reference nodes. To derive the node geometry of a changed way, you need
+to keep track of the location of each node. This process is very similar to
+the process of tracking nodes when
+[creating way geometries](03-Working-with-Geometries.md/#line-geometries).
+You need to add a location cache when processing the change file. The
+main difference is that the location cache needs to be made persistent in a file
+and that it needs to be pre-filled from the locations in your reference
+planet. You can use the location storage type `sparse_file_array` to
+create a persistent file which can be updated. Be aware that such a file
+is around 100 GB in size these days. Populate the file from
+your planet file (here called `planet.osm.pbf`) as follows:
+
+```python
+import osmium
+
+with osmium.io.Reader("planet.osm.pbf", osmium.osm.osm_entity_bits.NODE) as reader:
+    idx = osmium.index.create_map("sparse_file_array,nodecache.data")
+
+    osmium.apply(reader, osmium.NodeLocationsForWays(idx))
+```
+
+After this has run, the file `nodecache.data` will contain the locations of all nodes
+that were found in the input file. Subsequently you can use the node cache
+with your change file just as described in the Geometry chapter:
+
+```python
+import osmium
+
+for obj in osmium.FileProcessor("mychange.osm.xml")\
+        .with_locations("sparse_file_array,nodecache.data"):
+    if obj.is_way():
+        coords = ", ".join((f"{n.lon} {n.lat}" for n in obj.nodes if n.location.valid()))
+        print(f"Way {obj.id}: LINESTRING({coords})")
+```
+
+Note that this piece of code not only uses the locations from the cache file
+but it also __updates__ the cache file with the locations from your change file.
+This is usually what you want. You can process subsequent change files and
+always have the reference to the corresponding locations.
+
+
+!!! tip
+    In theory there are no restrictions on which nodes may be referenced by
+    a way. Thus, in theory, you always need to keep the
+    full set of node locations that exist in OSM. In practice, edits mostly
+    happen in a confined area. Therefore, when your geographic area of
+    interest is limited, it is sufficient to only keep the node locations
+    that fall within that area. Just add a large enough buffer zone to
+    account for nodes being moved around.
+
+## Keeping a full planet snapshot
+
+After a full planet or an extract has been downloaded, it first needs to be
+brought in sync with the replication source of change files you are using.
+This is most easily done with the
+[pyosmium-up-to-date](10-Replication-Tools.md/#updating-a-planet-or-extract) tool.
+It takes an OSM file and a replication source and creates a new OSM file
+which is perfectly synchronised with the replication source. It will also
+tell you the sequence number of the next file to download. Use this
+sequence number to get the next change file. You can either download a
+single change file or use
+[pyosmium-get-changes](10-Replication-Tools.md/#creating-change-files-for-updating-databases) to download multiple change files at once and combine them.
+
+Now that you have the change file, you can use the synchronised planet to
+look up any missing OSM objects referenced in the file. Once you are done with
+processing, you need to merge the synchronised planet with the change file,
+for example using
+[osmium apply-changes](https://docs.osmcode.org/osmium/latest/osmium-apply-changes.html).
+After that you have a new synchronised planet ready for processing with the
+next change file.
+
+If you regularly work with change data, have a look at
+[osm2pgsql](https://osm2pgsql.org/).
This is an application that stores OSM
+data in a PostgreSQL database and also handles updates and change files.
From 9887c969ce238c01bb1bf45f44123c9e834fb5d0 Mon Sep 17 00:00:00 2001
From: Sarah Hoffmann
Date: Tue, 16 Dec 2025 20:17:12 +0100
Subject: [PATCH 3/6] add short introduction to history files

---
 .../09-Working-With-History-Files.md | 56 ++++++++++++++++++-
 1 file changed, 55 insertions(+), 1 deletion(-)

diff --git a/docs/user_manual/09-Working-With-History-Files.md b/docs/user_manual/09-Working-With-History-Files.md
index aca16aa..51c978a 100644
--- a/docs/user_manual/09-Working-With-History-Files.md
+++ b/docs/user_manual/09-Working-With-History-Files.md
@@ -1,3 +1,57 @@
 # Working With History Files
-An OSM data file usually contains data of a snapshot of the OpenStreetMap database at a certain point in time. The full database contains even more data. It has all the changes that were ever made. The full version of the database with the complete history is contained in so called history files. They do require some special attention when processing.
+An OSM data file usually contains the data of a snapshot of the OpenStreetMap
+database at a certain point in time. The main OpenStreetMap database contains
+not only the latest view of the data but every single change ever made.
+This full version of the OpenStreetMap editing history is published in so-called
+_history files_. pyosmium can process these files but they do
+require some special attention.
+
+## Make-up of history files
+
+We have already discussed in the
+[chapter on change files](08-Working-With-Change-Files.md) what
+a single change to an OSM object looks like. It has the same format as an
+ordinary OSM object in a snapshot, except that meta data properties like
+`version` and `deleted` are important to take into account to understand
+what part of the history the object belongs to. That means that history
+files can often be processed with the same tools.
Extra care just needs
+to be taken because each OSM object may appear in the file multiple times in
+different versions.
+
+History files are conventionally sorted by OSM type, OSM ID and version. That
+means that, when reading a history file sequentially like pyosmium does,
+all the versions of an object follow each other in order. However, it also
+means that history files are not sorted by time. And this has important
+implications when resolving references.
+
+## Resolving references in history files
+
+An OSM object only changes when its own properties change. There is no new version
+when the data it refers to changes. For example, an OSM way will get
+a new version when a tag is modified or a node is added to the list of nodes
+that make up the way. When a node that is referenced by the way changes its
+position, then the way remains the same. As a result, when working with way
+geometries or objects created from relation members, it is not unusual that
+between two versions of a way or relation, there are many hidden subversions
+due to modifications to the nodes or members. In order to resolve these
+subversions, you need to keep a cache of all the versions of the nodes, ways
+and relations that are relevant and then use the timestamps of the objects
+to infer which versions of the members are relevant between two versions
+of the parent object.
+
+Note that you cannot use the standard caching mechanisms like the
+[location storage](03-Working-with-Geometries.md#line-geometries)
+or the standard area processor. These will only keep the latest version of
+each object. There are currently no data structures that specifically
+support history files.
+
+!!! danger
+    Timestamps are the only way to resolve the right version of dependent
+    objects. Still, you need to take them with a grain of salt. In the early
+    days of OpenStreetMap, the servers creating the timestamps for new object
+    versions weren't always correctly in sync.
And so it is possible to find
+    referential errors where a way refers to a node that, according to the
+    timestamps, did not exist when the way was created. The synchronisation
+    issues have long since been resolved but you will encounter them when
+    working with historic data.
From 4b17d2fa0d356caf74546d27fe89594b79c19961 Mon Sep 17 00:00:00 2001
From: Sarah Hoffmann
Date: Fri, 19 Dec 2025 10:38:33 +0100
Subject: [PATCH 4/6] more documentation on the different writers

---
 docs/user_manual/06-Writing-Data.md | 60 +++++++++++++++++++++++++++--
 1 file changed, 57 insertions(+), 3 deletions(-)

diff --git a/docs/user_manual/06-Writing-Data.md b/docs/user_manual/06-Writing-Data.md
index b52f162..7adb888 100644
--- a/docs/user_manual/06-Writing-Data.md
+++ b/docs/user_manual/06-Writing-Data.md
@@ -18,6 +18,18 @@ pyosmium will refuse to overwrite any existing files. Either make sure to delete the files before instantiating a writer or use the parameter `overwrite=true`.
+All writers are [context managers](https://docs.python.org/3/reference/datamodel.html#context-managers). To ensure that the file is properly closed in the
+end, the recommended way to use them is in a `with` statement:
+
+!!! example
+    ```python
+    with osmium.SimpleWriter('my_extra_data.osm.pbf') as writer:
+        ...  # do stuff here
+    ```
+
+When a writer is not used inside a `with` block, don't forget to call the `close()`
+function explicitly to close the writer.
+
 Once a writer is instantiated, one of the `add*` functions can be used to add an OSM object to the file. You can either use one of the `add_node/way/relation` functions to force writing a specific type of
@@ -27,9 +39,6 @@ they are given to the writer object. It is your responsibility as a user to make sure that the order is correct with respect to the [conventions for object order][order-in-osm-files].
-After writing all data the writer needs to be closed using the `close()`
-function. It is usually easier to use a writer as a context manager.
-
 Here is a complete example for a script that converts a file from OPL format to PBF format:
@@ -129,3 +138,48 @@ pyosmium implements three different writer classes: the basic the two reference-completing writers [ForwardReferenceWriter][osmium.ForwardReferenceWriter] and [BackReferenceWriter][osmium.BackReferenceWriter].
+
+### Writing specific objects only
+
+The [SimpleWriter][osmium.SimpleWriter] creates an OSM data file by directly
+writing out any OSM object that it receives in the chosen format.
+
+
+### Writing reference-complete files
+
+The [BackReferenceWriter][osmium.BackReferenceWriter] will make sure that the
+file that is written out is reference-complete, meaning all objects that are
+directly referenced by the objects written are added to the output file as well.
+This is needed when you want to make sure that geometries can be recreated
+from the objects in the file.
+
+Creating a file with backward references is a two-stage process: while the
+writer is open, it will write all objects received through one of the `add_*()`
+functions into a temporary file and keep a record of which objects are needed
+to make the file reference-complete. Once the writer is closed, it collects the
+missing objects from a given reference file, merges them with the data from
+the temporary file and writes out the final result.
+
+### Writing files with forward references
+
+The [ForwardReferenceWriter][osmium.ForwardReferenceWriter] completes the
+written objects with forward references. This is particularly useful when
+creating geographic extracts of any kind: one selects the nodes of interest
+in a particular area and then lets the ForwardReferenceWriter complete the
+ways and relations referring to the nodes.
+
+Files written by the ForwardReferenceWriter are not necessarily
+reference-complete.
That is easy to see when considering the example of the
+geographic extract: there may be ways in the area that cross the boundary
+of the area chosen but only the nodes within the area are written out. This
+might be useful in many situations as the way would simply seem to be cut
+at the boundary of the area of interest. However, it has the disadvantage that some objects
+will get invalid geometries, especially when they represent areas.
+
+Indirect references are the other thing to consider during forward completion.
+When completing relations indirectly referenced through ways or other relations,
+the resulting file can become big very quickly. For example, a seemingly
+small extract of the city of Strasbourg can suddenly contain not only the
+relations for France and Germany but also electoral boundaries and entire
+timezones. For that reason, when forward-completing relations, it is not
+recommended to use backward completion.
From 48973e74f4cc02f92bae7cc4dfa220b52a191585 Mon Sep 17 00:00:00 2001
From: Sarah Hoffmann
Date: Wed, 24 Dec 2025 09:42:19 +0100
Subject: [PATCH 5/6] docs: update installation instructions

---
 docs/index.md | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/docs/index.md b/docs/index.md
index 1009631..68e7d77 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -25,27 +25,31 @@ the following additional dependencies need to be available:
  * [libosmium](https://github.com/osmcode/libosmium) >= 2.16.0
  * [protozero](https://github.com/mapbox/protozero)
  * [cmake](https://cmake.org/)
- * [Pybind11](https://github.com/pybind/pybind11) >= 2.2
  * [expat](https://libexpat.github.io/)
  * [libz](https://www.zlib.net/)
  * [libbz2](https://www.sourceware.org/bzip2/)
  * [Boost](https://www.boost.org/) variant and iterator >= 1.41
  * [Python Requests](https://docs.python-requests.org/en/master/)
- * Python setuptools
  * a recent C++ compiler (Clang 3.4+, GCC 4.8+)
+The following additional dependencies are automatically installed as
part
+of the build process:
+
+ * [scikit-build-core](https://scikit-build-core.readthedocs.io/en/latest/)
+ * [Pybind11](https://github.com/pybind/pybind11)
+
 On Debian/Ubuntu-like systems, the following command installs all required packages:
     sudo apt-get install python3-dev build-essential cmake libboost-dev \
         libexpat1-dev zlib1g-dev libbz2-dev
-libosmium, protozero and pybind11 are shipped with the source wheel. When
-building from source, you need to download the source code and put it
-in the subdirectory 'contrib'. Alternatively, if you want to put the sources
-somewhere else, point pyosmium to the source code location by setting the
-CMake variables `LIBOSMIUM_PREFIX`, `PROTOZERO_PREFIX` and
-`PYBIND11_PREFIX` respectively.
+Compatible versions of libosmium and protozero are shipped with the source
+wheel. When building from source, you need to download the source code of these
+two libraries and put it in the subdirectory 'contrib'. Alternatively,
+if you already have the sources somewhere else,
+point pyosmium to the source code location by setting the
+CMake variables `Libosmium_ROOT` and `Protozero_ROOT`.
 To compile and install the bindings, run
From 283e41c7ba311273277e8f8973b7dcf4c622f521 Mon Sep 17 00:00:00 2001
From: Sarah Hoffmann
Date: Wed, 24 Dec 2025 10:53:50 +0100
Subject: [PATCH 6/6] docs: complete paragraph on meta information

---
 docs/user_manual/01-First-Steps.md | 8 +++++--
 docs/user_manual/02-Extracting-Object-Data.md | 23 ++++++++++++++++---
 2 files changed, 26 insertions(+), 5 deletions(-)

diff --git a/docs/user_manual/01-First-Steps.md b/docs/user_manual/01-First-Steps.md
index 284ebb0..29ae9f2 100644
--- a/docs/user_manual/01-First-Steps.md
+++ b/docs/user_manual/01-First-Steps.md
@@ -140,7 +140,7 @@ out about the tags. It is also always useful to consult different keys and value in actual use. Tags are common to all OSM objects. After that there are three kinds of
- objects in OSM: nodes, ways and relations.
+object types in OSM: nodes, ways and relations.
 ### Nodes
@@ -187,7 +187,7 @@ backward references when talking about the dependencies between objects:
 * A __forward reference__ means that an object is referenced to by another. Nodes appear in ways. Ways appear in relations. And a node may even have
-  an indirect forward reference to a relation through a way it appear in.
+  an indirect forward reference to a relation through a way it appears in.
   Forward references are important when tracking changes. When the location of a node changes, then all its forward references have to be reevaluated.
@@ -198,6 +198,10 @@ backward references when talking about the dependencies between objects:
   to follow the backward references for ways and relations until we reach the nodes.
+Closely related to backward references is the concept of __reference
+completeness__. A dataset or file is considered reference-complete when
+all backward references can be resolved.
+
 ## Order in OSM files
 OSM files usually follow a sorting convention to make life easier for
diff --git a/docs/user_manual/02-Extracting-Object-Data.md b/docs/user_manual/02-Extracting-Object-Data.md
index 4140cb2..b4b388d 100644
--- a/docs/user_manual/02-Extracting-Object-Data.md
+++ b/docs/user_manual/02-Extracting-Object-Data.md
@@ -14,7 +14,8 @@ Finally, there is a type for changesets, which contains information about edits in the OSM database. It can only appear in special changeset files and explained in more detail [below](#changeset).
-The FileProcessor may return any of these objects, when iterating over a file.
+When iterating over a file, the FileProcessor may return any of these
+objects.
 Therefore, a script will usually first need to determine the type of object received. There are a couple of ways to do this.
@@ -83,7 +84,7 @@ You can simply test for this object type:
 ## Reading object tags
 Every object has a list of properties, the tags.
They can be accessed through
-the `tags` property, which provides a simple dictionary-like view of the tags.
+the `tags` property. It provides a simple dictionary-like view of the tags.
 You can use the bracket notation to access a specific tag or use the more explicit `get()` function. Just like for Python dictionaries, an access by bracket raises a `ValueError` when the key you are looking for does not exist,
@@ -140,7 +141,23 @@ list into a Python dictionary:
 ## Other common meta information
 Next to the tags, every OSM object also carries some meta information
-describing its ID, version and information regarding the editor.
+that can all be accessed through read-only properties.
+
+The most important meta information is the object's ID in the `id` property.
+This is the ID used when objects reference each other.
+
+The other meta fields contain information about when and by whom the object was edited.
+The following table gives a quick overview of these fields:
+
+| Property | Description |
+|-----------|--------------------------|
+| version | Version of the object. A newly created object starts with version 1. |
+| deleted | A boolean property stating if the object should be used or ignored. Only relevant for [change](08-Working-With-Change-Files.md) and [history](09-Working-With-History-Files.md) files. |
+| changeset | The ID of the changeset this object was created with. A changeset contains a set of edits that have been uploaded by an editor in a single session. |
+| timestamp | UTC time at which this version of the object was created, or more precisely, added to the database. |
+| uid | The ID of the user who created this version of the object. User IDs are unique and permanent. |
+| user | The name of the user who created this version of the object. This is the name the user had when the object was created. User names may be changed over time. The same name in different objects doesn't necessarily reference the same user. |
+
 ## Properties of OSM object types