::: {align="center"}
parsEO is a Python toolkit that parses existing Earth Observation filenames into structured metadata and assembles new filenames from validated fields. Filename rules live in JSON schemas, turning the repository into a single source of truth for product nomenclature. :::
- Bidirectional workflows – convert filenames to Python dictionaries and rebuild filenames from the same fields.
- Schema-driven – JSON schemas define every product token, validation rule, and STAC link. Drop in a new schema file and it is discovered automatically.
- Extensible – contributions focus on data, not glue code. Adding support for a new mission only requires a schema.
- Sentinel missions – S1, S2, S3, S4, S5P, S6
- Landsat – LT04, LT05, LE07, LC08, LC09
- NASA MODIS – Terra and Aqua MODIS products
- EUMETSAT missions – MTG, Metop
- Copernicus Land Monitoring Service (CLMS)
- Corine Land Cover (CLC) and CLC+ Raster
- European Ground Motion Service (EGMS) Level 2 basic/calibrated products, Level 3 velocity grid, and GNSS model
- Urban Atlas Land Cover / Land Use
- High Resolution Vegetation Phenology & Productivity (HR-VPP): Seasonal Trajectories (PPI) and Vegetation Indices (FAPAR, LAI, NDVI, PPI, FCOVER, DMP)
- High Resolution Water & Snow / Ice (HR-WSI): CC, FSC, GFSC, ICD, SWS, WDS, WIC, WIC-COMB, SP_S2, SP_COMB
- High Resolution Layers (HRL): Forest Type, Grassland, Imperviousness, Non-Vegetated Land Cover Characteristics, Tree Cover Density, Water & Wetness, Small Woody Features
pip install parseoFor development installs:
git clone https://github.com/MatMatt/parsEO.git
cd parsEO
pip install -e .Confirm the CLI is available:
parseo --versionfrom parseo import parse_auto
name = "S2B_MSIL2A_20241123T224759_N0511_R101_T03VUL_20241123T230829.SAFE"
result = parse_auto(name)
print(result.valid) # True
print(result.fields) # structured dict of extracted fieldsAnother example using a MODIS filename:
from parseo import parse_auto
name = "MOD09GA.A2021123.h18v04.006.2021132234506.hdf"
result = parse_auto(name)
print(result.fields["platform"]) # MOD
print(result.fields["product"]) # 09
print(result.fields["variant"]) # GAThe MODIS schema advertises the Processing, Sat, Raster, and Electro-Optical STAC extensions, enabling automated STAC metadata generation (processing, sat, raster, eo).
from parseo import assemble, assemble_auto
fields = {
"platform": "S2B",
"instrument": "MSI",
"processing_level": "L2A",
"sensing_datetime": "20241123T224759",
"processing_baseline": "N0511",
"relative_orbit": "R101",
"tile_id": "T03VUL", # MGRS tile (TxxYYY, e.g. T32TNS)
"generation_datetime": "20241123T230829",
"extension": "SAFE",
}
assemble_auto(fields)
# 'S2B_MSIL2A_20241123T224759_N0511_R101_T03VUL_20241123T230829.SAFE'
assemble(fields, family="S2")
# Uses the default (current) Sentinel-2 schema for quicker resolution.
assemble(fields, family="S2", version="1.0.0")
# Lock both family and schema version explicitly.To force a specific schema file:
from pathlib import Path
from parseo import assemble
schema_path = Path("src/parseo/schemas/copernicus/sentinel/s2/s2_filename_v1_0_0.json")
filename = assemble(fields, schema_path=schema_path)parse_auto reports the schema version and lifecycle status that were used:
from parseo import parse_auto
result = parse_auto("S2B_MSIL2A_20241123T224759_N0511_R101_T03VUL_20241123T230829.SAFE")
print(result.version) # '1.0.0'
print(result.status) # 'current'The same functionality is exposed through the CLI.
# Parse a filename
parseo parse S2B_MSIL2A_20241123T224759_N0511_R101_T03VUL_20241123T230829.SAFE
# Parse a Copernicus Land Monitoring Service (CLMS) filename
parseo parse ST_20240101T123045_S2_E15N45-03035-010m_V100_PPI.tif
# Parse and write the JSON response to a file
parseo parse ST_20240101T123045_S2_E15N45-03035-010m_V100_PPI.tif --output result.json
# Assemble from the saved JSON document. The CLI automatically
# extracts the `fields` entry produced by `parseo parse`.
cat result.json | parseo assemble --family copernicus:clms:hr-vpp:st --fields-json -
# Pipe the parse result directly into assemble
parseo parse ST_20240101T123045_S2_E15N45-03035-010m_V100_PPI.tif \
| parseo assemble --family copernicus:clms:hr-vpp:st
# Assemble the same CLMS filename from key=value pairs
parseo assemble
product=ST
timestamp=20240101T123045
sensor=S2
tile_id=E15N45
epsg_code=03035
resolution=010m
version=V100
variable=PPI
extension=tifList every schema that ships with parsEO, including its fully qualified schema_id, semantic version, lifecycle status, and file location:
parseo list-schemasThe command prints a table summarizing all discovered schemas, making it easy to confirm which versions are available before parsing or assembling filenames. The first column shows the complete namespace (for example copernicus:clms:hrl:vlcc), so you can see exactly where a product sits inside the Copernicus hierarchy.
List only the schemas that are marked as current with the built-in filter (works on every platform):
parseo list-schemas --status currentTo inspect a single family in the summary table, provide the short family name explicitly. You can also pass a namespace prefix to see every schema beneath it:
parseo list-schemas --family S2
parseo list-schemas --family copernicus
parseo list-schemas --family copernicus:clms:hrlFor the full schema metadata (fields, examples, etc.), use schema-info:
parseo schema-info S2The JSON response includes the schema's semantic version and lifecycle status so you can immediately tell which release you are viewing. To inspect an archived version, provide the version number explicitly:
parseo schema-info --version 1.0.0 CLCWhen multiple schema versions are present, parsEO chooses the one whose status is "current". If none are marked current, the highest schema_version wins. You can inspect historical schemas with parseo schema-info --version <x.y.z> <family> and you can always pin a schema by passing schema_path to assemble or parse.
Adding a schema requires only a JSON document under src/parseo/schemas/. Start from an existing product schema or the skeleton in template/.
-
Create the product directory –
src/parseo/schemas/<family>/<mission>/<product>/(only the family level is mandatory). -
Write the versioned schema file –
<product>_filename_vX_Y_Z.jsonwith required metadata (schema_id,schema_version,status, optionalstac_version,stac_extensions, anddescription). -
Maintain versions – mark the latest schema as
"current"and move older ones to"deprecated"(or similar). parsEO uses these flags to pick defaults.
fields is a container object to define filename tokens and translation mechanisms from/to STAC. Each property combines JSON Schema keywords (type, pattern, enum, …) and optional documentation. The property name is the variable name used by the template container and unless a stac_map is applied also what is retrieved by parseo parse or fed into parseo assemble.
"fields": {
"property-1": {
"type": "string",
"enum": ["A","B"],
"description": "descripton of property 1"
},
"property-2": {
"type": "string",
"pattern": "^(\d{8}T\d{6})$",
"description": "descripton of property 2"
},
...
}The keyword type is currently set to "string" by default, and used as best practice.
- Use
enumfor short, controlled vocabularies such as file extensions or processing modes (["A","B"]). It keeps validation strict and self-documenting. - Use
patternfor structured tokens like timestamps (^\d{8}T\d{6}$), version identifiers (^V\d{3}$), or grid identifiers (^h\d{2}v\d{2}$).
Use oneOf to express mutually exclusive validation branches for a single field. Each branch can provide its own JSON Schema keywords, enabling concise modelling of tokens that admit multiple formats. For example, a variant token may allow either a fixed set of legacy values or a future-proof pattern:
"variant": {
"type": "string",
"oneOf": [
{"enum": ["A", "B", "C"]},
{"pattern": "^X\d{2}$"},
{"pattern": "^z\d{4}$"}
],
"description": "Legacy letters or experimental codes"
}When parseo parse runs, the value is checked against each branch in order until one succeeds. parseo assemble accepts any value satisfying at least one branch, allowing schemas to stay expressive without duplicating fields.
Functionality to connect filename tokens to STAC properties in case of incompatibility. In the following example parseo parse matches the filename field prefix but returns the STAC compliant platform and instrument. parseo assemble must be provided with platform and instrument and converts it to the correct prefix as defined by the filename template.
"prefix": {
"type": "string",
"pattern": "^(MOD|MYD|MCD)$",
"stac_map": {
"MOD": {
"platform": "Terra",
"instrument": "MODIS"
},
"MYD": {
"platform": "Aqua",
"instrument": "MODIS"
},
"MCD": {
"platform": "Combined",
"instrument": "MODIS"
}
}
}The next example uses capture groups ($1, $2, …). While tile is matched in the filename parseo parse returns tile, horizontal_grid, and vertical_grid (and parseo assemble requires).
"tile": {
"type": "string",
"pattern": "^(h\d{2})(v\d{2})$",
"stac_map": {
"tile": "$0",
"horizontal_grid": "$1",
"vertical_grid": "$2"
}
}A filename is assembled/parsed based on the definition put into template container object. E.g.:
"template": "{variable}-C{reference_year}_R{resolution}_{tile_id}{epsg_code}_{version}[.{extension}]"fields are added to the template using the {property name}. The template further decides for the fields are arranged and how they are separated from each other. Further it is possible to ad permanent elements as well, e.g. adding a prefix C to the {reference_year} = C{reference_year} to match e.g. 'C2012'. This can be particularly useful if you are interested in only parts of the field, e.g. by doping EPSG{epsg_code} you fetch only the code itself. This can also be a regex, e.g. ..._(S|C){reference_year}_....
Optional filenames must be encapsulated in []. E.g. [.{extension}]
Populate "examples" with valid filenames covering typical combinations of optional tokens. Test routines will run the examples to assess the proper functioning.
run parseo parse <filename> and parseo assemble --schema <schema_path> to confirm the schema behaves as expected.
- Place new schemas under
src/parseo/schemas/<product_family>/. - Include at least one positive example in each schema file.
- Run the test-suite with
pytest(andruff check .for linting) before opening a pull request. - Submit a pull request describing the new products or fixes.
This project is licensed under the European Union Public Licence v1.2.