Dimensions in the PyGEM netcdf outputs #150

yelizy · 2025-10-25T10:39:42Z

yelizy
Oct 25, 2025
Collaborator

I have a brief question regarding the dimensions in the PyGEM outputs.

In the output files, there are three dimensions defined: glac, time, and year. As I understand it, glac corresponds to the glacier index and should uniquely identify each glacier, right? In that case, the maximum value of this field should match the total number of glaciers in each region. In my regional simulations for Scandinavia, it does not.

I found in the code where it is defined as
self.glac_values = np.array([self.glacier_rgi_table.name]).

In the reference RGI table, the column 'name' does not have a unique name for each glacier, and there are many blank cells. Am I missing something here, or did I misunderstand the intention of defining a glacier index dimension? I would really appreciate your clarification on this.


btobers · 2025-10-25T13:27:27Z

btobers
Oct 25, 2025
Maintainer

Hi @yelizy,

There may be some confusion depending on which output files you are looking at. Are they the output from a single glacier (produced by running run_simulation.py and named with the RGIId in the filename), or the compiled results for your entire region (produced by running postproc_compile_simulations.py and named with the region in the filename)? I ask because the compiled regional output should not have a glac coordinate, but rather an RGIId coordinate, e.g.:

The single-glacier output for a given run will just be the results for one glacier. When you compile all glaciers using the postprocessing script, you should have N glaciers, with the RGIId of each glacier stored along the RGIId coordinate, where N is the number of glaciers in your region (sub-annual data are split into 1000-glacier batches by default). See these relevant lines for reference. Even failed glaciers should be stored in this regionally compiled output; the values of the variable of interest will just be NaNs. Here's a notebook demonstrating the regional aggregation workflow.

Here's an example of an output regional run for Iceland showing that all glaciers are contained within the file:


yelizy · 2025-10-29T10:30:44Z

yelizy
Oct 29, 2025
Collaborator Author

Hi @btobers ,

Thanks so much for your prompt reply. You are right, I didn’t specify which outputs I am looking at.

I was referring to the individual glacier outputs. There, the dimension glac is not uniquely defined for each glacier. Other dimensions (time and year) seem correctly defined. Here are a few examples :

In the postprocessing script, the number of glaciers is carefully checked before creating batches which is great. But in that case :

What is the functionality of the glacier index variable in the outputs? Is it because we cannot use RGIId as a dimension since it is a string field?
Do we use this dimension info anywhere else in the workflow?

It seems to me that when np.array is defined based on glacier names (in my case between 0,.., 213) in the RGI table, it does not match the total number of glaciers (3417). Somehow the array seems to conform to the glacier length when producing outputs.


btobers · 2025-10-29T14:47:00Z

btobers
Oct 29, 2025
Maintainer

Hi @yelizy,

Great questions. The short answer is that this index is not used after storing the simulations, and we should probably modify this structure. Each individual simulation output will only have a single glac index. In fact, we could possibly just remove the glac index from the individual outputs altogether, since there is always only one glacier. If you are trying to access the results from an individual glacier output, you can simply index into the 0th glac, similar to what's done in the various example notebooks (e.g., simple_test.ipynb). When the simulations are then merged by region, the RGIId is stored along the glacier index. @drounce can correct me if I'm wrong, but I believe the reason the individual simulations were originally stored as 2D arrays (e.g., glac x year) was that it made them easier to stack regionally in post-processing.
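To illustrate indexing into the 0th glac, here is a minimal sketch on a synthetic dataset rather than a real PyGEM file. The variable name glac_mass_annual, the years, and the values are placeholders chosen for the example, and the glac value 213 is borrowed from this thread. The point is that positional indexing with isel works whatever value is stored under glac, while label-based sel does not.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for a single-glacier output: one entry along "glac",
# whose stored coordinate value (213 here) is arbitrary.
ds = xr.Dataset(
    data_vars={
        # Placeholder variable name and values, for illustration only.
        "glac_mass_annual": (("glac", "year"), np.array([[3.0e9, 2.9e9, 2.8e9]])),
    },
    coords={"glac": [213], "year": [2000, 2001, 2002]},
)

# Positional indexing: always picks the only glacier in the file.
mass = ds["glac_mass_annual"].isel(glac=0)
print(mass.dims)          # only the "year" dimension remains
print(int(mass["glac"]))  # the stored label is still 213

# Label-based indexing fails unless the label happens to be 0.
try:
    ds.sel(glac=0)
    label_lookup_worked = True
except KeyError:
    label_lookup_worked = False
print(label_lookup_worked)
```

Using isel keeps analysis code independent of whatever ends up stored under glac.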

A bit more detail: the reason the glac value may seem ambiguous has to do with a subtlety in how the RGI glacier table is indexed in the run_simulation script when looping through the list of glaciers in a given run. In run_simulation.py, we index into the RGI glacier table one row at a time. Pandas' default behavior is then to set the 'name' of the resulting series to the index of that row in your main_glac_rgi dataframe. For example, if I do a run for 1.00570 and 1.00571 together:
run_simulation -rgi_glac_number 1.00570 1.00571 ....
My main_glac_rgi dataframe will look like so:

This study is focusing on 2 glaciers in region [1]
   O1Index           RGIId   CenLon  ...  rgino_str  RGIId_float  CenLon_360
0      569  RGI60-01.00570 -145.427  ...   01.00570      1.00570     214.573
1      570  RGI60-01.00571 -145.449  ...   01.00571      1.00571     214.551

What becomes the 'name' key in our resulting series as we loop through each glacier is the index in main_glac_rgi (e.g., 0 for 1.00570 and 1 for 1.00571). These are the values that get stored under the glac coordinate of the simulation output. So if you ran, say, 200+ glaciers as your post above indicates, you will have glac values that span the range of glaciers in your run, but there should always be just one index per output.
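The pandas behavior described above can be reproduced with a small stand-in table. This is a sketch, not the actual run_simulation.py loop: the two-column dataframe is a cut-down copy of the table printed above, and the row selection with iloc is an assumption about how the row is pulled out. The glac_values line mirrors the one quoted from the output module.

```python
import numpy as np
import pandas as pd

# Cut-down stand-in for main_glac_rgi with the default integer index (0, 1).
main_glac_rgi = pd.DataFrame(
    {
        "O1Index": [569, 570],
        "RGIId": ["RGI60-01.00570", "RGI60-01.00571"],
    }
)

glac_values = []
for glac in range(len(main_glac_rgi)):
    # Selecting a single row returns a Series whose .name is the row's index label.
    glacier_rgi_table = main_glac_rgi.iloc[glac, :]
    # Same construction as the line quoted from the output module.
    glac_values.append(np.array([glacier_rgi_table.name]))
    print(glacier_rgi_table["RGIId"], glac_values[-1])
```

Each glacier ends up with a one-element array holding its position in the run's table (0 and 1 here), not its RGIId or its O1Index.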

If you run an entire region, the glac values should correspond to the glacier number in the RGIId minus 1. For instance, if we ran all of Alaska, then the output file for 1.00570 would have glac.values=569. Sorry for the long-winded explanation, but does this make sense?

Again, in summary, the glac value does not matter, as you will only have one in your individual outputs, but looking at the values of glac can certainly be confusing and we should improve this.

2 replies

btobers Oct 29, 2025
Maintainer

And just to add perhaps more confusion to this - the default behavior is to drop the actual "Name" column in the RGI glacier dataset, as specified in the configuration file. So, the name attribute you see in the output.py module is not the same Name as in the originally loaded RGI dataset. The actual RGI glacier name (e.g. "Gulkana Glacier" for RGI60-01.00570) is not currently stored in the output files.

btobers Oct 29, 2025
Maintainer

I think perhaps in pygem_modelsetup.selectglaciersrgitable() we should set the index to be the "RGIId", or perhaps "O1Index". In theory, that value would then be stored as the glac value of each individual output, rather than the integer position of a given glacier in the model's main_glac_rgi dataframe, which is what is currently stored. However, we'd need to explore how xarray would interface with this when storing the netcdf files. I'll raise this as an issue.
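As a rough sketch of what that change would do on the pandas side (this is not a patch to selectglaciersrgitable(), and the dataframe is a two-row stand-in), setting the index before selecting rows changes what ends up in each row's name:

```python
import pandas as pd

# Two-row stand-in for main_glac_rgi.
main_glac_rgi = pd.DataFrame(
    {
        "O1Index": [569, 570],
        "RGIId": ["RGI60-01.00570", "RGI60-01.00571"],
    }
)

# Option 1: index by RGIId, so each row's .name is the RGIId string.
by_rgiid = main_glac_rgi.set_index("RGIId", drop=False)
names_rgiid = [by_rgiid.iloc[i].name for i in range(len(by_rgiid))]

# Option 2: index by O1Index, so each row's .name stays an integer.
by_o1index = main_glac_rgi.set_index("O1Index", drop=False)
names_o1index = [int(by_o1index.iloc[i].name) for i in range(len(by_o1index))]

print(names_rgiid)
print(names_o1index)
```

With RGIId the glac coordinate would hold strings; with O1Index it would stay an integer but no longer depend on which glaciers were included in the run. How either choice behaves when written to netcdf is the open question mentioned above.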

yelizy · 2025-10-29T15:27:13Z

yelizy
Oct 29, 2025
Collaborator Author

Thanks so much for your detailed explanation @btobers. It was important to clarify that "Name" column was already dropped from the original RGI data, and the name field used from the main_glac_rgi dataframe is about the index in the dataframe. In this case, I would still expect my glac.values to range from 0 to 3416, but it goes up to 213. I am confused about this mismatch.

Anyways, if these values are not used in the other parts of the workflow, I believe it should be fine to keep it the way it is.

2 replies

btobers Oct 29, 2025
Maintainer

The reason for that mismatch may be due to using multiprocessing. Did you run your simulations across multiple cores? If so, then the index may only go up to the total number of glaciers//Njobs. Did all your 3416 simulations get exported?

Answer selected by yelizy

yelizy Oct 30, 2025
Collaborator Author

Voila! I didn't think about it, you are right! I used 16 processors and this explains the numbers. Then it would be important to define the dimension values when not running in the parallel mode.
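For reference, the arithmetic behind those numbers can be sketched as follows. This assumes the glacier list is split into near-equal batches and that each batch builds its own table with an index restarting at 0; the split_sizes helper is hypothetical and is not PyGEM's own splitting code.

```python
def split_sizes(n_glaciers, n_jobs):
    """Sizes of near-equal batches (hypothetical helper, not PyGEM code)."""
    base, extra = divmod(n_glaciers, n_jobs)
    # The first `extra` batches take one additional glacier.
    return [base + 1 if job < extra else base for job in range(n_jobs)]


n_glaciers = 3417  # glac was expected to run from 0 to 3416
n_jobs = 16        # number of processors used

sizes = split_sizes(n_glaciers, n_jobs)

# If the index restarts at 0 in every batch, the largest glac value
# in a batch is its size minus one.
largest_glac_value = max(size - 1 for size in sizes)

print(sizes)
print(largest_glac_value)  # 213
```

Under these assumptions the largest glac value across all outputs is 213, which matches what was observed, and the same glac value appears once per batch.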
