Description
In the file pyxlma/lmalib/io/read.py, the lmafile class is initialized such that the code in L421-424 removes redundant info:
# Drop the station name column that has a redundant station letter code
# as part of the name and join on station letter code.
station_combo = stations.set_index('ID').drop(columns=['Name']).join(
    overview.set_index('ID'))
I think there is an issue with how the two dataframes (stations and overview) are joined, as explained below. It results in a KeyError when reading some LMA files. For example, this call:
lma_data, starttime = read.dataset(filename)
throws the following error:
File /t1/sharma/xlma-python/pyxlma/lmalib/io/read.py:498, in lmafile.readfile(self)
495 lmad.insert(8,col_names[index],
496 (self.mask_ints>>index)%2)
497 # Count the number of stations contributing and put in a new column
--> 498 lmad.insert(8,'Station Count',lmad[col_names].sum(axis=1).astype('uint8'))
499 self.station_contrib_cols = col_names
501 # Version for using only station symbols. Not as robust.
502 # for index,items in enumerate(self.maskorder[::-1]):
503 # lmad.insert(8,items,(mask_to_int(lmad["mask"])>>index)%2)
504 # # Count the number of stations contributing and put in a new column
505 # lmad.insert(8,'Station Count',lmad[list(self.maskorder)].sum(axis=1))
File ~/anaconda3/envs/env/lib/python3.12/site-packages/pandas/core/frame.py:4108, in DataFrame.__getitem__(self, key)
4106 if is_iterator(key):
4107 key = list(key)
-> 4108 indexer = self.columns._get_indexer_strict(key, "columns")[1]
4110 # take() does not accept boolean indexers
4111 if getattr(indexer, "dtype", None) == bool:
File ~/anaconda3/envs/env/lib/python3.12/site-packages/pandas/core/indexes/base.py:6200, in Index._get_indexer_strict(self, key, axis_name)
6197 else:
6198 keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> 6200 self._raise_if_missing(keyarr, indexer, axis_name)
6202 keyarr = self.take(indexer)
6203 if isinstance(key, Index):
6204 # GH 42790 - Preserve name from an Index
File ~/anaconda3/envs/env/lib/python3.12/site-packages/pandas/core/indexes/base.py:6252, in Index._raise_if_missing(self, key, indexer, axis_name)
6249 raise KeyError(f"None of [{key}] are in the [{axis_name}]")
6251 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
-> 6252 raise KeyError(f"{not_found} not in index")
KeyError: "['OKC', 'FAA'] not in index"
This occurs because the dataframes are joined on the station ID alone (the names having been dropped from the stations dataframe). When more than one station shares the same ID, the join key is no longer unique, so pandas matches every row with that ID on one side to every row with that ID on the other, which duplicates rows.
Here's how the station_combo variable looks with the current code. Notice the duplicate rows containing OKC. The resulting dataframe contains 21 rows instead of the expected 19 (one per station).
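The duplication can be reproduced on a toy example. This is only a minimal sketch, not the actual file contents: the tables below are made-up stand-ins for stations and overview, and the shared ID 'X' and the Lat and Sources columns are assumptions chosen for illustration. Two stations sharing one ID turn three stations into five rows, in the same way that 19 stations would become 21 if a single ID were shared by two of them (17 + 2 x 2).

```python
import pandas as pd

# Hypothetical stand-ins for the two tables parsed from the LMA header.
# The shared ID 'X' is invented for this example.
stations = pd.DataFrame({
    'ID': ['A', 'X', 'X'],
    'Name': ['Alpha', 'OKC', 'FAA'],
    'Lat': [35.0, 35.1, 35.2],
})
overview = pd.DataFrame({
    'ID': ['A', 'X', 'X'],
    'Name': ['Alpha', 'OKC', 'FAA'],
    'Sources': [10, 20, 30],
})

# Same join as the current code: the key is the ID alone.
dup = stations.set_index('ID').drop(columns=['Name']).join(
    overview.set_index('ID'))

# Each of the two 'X' rows on the left matches both 'X' rows on the right.
print(len(stations), len(dup))
print(dup)
```

Here each station name under the shared ID appears twice, paired with both Lat values, so half of the pairings are wrong.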
An easy fix is to join the two dataframes on both the station ID and the station name (the ID and Name columns), as follows:
station_combo = stations.set_index(['ID','Name']).join(overview.set_index(['ID','Name']))
Here's how the station_combo variable looks with the proposed solution. The resulting dataframe contains exactly 19 stations.
Please review this issue. I will follow up with a PR in a few minutes to resolve it.
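As a quick check of the proposed join, here is a minimal sketch on made-up data (the shared ID 'X' and the Lat and Sources columns are assumptions, not values from a real LMA file). With the composite (ID, Name) key, each station matches only itself and the row count is preserved.

```python
import pandas as pd

# Hypothetical tables; two stations deliberately share the ID 'X'.
stations = pd.DataFrame({
    'ID': ['A', 'X', 'X'],
    'Name': ['Alpha', 'OKC', 'FAA'],
    'Lat': [35.0, 35.1, 35.2],
})
overview = pd.DataFrame({
    'ID': ['A', 'X', 'X'],
    'Name': ['Alpha', 'OKC', 'FAA'],
    'Sources': [10, 20, 30],
})

# Proposed join: key on both ID and Name so the key is unique per station.
fixed = stations.set_index(['ID', 'Name']).join(
    overview.set_index(['ID', 'Name']))

# One row per station, and the index can be checked for uniqueness.
print(len(fixed), fixed.index.is_unique)
print(fixed)
```

Note that the result is indexed by (ID, Name) rather than by ID with Name as a column, so any later code that reads station_combo['Name'] would need a reset_index(level='Name') first; whether that applies to read.py is not something this sketch verifies.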
Thanks!