error while updating station info in __init__ function in lmafile class  #48

@gewitterblitz

In the file pyxlma/lmalib/io/read.py, the lmafile class is initialized such that the code in L421-424 removes redundant info:

# Drop the station name column that has a redundant station letter code
# as part of the name and join on station letter code.
station_combo = stations.set_index('ID').drop(columns=['Name']).join(
    overview.set_index('ID'))

I think there's an issue with the way the two dataframes (stations and overview) are being joined together as explained below. This issue results in a KeyError when reading some LMA files. For example, here is what I tried:

lma_data, starttime = read.dataset(filename)

This call throws the following error:

File /t1/sharma/xlma-python/pyxlma/lmalib/io/read.py:498, in lmafile.readfile(self)
    495     lmad.insert(8,col_names[index],
    496                 (self.mask_ints>>index)%2)
    497 # Count the number of stations contributing and put in a new column
--> 498 lmad.insert(8,'Station Count',lmad[col_names].sum(axis=1).astype('uint8'))
    499 self.station_contrib_cols = col_names
    501 # Version for using only station symbols. Not as robust.
    502 # for index,items in enumerate(self.maskorder[::-1]):
    503 #     lmad.insert(8,items,(mask_to_int(lmad["mask"])>>index)%2)
    504 # # Count the number of stations contributing and put in a new column
    505 # lmad.insert(8,'Station Count',lmad[list(self.maskorder)].sum(axis=1))

File ~/anaconda3/envs/env/lib/python3.12/site-packages/pandas/core/frame.py:4108, in DataFrame.__getitem__(self, key)
   4106     if is_iterator(key):
   4107         key = list(key)
-> 4108     indexer = self.columns._get_indexer_strict(key, "columns")[1]
   4110 # take() does not accept boolean indexers
   4111 if getattr(indexer, "dtype", None) == bool:

File ~/anaconda3/envs/env/lib/python3.12/site-packages/pandas/core/indexes/base.py:6200, in Index._get_indexer_strict(self, key, axis_name)
   6197 else:
   6198     keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> 6200 self._raise_if_missing(keyarr, indexer, axis_name)
   6202 keyarr = self.take(indexer)
   6203 if isinstance(key, Index):
   6204     # GH 42790 - Preserve name from an Index

File ~/anaconda3/envs/env/lib/python3.12/site-packages/pandas/core/indexes/base.py:6252, in Index._raise_if_missing(self, key, indexer, axis_name)
   6249     raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   6251 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
-> 6252 raise KeyError(f"{not_found} not in index")

KeyError: "['OKC', 'FAA'] not in index"

This occurs because the dataframes are joined on the station ID alone (with the names dropped from the stations dataframe). When more than one station shares the same ID, pandas cannot tell those stations apart: it matches every row with that ID in stations against every row with that ID in overview, which produces extra, duplicated rows.
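The row duplication can be reproduced outside pyxlma with a minimal sketch. The station names, letter codes, and extra columns below are made up for illustration and are not taken from a real LMA file; only the join pattern mirrors the current code.

```python
import pandas as pd

# Hypothetical toy data: two different stations share the letter code 'O'.
stations = pd.DataFrame({
    "ID": ["A", "O", "O"],
    "Name": ["Alpha", "OKC", "Other"],
    "Lat": [35.0, 35.4, 35.6],
})
overview = pd.DataFrame({
    "ID": ["A", "O", "O"],
    "Name": ["Alpha", "OKC", "Other"],
    "Active": [True, True, False],
})

# Same pattern as the current code: drop Name on the left, join on ID only.
by_id = stations.set_index("ID").drop(columns=["Name"]).join(
    overview.set_index("ID"))

# Each of the two 'O' rows on the left matches both 'O' rows on the right,
# so the result has 1 + 2*2 = 5 rows instead of 3.
print(len(stations), len(by_id))
print(by_id)
```

In this toy case the station named OKC appears twice in the joined result, once for each left-hand row that carries the shared code.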

Here's how the station_combo variable looks with the current code. Notice the duplicate rows containing OKC. The resultant dataframe contains 21 rows instead of the expected 19 stations.

Screenshot 2024-08-25 at 5 27 03 PM

An easy fix is to join the two dataframes on both the station ID and the station name (the ID and Name columns), as follows:

station_combo = stations.set_index(['ID','Name']).join(overview.set_index(['ID','Name']))

Here's how the station_combo variable looks with the proposed solution. The resultant dataframe contains exactly 19 stations.

Screenshot 2024-08-25 at 5 29 45 PM
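As a rough check of the proposed fix, the sketch below joins on both ID and Name using made-up toy data (hypothetical station names and columns, not a real LMA file). It also shows one optional guard, which is an addition and not part of the proposed change: pandas' `merge` accepts `validate="one_to_one"`, which raises `pandas.errors.MergeError` when the merge keys are not unique.

```python
import pandas as pd

# Hypothetical toy data: two different stations share the letter code 'O'.
stations = pd.DataFrame({
    "ID": ["A", "O", "O"],
    "Name": ["Alpha", "OKC", "Other"],
    "Lat": [35.0, 35.4, 35.6],
})
overview = pd.DataFrame({
    "ID": ["A", "O", "O"],
    "Name": ["Alpha", "OKC", "Other"],
    "Active": [True, True, False],
})

# Proposed fix: join on the (ID, Name) pair, which is unique per station here.
station_combo = stations.set_index(["ID", "Name"]).join(
    overview.set_index(["ID", "Name"]))
print(len(station_combo))  # one row per station

# Optional guard (an assumption, not in the proposed change): let pandas
# verify key uniqueness so that a bad join fails loudly.
try:
    stations.drop(columns=["Name"]).merge(
        overview, on="ID", validate="one_to_one")
    id_only_is_unique = True
except pd.errors.MergeError:
    id_only_is_unique = False
print(id_only_is_unique)
```

Joining on both keys only helps if the (ID, Name) pairs match exactly in both dataframes, so a uniqueness check like the one above may be worth keeping while debugging.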

Please review this issue. I will follow up with a PR in a few minutes to resolve it.

Thanks!
