Error while updating station info in `__init__` function in `lmafile` class

In the file `pyxlma/lmalib/io/read.py`, the lmafile class is initialized such that the code in L421-424 removes redundant info:

```
# Drop the station name column that has a redundant station letter code
# as part of the name and join on station letter code.
station_combo = stations.set_index('ID').drop(columns=['Name']).join(
    overview.set_index('ID'))
```
I think there's an issue with the way the two dataframes (`stations` and `overview`) are joined, as explained below. It results in a `KeyError` when reading some LMA files. For example, here is what I tried:

`lma_data, starttime = read.dataset(filename)`, which throws the following error:

```
File [/t1/sharma/xlma-python/pyxlma/lmalib/io/read.py:498](http://127.0.0.1:9015/lab/tree/env/notebooks/xlma-python/pyxlma/lmalib/io/read.py#line=497), in lmafile.readfile(self)
    495     lmad.insert(8,col_names[index],
    496                 (self.mask_ints>>index)%2)
    497 # Count the number of stations contributing and put in a new column
--> 498 lmad.insert(8,'Station Count',lmad[col_names].sum(axis=1).astype('uint8'))
    499 self.station_contrib_cols = col_names
    501 # Version for using only station symbols. Not as robust.
    502 # for index,items in enumerate(self.maskorder[::-1]):
    503 #     lmad.insert(8,items,(mask_to_int(lmad["mask"])>>index)%2)
    504 # # Count the number of stations contributing and put in a new column
    505 # lmad.insert(8,'Station Count',lmad[list(self.maskorder)].sum(axis=1))

File [~/anaconda3/envs/env/lib/python3.12/site-packages/pandas/core/frame.py:4108](http://127.0.0.1:9015/lab/tree/env/notebooks/~/anaconda3/envs/env/lib/python3.12/site-packages/pandas/core/frame.py#line=4107), in DataFrame.__getitem__(self, key)
   4106     if is_iterator(key):
   4107         key = list(key)
-> 4108     indexer = self.columns._get_indexer_strict(key, "columns")[1]
   4110 # take() does not accept boolean indexers
   4111 if getattr(indexer, "dtype", None) == bool:

File [~/anaconda3/envs/env/lib/python3.12/site-packages/pandas/core/indexes/base.py:6200](http://127.0.0.1:9015/lab/tree/env/notebooks/~/anaconda3/envs/env/lib/python3.12/site-packages/pandas/core/indexes/base.py#line=6199), in Index._get_indexer_strict(self, key, axis_name)
   6197 else:
   6198     keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> 6200 self._raise_if_missing(keyarr, indexer, axis_name)
   6202 keyarr = self.take(indexer)
   6203 if isinstance(key, Index):
   6204     # GH 42790 - Preserve name from an Index

File [~/anaconda3/envs/env/lib/python3.12/site-packages/pandas/core/indexes/base.py:6252](http://127.0.0.1:9015/lab/tree/env/notebooks/~/anaconda3/envs/env/lib/python3.12/site-packages/pandas/core/indexes/base.py#line=6251), in Index._raise_if_missing(self, key, indexer, axis_name)
   6249     raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   6251 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
-> 6252 raise KeyError(f"{not_found} not in index")

KeyError: "['OKC', 'FAA'] not in index"
```

This occurs because the dataframes are joined on the station ID alone (with the names dropped from the `stations` dataframe). When more than one station shares the same ID, the join key is no longer unique, so pandas pairs every matching row on the left with every matching row on the right and the result contains extra, duplicated rows.
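The duplication can be reproduced outside of `read.py` with a small sketch. The tables below are hypothetical toy data (the station IDs, names, and the `Lat` and `Sources` columns are made up for illustration, not taken from a real LMA file); only the join pattern mirrors the current code.

```python
import pandas as pd

# Hypothetical toy tables: stations "Bravo" and "Beta" share the ID "B".
stations = pd.DataFrame({
    "ID": ["A", "B", "B"],
    "Name": ["Alpha", "Bravo", "Beta"],
    "Lat": [35.0, 35.1, 35.2],
})
overview = pd.DataFrame({
    "ID": ["A", "B", "B"],
    "Name": ["Alpha", "Bravo", "Beta"],
    "Sources": [10, 20, 30],
})

# Same pattern as the current code: drop Name on the left, join on ID only.
by_id = stations.set_index("ID").drop(columns=["Name"]).join(
    overview.set_index("ID"))

# Each of the two "B" rows on the left matches both "B" rows on the right,
# so the result has 1 + 2 * 2 = 5 rows instead of 3.
print(by_id)
print(len(by_id))
```

With a single shared ID among three stations, the join returns five rows. The same arithmetic would be consistent with 19 stations turning into 21 rows if one pair of stations shares an ID.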

Here's how the `station_combo` variable looks with the current code. Notice the duplicate rows containing OKC. The resulting dataframe contains 21 rows instead of the expected 19, one per station.

<img width="1000" alt="Screenshot 2024-08-25 at 5 27 03 PM" src="https://github.com/user-attachments/assets/dc1c1913-5bae-49ac-b1aa-1e47954359df">

An easy fix is to join the two dataframes on both the station ID and the station name (the `ID` and `Name` columns), as follows:

`station_combo = stations.set_index(['ID','Name']).join(overview.set_index(['ID','Name']))`

Here's how the `station_combo` variable looks with the proposed solution. The resulting dataframe contains exactly 19 stations.

<img width="997" alt="Screenshot 2024-08-25 at 5 29 45 PM" src="https://github.com/user-attachments/assets/6d51b1b3-ff32-49b5-bcac-19dd8ae332d9">
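As a rough check of the proposed join, here is a minimal sketch on hypothetical toy data (the IDs, names, and the `Lat` and `Sources` columns are invented for illustration). It assumes that each (`ID`, `Name`) pair is unique in both tables, which is what makes the two-column key safe.

```python
import pandas as pd

# Hypothetical toy tables: stations "Bravo" and "Beta" share the ID "B".
stations = pd.DataFrame({
    "ID": ["A", "B", "B"],
    "Name": ["Alpha", "Bravo", "Beta"],
    "Lat": [35.0, 35.1, 35.2],
})
overview = pd.DataFrame({
    "ID": ["A", "B", "B"],
    "Name": ["Alpha", "Bravo", "Beta"],
    "Sources": [10, 20, 30],
})

# Proposed pattern: keep Name and join on the (ID, Name) pair.
key = ["ID", "Name"]
station_combo = stations.set_index(key).join(overview.set_index(key))

# The composite key is unique, so every station appears exactly once.
print(station_combo)
print(len(station_combo), station_combo.index.is_unique)
```

Note that the result is indexed by both `ID` and `Name`, so any later code that expects `ID` alone as the index would need a `reset_index(level="Name")` or similar adjustment.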


Please review this issue. I will follow up with a PR in a few minutes to resolve it.

Thanks!
