Please update this issue as updates are made / we notice things
Updated after #39, but the changes here aren't just because of that: geographic info on the PPX side has also improved.
There are a few aspects to this:
- [ingest] Adding missing (geographic) metadata via annotations_geo.tsv
- [ingest] Fixing typos / formatting via geolocation_rules.tsv
- [ingest] Fixing incorrect metadata
- [phylo] Adding missing lat/longs. These are shown as warnings when running the export rule
Missing geo metadata
Ideally we'd get complete country & division for all samples. Within ingest/, running ./scripts/summarise_geography.py --m1 data/metadata_geo_improvements.tsv will summarise the geographic combinations present, which makes it easy(ish) to see missing countries, divisions etc. Quite a few of the missing values are from sequence fragments, so filtering to a minimum length of 18 kb (as we do in the phylo workflows) makes our task easier:
augur filter --sequences results/sequences.fasta --metadata data/metadata_geo_improvements.tsv --metadata-id-columns accession --min-length 18000 --output-metadata data/check-geo.tsv
./scripts/summarise_geography.py --m1 data/check-geo.tsv
Here's a summary of that output, focusing on combinations that are missing country or division, sorted by count (no strains are missing country any more):
| num | country | division | location |
|---|---|---|---|
| 775 | Democratic Republic of the Congo | | |
| 454 | Sierra Leone | | |
| 99 | Gabon | | |
| 16 | Liberia | | |
| 10 | Guinea | | |
| 4 | Uganda | | |
| 2 | Mali | | |
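For a quick cross-check of a tally like the one above, something along these lines might work. This is only a rough sketch and not the logic of summarise_geography.py: it assumes the metadata TSV has `country`, `division` and `location` columns (as in the table header), and the function name `summarise_missing_division` is made up for illustration.

```python
import csv
from collections import Counter


def summarise_missing_division(metadata_path):
    """Count (country, division, location) combinations with an empty division.

    Assumes a tab-separated file with 'country', 'division' and 'location'
    columns; a column that is absent is treated as an empty value.
    """
    counts = Counter()
    with open(metadata_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            country = (row.get("country") or "").strip()
            division = (row.get("division") or "").strip()
            location = (row.get("location") or "").strip()
            if not division:
                counts[(country, division, location)] += 1
    # Most frequent combinations first, i.e. sorted by count.
    return counts.most_common()
```

Calling `summarise_missing_division("data/check-geo.tsv")` after the `augur filter` step would return a list of `((country, division, location), num)` pairs that can be compared against the table.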
Incorrect geo metadata
(From #24): The 2 Kelle samples (PP_000LEP0, PP_000LEQY) have a country of DRC, but Kelle is in the Republic of the Congo. This needs investigating.
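As a starting point for that investigation, one option is to list every record whose geographic fields mention a given place name, together with its recorded country. The sketch below is an assumption-laden helper, not part of the workflow: the name `records_mentioning` is hypothetical, the `country`, `division` and `location` column names are taken from the table header above, and `accession` is assumed to be the ID column because that is what the `augur filter` command uses.

```python
import csv

# Assumed column names, taken from the summary table header.
GEO_COLUMNS = ("country", "division", "location")


def records_mentioning(metadata_path, place, id_column="accession"):
    """Return (id, country, division, location) for rows mentioning `place`.

    Matching is a case-insensitive substring search over the geographic
    columns, so it will also catch spelling variants that contain `place`.
    """
    needle = place.lower()
    hits = []
    with open(metadata_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            values = [(row.get(col) or "") for col in GEO_COLUMNS]
            if any(needle in value.lower() for value in values):
                hits.append((row.get(id_column) or "", *values))
    return hits
```

For example, `records_mentioning("data/metadata_geo_improvements.tsv", "Kelle")` would show which country is currently attached to each matching record, which makes it easier to decide whether a fix belongs in the annotations or upstream.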