Geographic fixes #34

@jameshadfield

Please update this issue as updates are made / we notice things

Updated after #39, but the changes here aren't only due to that: geographic info on the PPX side has also improved.


There are a few aspects to this:

  • [ingest] Adding missing (geographic) metadata via annotations_geo.tsv
  • [ingest] Fixing typos / formatting via geolocation_rules.tsv
  • [ingest] Fixing incorrect metadata
  • [phylo] Adding missing lat/longs. These are shown as warnings when running the export rule
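
For the last point, one rough way to list missing lat/longs before running the export rule is to compare the metadata against the lat/long file. This is only a sketch: it assumes a headerless four-column TSV (resolution, name, latitude, longitude) and a metadata TSV with country and division columns. The function names, the case-insensitive matching and the example rows (including the coordinates) are invented for illustration.

```python
import csv
import io


def load_lat_longs(handle):
    """Collect (resolution, name) pairs from an assumed headerless TSV:
    resolution, name, latitude, longitude."""
    known = set()
    for row in csv.reader(handle, delimiter="\t"):
        # Skip comments and malformed or short lines.
        if len(row) < 4 or row[0].startswith("#"):
            continue
        known.add((row[0].strip().lower(), row[1].strip().lower()))
    return known


def missing_lat_longs(metadata_handle, known, resolutions=("country", "division")):
    """Return sorted (resolution, value) pairs in the metadata with no lat/long."""
    missing = set()
    for row in csv.DictReader(metadata_handle, delimiter="\t"):
        for resolution in resolutions:
            value = (row.get(resolution) or "").strip()
            # Empty values are a separate problem (missing metadata), so skip them.
            if value and (resolution, value.lower()) not in known:
                missing.add((resolution, value))
    return sorted(missing)


# Made-up inputs, purely illustrative (coordinates are placeholders).
LAT_LONGS = "country\tUganda\t1.0\t32.0\n"
METADATA = (
    "accession\tcountry\tdivision\n"
    "X1\tUganda\tExampleDivision\n"
    "X2\tGabon\t\n"
)

known = load_lat_longs(io.StringIO(LAT_LONGS))
for resolution, value in missing_lat_longs(io.StringIO(METADATA), known):
    print(f"missing lat/long: {resolution}\t{value}")
```

The warnings from the export rule remain the authoritative list; a check like this only makes it quicker to see which entries still need adding.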

Missing geo metadata

Ideally we'd get complete country & division for all samples. Within ingest/, running ./scripts/summarise_geography.py --m1 data/metadata_geo_improvements.tsv will summarise the geographic combinations present, which makes it easy(ish) to see missing countries, divisions etc. Quite a few of the missing values are from sequence fragments, so filtering to sequences of at least 18kb (as we do in the phylo workflows) makes our task easier:

augur filter --sequences results/sequences.fasta --metadata data/metadata_geo_improvements.tsv --metadata-id-columns accession --min-length 18000 --output-metadata data/check-geo.tsv
./scripts/summarise_geography.py --m1 data/check-geo.tsv

Here's a summary of that output, focusing on combinations which are missing division and sorted by count (no strains are missing country any more):

num country division location
775 Democratic Republic of the Congo
454 Sierra Leone
99 Gabon
16 Liberia
10 Guinea
4 Uganda
2 Mali
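
For anyone wanting to reproduce a table like this without the script, here is a minimal sketch of the counting logic. It is not the code of summarise_geography.py: it assumes the metadata TSV has country, division and location columns, and the function names and example rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def summarise_geography(handle, fields=("country", "division", "location")):
    """Count how often each combination of geographic fields occurs."""
    counts = Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        combo = tuple((row.get(field) or "").strip() for field in fields)
        counts[combo] += 1
    return counts


def missing_division(counts):
    """Keep combinations with a country but no division, largest count first."""
    rows = [(n, combo) for combo, n in counts.items() if combo[0] and not combo[1]]
    return sorted(rows, key=lambda item: -item[0])


# Made-up rows, purely illustrative (not real accessions or divisions).
EXAMPLE_TSV = (
    "accession\tcountry\tdivision\tlocation\n"
    "X1\tGabon\t\t\n"
    "X2\tGabon\t\t\n"
    "X3\tUganda\t\t\n"
    "X4\tGuinea\tExampleDivision\t\n"
)

counts = summarise_geography(io.StringIO(EXAMPLE_TSV))
print("num\tcountry\tdivision\tlocation")
for n, combo in missing_division(counts):
    print(n, *combo, sep="\t")
```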

Incorrect geo metadata

(From #24): The 2 Kelle samples (PP_000LEP0, PP_000LEQY) have a country of DRC, but Kelle is in the Republic of the Congo. This needs investigating.
