Skip to content

Conversation

@huddlej
Copy link
Contributor

@huddlej huddlej commented Jan 25, 2022

Description of proposed changes

Removes older references from the list of strains to force include in
analyses. We use the single Wuhan/Hu-1/2019 strains as the reference for
alignment and time tree rooting, so the other two strains are not
required. Importantly, these additional strains get used in proximity
calculations with priority-based subsampling leading to the unexpected
inclusion of contextual strains that look like these redundant root
sequences. We try to avoid this problem in the proximity calculations by
defining a list of strains to ignore
, but this list only includes
the strain used to root the time tree. Rather than updating this list to
include more strains, we can just remove the strains from the include
file.

Testing

  • Tested by CI
  • Tested by Cassia

Removes older references from the list of strains to force include in
analyses. We use the single Wuhan/Hu-1/2019 strains as the reference for
alignment and time tree rooting, so the other two strains are not
required. Importantly, these additional strains get used in proximity
calculations with priority-based subsampling leading to the unexpected
inclusion of contextual strains that look like these redundant root
sequences. We try to avoid this problem in the proximity calculations by
defining a list of strains to ignore [1], but this list only includes
the strain used to root the time tree. Rather than updating this list to
include more strains, we can just remove the strains from the include
file.

[1] https://github.com/nextstrain/ncov/blob/51ac8143761057c14de3841e46b82e08f7fef4b6/workflow/snakemake_rules/main_workflow.smk#L372
@huddlej huddlej added the bug Something isn't working label Jan 25, 2022
@huddlej huddlej self-assigned this Jan 25, 2022
@huddlej huddlej requested a review from cassiawag January 25, 2022 19:26
Defines a list of strains to ignore, including the two reference
sequences we use for GISAID and GenBank builds and passes this list to
the `ignore_seqs` parameter of the proximity calculations instead of the
current build's root sequence alone.
Copy link
Collaborator

@cassiawag cassiawag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this out on ncov-kenya, and the proximity sequences & priority scores were as expected. Thanks @huddlej!

@huddlej huddlej requested a review from rneher January 26, 2022 00:22
@huddlej
Copy link
Contributor Author

huddlej commented Jan 26, 2022

@rneher Does this look ok to you, too? This bit of the workflow is just complicated enough that I can imagine I'm missing something with this change...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

No open projects
Status: In Review

Development

Successfully merging this pull request may close these issues.

3 participants