Skip to content

Conversation

@victorlin
Copy link
Member

@victorlin victorlin commented Oct 2, 2025

Description of proposed changes

augur subsample is built to replace the simple augur filter call with a command that supports more configuration options.

Note that this is a breaking change and the old configuration will no longer work.

Related issue(s)

Closes #44

Checklist

  • Merge base PR
  • Checks pass
  • Update changelog no changelog, might be good to start one

augur subsample is built to replace the simple augur filter call with a
command that supports more configuration options.

Note that this is a breaking change and the old configuration will no
longer work.
@victorlin victorlin self-assigned this Oct 2, 2025
@victorlin
Copy link
Member Author

Locally, I did a test run with the full input and confirmed the output messages are the same as before with an extra validation line:

Validating schema of 'results/run_config.yaml'...
Sampling at 4000 per group.
13741 strains were dropped during filtering
        4 were dropped because they were in /nextstrain/build/phylogenetic/defaults/global/exclude.txt
        13279 were dropped because they were shorter than the minimum length of 8000bp when only counting standard nucleotide characters A, C, G, or T (case-insensitive)
        74 were dropped during grouping due to ambiguous year information
        384 were dropped during grouping due to ambiguous month information
        2 were added back because they were in /nextstrain/build/phylogenetic/defaults/global/include.txt
        2 were dropped because of subsampling criteria
885 strains passed all filters

@victorlin
Copy link
Member Author

Also confirmed that nextstrain run works:

$ nextstrain setup mumps@victorlin/use-augur-subsample
$ nextstrain run mumps@victorlin/use-augur-subsample phylogenetic out

…
Validating schema of 'results/run_config.yaml'...
Sampling at 4000 per group.
13877 strains were dropped during filtering
        45 were dropped because they were in /nextstrain/pathogen/phylogenetic/defaults/north-america/exclude.txt
        7165 were filtered out by the query: "region=='North America' & (MuV_genotype=='G')"
        6343 were dropped because they were shorter than the minimum length of 8000bp when only counting standard nucleotide characters A, C, G, or T (case-insensitive)
        1 was dropped during grouping due to ambiguous year information
        323 were dropped during grouping due to ambiguous month information
        0 were dropped because of subsampling criteria
749 strains passed all filters
…

Base automatically changed from victorlin/prep-augur-subsample to main October 2, 2025 20:44
Copy link
Contributor

@j23414 j23414 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving for:

  • concordance with WNV sampling (also being used by WA-DOH)
  • these changes do not break anything in the existing pipeline
  • the mumps pipeline is simple enough (only two) that the sampling config is not unreasonably long or tedious to edit

Local testing of the following does indeed work:

mkdir phylo-out
nextstrain run mumps@victorlin/use-augur-subsample phylogenetic phylo-out

north-america:
samples:
all:
min_length: 8000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[non-blocking commentary]
Okay, I can see how this approach makes every filter parameter explicit in the config.yaml. I'm still not a fan of the duplication (e.g., identical settings for min_length, group_by, max_sequences, etc.) under each build scope (north-america and global). But since there are only two builds here (compared to dengue with 10, norovirus with 14, and flu potentially with 3×8), I can live with it for mumps.

I could see an improvement by using something like YAML anchors at some point in the future:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could imagine someone adding "SH" gene builds (would double the builds to 4) but as of now, we only do full genome.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After using YAML anchors in nextstrain/rsv#103, I don't think it's great and am seeking alternatives in nextstrain/public#27.

metadata = "data/metadata.tsv",
exclude = resolve_config_path(config["filter"]["exclude"]),
include = resolve_config_path(config["filter"]["include"]),
config = "results/run_config.yaml",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[non-blocking commentary]
Will this no longer detect changes to the exclude.txt or include.txt files?

Perhaps this is a moot point — I’m not sure how often people simply update the exclude list and expect to rerun from cache, versus just forcing a rerun each time anyway.

Copy link
Member Author

@victorlin victorlin Oct 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it will no longer detect changes, this is a good point. I mentioned this in nextstrain/pathogen-repo-guide#98, and I think the functionality could potentially be brought back with a workflow-level helper function that dynamically adds inputs.

@j23414
Copy link
Contributor

j23414 commented Oct 3, 2025

[non-blocking optional task]

Could also update subsampling in the nextclade workflow for concordance to phylogenetic

filter:
exclude: "{build}/exclude.txt"
include: "{build}/include.txt"
sh: '--exclude-all'
genome: --subsample-max-sequences 300 --min-date 1950 --group-by MuV_genotype --min-length 12000 --exclude-where clade_membership=''

@victorlin
Copy link
Member Author

I'll wait for a decision in nextstrain/public#27 before continuing here.

@victorlin victorlin marked this pull request as draft October 7, 2025 02:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Use augur subsample

3 participants