Register workflow files in `nextstrain-pathogen.yaml` #462

joverlee521 · 2025-07-23T22:38:57Z

Description of proposed changes

Allow maintainers to define paths to each workflow's Snakefile and configfile in the nextstrain-pathogen.yaml. Using mpox as an example, this allows the named workflows to use the shared Snakefile and their own custom configfiles without needing to create separate Snakefiles for each workflow.

---
workflows:
  ingest:
    compatibility:
      nextstrain run: True
  phylogenetic/all-clades:
    compatibility:
      nextstrain run: True
    snakefile: phylogenetic/Snakefile
    configfile: phylogenetic/defaults/mpxv/config.yaml
  phylogenetic/clade-I:
    compatibility:
      nextstrain run: True
    snakefile: phylogenetic/Snakefile
    configfile: phylogenetic/defaults/clade-i/config.yaml
  phylogenetic/clade-IIb:
    compatibility:
      nextstrain run: True
    snakefile: phylogenetic/Snakefile
    configfile: phylogenetic/defaults/hmpxv1/config.yaml
  phylogenetic/lineage-B.1:
    compatibility:
      nextstrain run: True
    snakefile: phylogenetic/Snakefile
    configfile: phylogenetic/defaults/hmpxv1_big/config.yaml
  nextclade:
    compatibility:
      nextstrain run: False

Try testing this with

Install CLI from this PR

curl -fsSL --proto '=https' https://nextstrain.org/cli/installer/mac | bash -s pr-build/462

Setup mpox from corresponding PR

nextstrain setup mpox@nextstrain-run-workflows

Run any of the list phylo workflows, e.g.

nextstrain run mpox@nextstrain-run-workflows phylogenetic/clade-I /tmp/mpox-clade-i

Related issue(s)

Related to #454
Related to nextstrain/mpox#330

Checklist

Checks pass
Update changelog

joverlee521 · 2025-07-23T23:00:01Z

nextstrain/cli/command/run.py

+        *(["--configfile=%s" % (resolved_configfile)]
+            if resolved_configfile else []),


OH, just realized this overrides the custom config in the analysis directory...will think on this...

Yeah I remember it took me a bit of digging figure out snakemake's config handling. I was alluding to this here, but a much more thorough comment is this one from the original avian-flu PR. It all makes sense once you figure it out, but it's confusing before that because configfile: does more than you (I!) thought it does.

Put another way, I think --configfile base overlay or

configfile: base configfile: overlay

or configfile: base + --configfile overlay will all work; thinking through how the third example actually works, and what that says about what configfile: is doing behind the scenes was the 💡 moment for me. (The second example is how we run external analysis directories with a default config.yaml overlay, and the third is the documented way to run external analysis directories with a differently-named overlay YAML.) But --configfile base plus configfile: overlay doesn't work, and I don't think it can be made to work in a nice way. (I think that's what this PR is trying to do, but maybe not?)

As a thought experiment, I think the runner (via nextstrain-pathogen.yaml) could define --config base_configfile: base and then in the snakefile we leverage that to do something like

if 'base_configfile' in config: configfile: path.join(repo_directory, config['base_configfile' ])

but I'm sure there's a nicer way.

Ah thanks for the pointers @jameshadfield!

I was reading through the Snakemake config docs again and noticed that there is also an API method for overwriting configs (snakemake.utils.update_config). As another experiment, I thought we could replace how we incorporate the overlay config:

diff --git a/phylogenetic/Snakefile b/phylogenetic/Snakefile index 57a9da9474..399e6eef9f 100644 --- a/phylogenetic/Snakefile +++ b/phylogenetic/Snakefile @@ -16,7 +16,9 @@ # Use custom configuration from analysis directory (i.e. working dir), if any. if os.path.exists("config.yaml"): - configfile: "config.yaml" + with open("config.yaml", "r", encoding = "utf-8") as f: + custom_config = yaml.safe_load(f) + snakemake.utils.update_config(config, custom_config)

Unfortunately, this means the overlay also overrides any --config, which is not an issue for nextstrain run, but is unexpected behavior for directly running snakemake or nextstrain build...

As a thought experiment, I think the runner (via nextstrain-pathogen.yaml) could define --config base_configfile: base and then in the snakefile we leverage that to do something like

if 'base_configfile' in config: configfile: path.join(repo_directory, config['base_configfile' ])

Tried this out locally and it works as expected if it's defined before configfile: overlay. So for our workflows the config merging order will be

--config base_configfile=... < configfile: overlay < --configfile < --config

Not sure how I feel about that special case for base_configfile yet...

Yeah, that seems... fine. I think! It's not nextstrain run specific per-se as you can run via snakemake --configfile base overlay pretty easily.

And this should work nicely with custom-named config overlays as you can add --configfile overlay.yaml (if nextstrain run allows you to add your own snakemake args?!?)

Would you follow this up with the removal of lines such as these in measles?

It's not nextstrain run specific per-se as you can run via snakemake --configfile base overlay pretty easily.

Ah, I meant implemented specifically for nextstrain run here since this change shouldn't affect how workflows run for snakemake or nextstrain build calls.

And this should work nicely with custom-named config overlays as you can add --configfile overlay.yaml (if nextstrain run allows you to add your own snakemake args?!?)

As of right now, nextstrain run does not allow you to pass in additional snakemake CLI args (maybe it would work with a default Snakemake profile?)

Would you follow this up with the removal of lines such as these in measles?

I think so? Unless we expect the config.yaml in the analysis directory to get automatically used when running with snakemake or nextstrain build.

Unless we expect the config.yaml in the analysis directory to get automatically used when running with snakemake or nextstrain build.

We have two different scenarios:

There's clearly a default configfile (e.g. measles, avian-flu)

There's no one default, you have to select it somehow (e.g. mpox)

Outside of nextstrain run I'm not sure of the best behaviour for (1) - should you be forced to supply the default --configfile yourself, or can the Snakefile (or config.smk etc) do that for you as it currently does. (Note that you already have to supply --snakefile ....) I guess consistency across (1) and (2) would be nice.

Yeah, probably best to write out how all of the different methods of running the workflow will look...

For (1) - a pathogen that does have a Snakemake code to load default config and automatically use the analysis config overlay such as measles, the commands are pretty simple:

snakemake \ --snakefile phylogenetic/Snakefile \ --directory /tmp/measles-phylo

# Hypothetical custom build config within phylogenetic/build-configs/overlay nextstrain build phylogenetic \ --directory build-configs/overlay/ \

The nextstrain-pathogen.yaml will be simple for nextstrain run since it does not need to define the snakefile or config files:

--- compatibility: nextstrain run: ingest: ~ phylogenetic: ~ nextclade: ~

For (2) - a pathogen without a default config, the automatic use of analysis config overlay does not work as expected. There will be no Snakemake code to define configfile, so we must always specify both the base config and overlay via the --configfile option:

snakemake \ --snakefile phylogenetic/Snakefile \ --directory /tmp/mpox-phylo-clade-i \ --configfile phylogenetic/defaults/clade-i/config.yaml \ --configfile /tmp/mpox-phylo-clade-i/config.yaml

# Hypothetical custom build config within phylogenetic/build-configs/overlay nextstrain build phylogenetic \ --directory build-configs/overlay/ \ --configfile defaults/clade-i/config.yaml \ --configfile build-configs/overlay/config.yaml

The nextstrain-pathogen.yaml can will define the snakefile and configfiles for nextstrain run:

--- compatibility: nextstrain run: ingest: ~ phylogenetic/all-clades: snakefile: phylogenetic/Snakefile configfile: phylogenetic/defaults/mpxv/config.yaml phylogenetic/clade-I: snakefile: phylogenetic/Snakefile configfile: phylogenetic/defaults/clade-i/config.yaml phylogenetic/clade-IIb: snakefile: phylogenetic/Snakefile configfile: phylogenetic/defaults/hmpxv1/config.yaml phylogenetic/lineage-B.1: snakefile: phylogenetic/Snakefile configfile: phylogenetic/defaults/hmpxv1_big/config.yaml

If we remove the default base config and automatic use of analysis config overlays, then the commands for measles would be more verbose and consistent with mpox. However, after writing this out, I'm not sure that consistency is worth it, especially since the --configfile for the base config will always be the same for measles...

Thanks! Can you add to that comment what the nextstrain-pathogen.yaml would look like and if the workflow uses snakemake code to load the default/config overlays. I'm guessing measles has a YAML which doesn't define any configfiles and does have snakemake code like this, and (2) is the opposite?

…config Because of the order in which Snakemake merges configs, we must also provide the user's config from their analysis directory to override workflow configs provided via the `--configfile` option. See detailed discussion in <#462 (comment)>

Use the `nextstrain-pathogen.yaml` to list the available phylogenetic workflows for `nextstrain run`. Requires changes from <nextstrain/cli#462> to properly use the correct Snakefile and configfile for each workflow.

Because of the order in which Snakemake merges configs, we must also provide the user's config from their analysis directory to override workflow configs provided via the `--configfile` option. See detailed discussion in <#462 (comment)>

nextstrain/cli/pathogens.py

@victorlin

List workflows under a top level `workflows` key in the `nextstrain-pathogen.yaml` file instead of listing workflows under `compatibility['nextstrain run']`. Suggested by @victorlin <#462 (comment)>

@victorlin

List compatible workflows under a top level `nextstrain_run_workflows` key in the `nextstrain-pathogen.yaml` file instead of listing workflows under `compatibility['nextstrain run']`. The `compatibility['nextstrain run']` boolean is still supported for backwards compatibility. From discussion with @victorlin in <#462 (comment)>

Because of the order in which Snakemake merges configs, we must also provide the user's config from their analysis directory to override workflow configs provided via the `--configfile` option. See detailed discussion in <#462 (comment)>

@victorlin

List workflows under a top level `workflows` key in the `nextstrain-pathogen.yaml` file where each workflow has it's own `compatibility` key: ```yaml workflows: ingest: compatibility: nextstrain run: True phylogenetic/all-clades: compatibility: nextstrain run: True snakefile: phylogenetic/Snakefile configfile: phylogenetic/defaults/mpxv/config.yaml ``` The top level `compatibility['nextstrain run']` boolean is still supported for backwards compatibility. From discussion with @victorlin in <#462 (comment)>

nextstrain/cli/command/run.py

victorlin · 2025-08-14T18:05:52Z

CHANGES.md

 ## Improvements

+* `nextstrain run` supports running workflows with defined Snakefiles and
+  configfiles in the `nextstrain-pathogen.yaml` file. This is mainly relevant


(non-blocking) There's no documentation for the nextstrain-pathogen.yaml schema yet, but it might be good to add soon.

For sure, I was going to add the schema next. I'm thinking I'll stick with jsonschema that we already use for Augur.

tsibley · 2025-09-09T20:48:26Z

I think this specifying of the Snakefile and configfile (the "entrypoint") for each workflow in nextstrain-pathogen.yaml isn't the right direction. Defining the entrypoint with a directory (containing a Snakefile and a configfile) seems to me to be easier to use, discover, and understand without having to know how nextstrain run works internally. Using a directory allows the pathogen workflows to have lots of flexibility with implementation while still presenting a consistent interface (the directory name) at all levels (source tree, snakemake, nextstrain build, nextstrain run) instead of presenting one interface to nextstrain run and another to the other levels.

I'm not sure I've grokked all the hurdles encountered in nextstrain/mpox#327, but my understanding is they pushed you towards nextstrain/mpox#314 and this solution instead? Perhaps we can discuss the hurdles and options around them?

tsibley · 2025-09-09T20:51:35Z

That said, +1 for supporting per-workflow registration metadata declared in nextstrain-pathogen.yaml as:

---
workflows:
  ingest:
    compatibility:
      nextstrain run: True
  phylogenetic/all-clades:
    compatibility:
      nextstrain run: True
    other stuff: …

joverlee521 · 2025-09-10T17:06:13Z

That said, +1 for supporting per-workflow registration metadata declared in nextstrain-pathogen.yaml as:

👍 I'll pull this out as a separate change as a follow up to #461.

I think this specifying of the Snakefile and configfile (the "entrypoint") for each workflow in nextstrain-pathogen.yaml isn't the right direction. Defining the entrypoint with a directory (containing a Snakefile and a configfile) seems to me to be easier to use, discover, and understand without having to know how nextstrain run works internally. Using a directory allows the pathogen workflows to have lots of flexibility with implementation while still presenting a consistent interface (the directory name) at all levels (source tree, snakemake, nextstrain build, nextstrain run) instead of presenting one interface to nextstrain run and another to the other levels.

I'm not sure I've grokked all the hurdles encountered in nextstrain/mpox#327, but my understanding is they pushed you towards nextstrain/mpox#314 and this solution instead? Perhaps we can discuss the hurdles and options around them?

I think the directory structure makes sense for workflows that are completely separate workflows (e.g. ingest vs phylo), but not as clear for workflows that essentially running the exact same Snakemake workflow with different configs like mpox phylo. Basically, I was not a fan of the nested directory structure that I landed on in nextstrain/mpox#327, especially when trying to figure out the colocation/separation of config files. Maybe this is more of reflection on the structure of the mpox phylo workflow rather than the nextstrain run command, i.e. this would not be an issue if the mpox phylo config used the multi-build schema.

tsibley · 2025-09-10T18:56:13Z

I think the directory structure makes sense for workflows that are completely separate workflows (e.g. ingest vs phylo), but not as clear for workflows that essentially running the exact same Snakemake workflow with different configs like mpox phylo. […] Maybe this is more of reflection on the structure of the mpox phylo workflow rather than the nextstrain run command, i.e. this would not be an issue if the mpox phylo config used the multi-build schema.

Maybe this is the crucial point: I think the pattern we should use for our workflows is that each workflow has a name (ingest, phylo, phylo/foo, etc.) and a single default/stock config and set of default data files. If we want an additional default/stock config or sets of default data, then we give that a first-class name as conceptually a new workflow (e.g. phylo/bar), even if under the hood it reuses/shares bits of code/data. A workflow may still be multi-build, i.e. produce multiple distinct output sets from a single config/input data set, and this may be more appropriate for some situations than others. Choosing between multiple workflows vs. a single multi-build workflow is left up to the pathogen repo authors.

The idea here is that providing multiple default/stock configs and associated data quickly becomes tangled and there's not a single name to refer to that collection of things. It's more explicit and conceptually cleaner (if more verbose at times) to give that default config a name and then collect everything it needs under that name. This keeps the interface simpler. In my view, the use of multiple default/stock configs like in mpox was an attempt to give each variant of the workflow a name and we should follow thru with that more comprehensively.

(Perhaps mpox wants to transition to a single multi-build workflow if we feel like we have a better grasp on the interface for those now, e.g. with globbing in config keys, but it wouldn't have to.)

Basically, I was not a fan of the nested directory structure that I landed on in nextstrain/mpox#327, especially when trying to figure out the colocation/separation of config files.

If we want to keep mpox as not a multi-build workflow, then I think we can work on ways to improve the directory structure there. I'd suggest starting out by sharing less, e.g. doing away with direct config/workflow access to phylogenetic/defaults/ and instead consistently expecting files in phylogenetic/foo/defaults/. We can symlink if we want to share files.

@victorlin

List workflows under a top level `workflows` key in the `nextstrain-pathogen.yaml` file where each workflow has it's own `compatibility` key: ```yaml workflows: ingest: compatibility: nextstrain run: True phylogenetic: compatibility: nextstrain run: True ``` The top level `compatibility['nextstrain run']` boolean is still supported for backwards compatibility. From discussion with @victorlin in <#462 (comment)>

@victorlin

List workflows under a top level `workflows` key in the `nextstrain-pathogen.yaml` file where each workflow has it's own `compatibility` key: ```yaml workflows: ingest: compatibility: nextstrain run: True phylogenetic: compatibility: nextstrain run: True ``` The top level `compatibility['nextstrain run']` boolean is still supported for backwards compatibility. From discussion with @victorlin in <#462 (comment)>

Allow maintainers to define paths to each workflow's Snakefile and configfile in the nextstrain-pathogen.yaml. Using mpox as an example, this allows the named workflows to use the shared Snakefile and their own custom configfiles without needing to create separate Snakefiles for each workflow. ``` --- compatibility: nextstrain run: ingest: ~ phylogenetic/all-clades: snakefile: phylogenetic/Snakefile configfile: phylogenetic/defaults/mpxv/config.yaml phylogenetic/clade-I: snakefile: phylogenetic/Snakefile configfile: phylogenetic/defaults/clade-i/config.yaml phylogenetic/clade-IIb: snakefile: phylogenetic/Snakefile configfile: phylogenetic/defaults/hmpxv1/config.yaml phylogenetic/lineage-B.1: snakefile: phylogenetic/Snakefile configfile: phylogenetic/deafults/hmpxv1_big/config.yaml ```

Because of the order in which Snakemake merges configs, we must also provide the user's config from their analysis directory to override workflow configs provided via the `--configfile` option. See detailed discussion in <#462 (comment)>

@victorlin

…run` compatibility Workflows are now registered under a top level "workflows" key in the nextstrain-pathogen.yaml file, and each workflow has its own "compatibility" declarations. The only current "compatibility" declaration remains "nextstrain run". This allows for more meta information about each workflow to be registered in one place not under the banner of compatibility. The nextstrain-pathogen.yaml schema changes to a workflow-first structure like: workflows: ingest: compatibility: nextstrain run: yes phylogenetic: compatibility: nextstrain run: no instead of a compatibility-first structure like: compatibility: nextstrain run: ingest: yes phylogenetic: no as previously introduced¹ (but not yet released). As it was never released, don't bother supporting the latter schema at all anymore. Change first arose out of discussion started by @victorlin in a related PR.² Subsequent discussion with @tsibley³ led to dropping support for the unreleased schema. ¹ <#461> ² <#462 (comment)> ³ <#472 (comment)> Co-authored-by: Thomas Sibley <tsibley@fredhutch.org>

@victorlin

…run` compatibility Workflows are now registered under a top level "workflows" key in the nextstrain-pathogen.yaml file, and each workflow has its own "compatibility" declarations. The only current "compatibility" declaration remains "nextstrain run". This allows for more meta information about each workflow to be registered in one place not under the banner of compatibility. The nextstrain-pathogen.yaml schema changes to a workflow-first structure like: workflows: ingest: compatibility: nextstrain run: yes phylogenetic: compatibility: nextstrain run: no instead of a compatibility-first structure like: compatibility: nextstrain run: ingest: yes phylogenetic: no as previously introduced¹ (but not yet released). As it was never released, don't bother supporting the latter schema at all anymore. Change first arose out of discussion started by @victorlin in a related PR.² Subsequent discussion with @tsibley³ led to dropping support for the unreleased schema. ¹ <#461> ² <#462 (comment)> ³ <#472 (comment)> Co-authored-by: Thomas Sibley <tsibley@fredhutch.org>

@victorlin

…run` compatibility Workflows are now registered under a top level "workflows" key in the nextstrain-pathogen.yaml file, and each workflow has its own "compatibility" declarations. The only current "compatibility" declaration remains "nextstrain run". This allows for more meta information about each workflow to be registered in one place not under the banner of compatibility. The nextstrain-pathogen.yaml schema changes to a workflow-first structure like: workflows: ingest: compatibility: nextstrain run: yes phylogenetic: compatibility: nextstrain run: no instead of a compatibility-first structure like: compatibility: nextstrain run: ingest: yes phylogenetic: no as previously introduced¹ (but not yet released). As it was never released, don't bother supporting the latter schema at all anymore. Change first arose out of discussion started by @victorlin in a related PR.² Subsequent discussion with @tsibley³ led to dropping support for the unreleased schema. ¹ <#461> ² <#462 (comment)> ³ <#472 (comment)> Co-authored-by: Thomas Sibley <tsibley@fredhutch.org>

joverlee521 mentioned this pull request Jul 23, 2025

phylogenetic: Support nextstrain run (alternative) nextstrain/mpox#330

Draft

2 tasks

joverlee521 commented Jul 23, 2025

View reviewed changes

joverlee521 force-pushed the register-workflow-files branch from b14e591 to 81f6365 Compare July 30, 2025 18:37

Base automatically changed from register-pathogen-workflows to master July 30, 2025 21:21

joverlee521 force-pushed the register-workflow-files branch from 81f6365 to a216165 Compare July 30, 2025 21:27

joverlee521 marked this pull request as ready for review July 30, 2025 21:27

victorlin reviewed Aug 1, 2025

View reviewed changes

nextstrain/cli/pathogens.py Show resolved Hide resolved

joverlee521 force-pushed the register-workflow-files branch from a216165 to f04761d Compare August 8, 2025 22:59

victorlin approved these changes Aug 14, 2025

View reviewed changes

joverlee521 mentioned this pull request Aug 14, 2025

Add schema for nextstrain-pathogen.yaml #466

Closed

joverlee521 mentioned this pull request Sep 10, 2025

pathogens: Define workflows in top level key of nextstrain-pathogen.yaml #472

Merged

5 tasks

joverlee521 added 4 commits September 15, 2025 13:13

Update changelog

1e4a963

joverlee521 force-pushed the register-workflow-files branch from f04761d to 1e4a963 Compare September 15, 2025 22:14

joverlee521 changed the base branch from master to workflow-compatibility September 15, 2025 22:14

joverlee521 marked this pull request as draft September 15, 2025 22:14

tsibley force-pushed the workflow-compatibility branch 2 times, most recently from fad921c to 8e98a95 Compare September 24, 2025 21:16

tsibley force-pushed the workflow-compatibility branch from 8e98a95 to 20c99d4 Compare September 24, 2025 22:20

Base automatically changed from workflow-compatibility to master September 25, 2025 16:09

		*(["--configfile=%s" % (resolved_configfile)]
		if resolved_configfile else []),

Register workflow files in nextstrain-pathogen.yaml #462

Are you sure you want to change the base?

Register workflow files in nextstrain-pathogen.yaml #462

Uh oh!

Conversation

joverlee521 commented Jul 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description of proposed changes

Related issue(s)

Checklist

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jameshadfield Jul 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jameshadfield Jul 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jameshadfield Jul 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

joverlee521 Jul 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tsibley commented Sep 9, 2025

Uh oh!

tsibley commented Sep 9, 2025

Uh oh!

joverlee521 commented Sep 10, 2025

Uh oh!

tsibley commented Sep 10, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Register workflow files in `nextstrain-pathogen.yaml` #462

Register workflow files in `nextstrain-pathogen.yaml` #462

joverlee521 commented Jul 23, 2025 •

edited

Loading

jameshadfield Jul 24, 2025 •

edited

Loading

jameshadfield Jul 25, 2025 •

edited

Loading

jameshadfield Jul 27, 2025 •

edited

Loading

joverlee521 Jul 29, 2025 •

edited

Loading