Skip to content

Conversation

@joverlee521
Copy link
Contributor

@joverlee521 joverlee521 commented Jul 23, 2025

Description of proposed changes

Allow maintainers to define paths to each workflow's Snakefile and configfile in the nextstrain-pathogen.yaml. Using mpox as an example, this allows the named workflows to use the shared Snakefile and their own custom configfiles without needing to create separate Snakefiles for each workflow.

---
workflows:
  ingest:
    compatibility:
      nextstrain run: True
  phylogenetic/all-clades:
    compatibility:
      nextstrain run: True
    snakefile: phylogenetic/Snakefile
    configfile: phylogenetic/defaults/mpxv/config.yaml
  phylogenetic/clade-I:
    compatibility:
      nextstrain run: True
    snakefile: phylogenetic/Snakefile
    configfile: phylogenetic/defaults/clade-i/config.yaml
  phylogenetic/clade-IIb:
    compatibility:
      nextstrain run: True
    snakefile: phylogenetic/Snakefile
    configfile: phylogenetic/defaults/hmpxv1/config.yaml
  phylogenetic/lineage-B.1:
    compatibility:
      nextstrain run: True
    snakefile: phylogenetic/Snakefile
    configfile: phylogenetic/defaults/hmpxv1_big/config.yaml
  nextclade:
    compatibility:
      nextstrain run: False

Try testing this with

  1. Install CLI from this PR
curl -fsSL --proto '=https' https://nextstrain.org/cli/installer/mac | bash -s pr-build/462
  1. Setup mpox from corresponding PR
nextstrain setup mpox@nextstrain-run-workflows
  1. Run any of the list phylo workflows, e.g.
nextstrain run mpox@nextstrain-run-workflows phylogenetic/clade-I /tmp/mpox-clade-i

Related issue(s)

Related to #454
Related to nextstrain/mpox#330

Checklist

  • Checks pass
  • Update changelog

Comment on lines +296 to +306
*(["--configfile=%s" % (resolved_configfile)]
if resolved_configfile else []),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OH, just realized this overrides the custom config in the analysis directory...will think on this...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I remember it took me a bit of digging figure out snakemake's config handling. I was alluding to this here, but a much more thorough comment is this one from the original avian-flu PR. It all makes sense once you figure it out, but it's confusing before that because configfile: does more than you (I!) thought it does.

Copy link
Member

@jameshadfield jameshadfield Jul 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put another way, I think --configfile base overlay or

configfile: base
configfile: overlay

or configfile: base + --configfile overlay will all work; thinking through how the third example actually works, and what that says about what configfile: is doing behind the scenes was the 💡 moment for me. (The second example is how we run external analysis directories with a default config.yaml overlay, and the third is the documented way to run external analysis directories with a differently-named overlay YAML.) But --configfile base plus configfile: overlay doesn't work, and I don't think it can be made to work in a nice way. (I think that's what this PR is trying to do, but maybe not?)

As a thought experiment, I think the runner (via nextstrain-pathogen.yaml) could define --config base_configfile: base and then in the snakefile we leverage that to do something like

if 'base_configfile' in config:
  configfile: path.join(repo_directory, config['base_configfile' ])

but I'm sure there's a nicer way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah thanks for the pointers @jameshadfield!

I was reading through the Snakemake config docs again and noticed that there is also an API method for overwriting configs (snakemake.utils.update_config). As another experiment, I thought we could replace how we incorporate the overlay config:

diff --git a/phylogenetic/Snakefile b/phylogenetic/Snakefile
index 57a9da9474..399e6eef9f 100644
--- a/phylogenetic/Snakefile
+++ b/phylogenetic/Snakefile
@@ -16,7 +16,9 @@
 
 # Use custom configuration from analysis directory (i.e. working dir), if any.
 if os.path.exists("config.yaml"):
-    configfile: "config.yaml"
+    with open("config.yaml", "r", encoding = "utf-8") as f:
+        custom_config = yaml.safe_load(f)
+        snakemake.utils.update_config(config, custom_config)

Unfortunately, this means the overlay also overrides any --config, which is not an issue for nextstrain run, but is unexpected behavior for directly running snakemake or nextstrain build...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a thought experiment, I think the runner (via nextstrain-pathogen.yaml) could define --config base_configfile: base and then in the snakefile we leverage that to do something like

if 'base_configfile' in config:
  configfile: path.join(repo_directory, config['base_configfile' ])

Tried this out locally and it works as expected if it's defined before configfile: overlay. So for our workflows the config merging order will be

--config base_configfile=... < configfile: overlay < --configfile < --config

Not sure how I feel about that special case for base_configfile yet...

Copy link
Member

@jameshadfield jameshadfield Jul 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that seems... fine. I think! It's not nextstrain run specific per-se as you can run via snakemake --configfile base overlay pretty easily.

And this should work nicely with custom-named config overlays as you can add --configfile overlay.yaml (if nextstrain run allows you to add your own snakemake args?!?)

Would you follow this up with the removal of lines such as these in measles?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not nextstrain run specific per-se as you can run via snakemake --configfile base overlay pretty easily.

Ah, I meant implemented specifically for nextstrain run here since this change shouldn't affect how workflows run for snakemake or nextstrain build calls.

And this should work nicely with custom-named config overlays as you can add --configfile overlay.yaml (if nextstrain run allows you to add your own snakemake args?!?)

As of right now, nextstrain run does not allow you to pass in additional snakemake CLI args (maybe it would work with a default Snakemake profile?)

Would you follow this up with the removal of lines such as these in measles?

I think so? Unless we expect the config.yaml in the analysis directory to get automatically used when running with snakemake or nextstrain build.

Copy link
Member

@jameshadfield jameshadfield Jul 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless we expect the config.yaml in the analysis directory to get automatically used when running with snakemake or nextstrain build.

We have two different scenarios:

  1. There's clearly a default configfile (e.g. measles, avian-flu)
  2. There's no one default, you have to select it somehow (e.g. mpox)

Outside of nextstrain run I'm not sure of the best behaviour for (1) - should you be forced to supply the default --configfile yourself, or can the Snakefile (or config.smk etc) do that for you as it currently does. (Note that you already have to supply --snakefile ....) I guess consistency across (1) and (2) would be nice.

Copy link
Contributor Author

@joverlee521 joverlee521 Jul 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, probably best to write out how all of the different methods of running the workflow will look...

For (1) - a pathogen that does have a Snakemake code to load default config and automatically use the analysis config overlay such as measles, the commands are pretty simple:

snakemake \
    --snakefile phylogenetic/Snakefile \
    --directory /tmp/measles-phylo
# Hypothetical custom build config within phylogenetic/build-configs/overlay
nextstrain build phylogenetic \
    --directory build-configs/overlay/ \

The nextstrain-pathogen.yaml will be simple for nextstrain run since it does not need to define the snakefile or config files:

---
compatibility:
  nextstrain run:
    ingest: ~
    phylogenetic: ~
    nextclade: ~

For (2) - a pathogen without a default config, the automatic use of analysis config overlay does not work as expected. There will be no Snakemake code to define configfile, so we must always specify both the base config and overlay via the --configfile option:

snakemake \
    --snakefile phylogenetic/Snakefile \
    --directory /tmp/mpox-phylo-clade-i \
    --configfile phylogenetic/defaults/clade-i/config.yaml \
    --configfile /tmp/mpox-phylo-clade-i/config.yaml
# Hypothetical custom build config within phylogenetic/build-configs/overlay
nextstrain build phylogenetic \
    --directory build-configs/overlay/ \
    --configfile defaults/clade-i/config.yaml \
    --configfile build-configs/overlay/config.yaml

The nextstrain-pathogen.yaml can will define the snakefile and configfiles for nextstrain run:

---
compatibility:
  nextstrain run:
    ingest: ~
    phylogenetic/all-clades:
        snakefile: phylogenetic/Snakefile
        configfile: phylogenetic/defaults/mpxv/config.yaml
    phylogenetic/clade-I:
        snakefile: phylogenetic/Snakefile
        configfile: phylogenetic/defaults/clade-i/config.yaml
    phylogenetic/clade-IIb:
        snakefile: phylogenetic/Snakefile
        configfile: phylogenetic/defaults/hmpxv1/config.yaml
    phylogenetic/lineage-B.1:
        snakefile: phylogenetic/Snakefile
        configfile: phylogenetic/defaults/hmpxv1_big/config.yaml

If we remove the default base config and automatic use of analysis config overlays, then the commands for measles would be more verbose and consistent with mpox. However, after writing this out, I'm not sure that consistency is worth it, especially since the --configfile for the base config will always be the same for measles...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Can you add to that comment what the nextstrain-pathogen.yaml would look like and if the workflow uses snakemake code to load the default/config overlays. I'm guessing measles has a YAML which doesn't define any configfiles and does have snakemake code like this, and (2) is the opposite?

joverlee521 added a commit that referenced this pull request Jul 24, 2025
…config

Because of the order in which Snakemake merges configs, we must also provide the 
user's config from their analysis directory to override workflow configs
provided via the `--configfile` option.

See detailed discussion in 
<#462 (comment)>
joverlee521 added a commit to nextstrain/mpox that referenced this pull request Jul 30, 2025
Use the `nextstrain-pathogen.yaml` to list the available phylogenetic 
workflows for `nextstrain run`. 

Requires changes from <nextstrain/cli#462>
to properly use the correct Snakefile and configfile for each workflow.
joverlee521 added a commit that referenced this pull request Jul 30, 2025
Because of the order in which Snakemake merges configs, we must also provide the 
user's config from their analysis directory to override workflow configs
provided via the `--configfile` option.

See detailed discussion in 
<#462 (comment)>
@joverlee521 joverlee521 force-pushed the register-workflow-files branch from b14e591 to 81f6365 Compare July 30, 2025 18:37
Base automatically changed from register-pathogen-workflows to master July 30, 2025 21:21
joverlee521 added a commit that referenced this pull request Jul 30, 2025
Because of the order in which Snakemake merges configs, we must also provide the 
user's config from their analysis directory to override workflow configs
provided via the `--configfile` option.

See detailed discussion in 
<#462 (comment)>
@joverlee521 joverlee521 force-pushed the register-workflow-files branch from 81f6365 to a216165 Compare July 30, 2025 21:27
@joverlee521 joverlee521 marked this pull request as ready for review July 30, 2025 21:27
joverlee521 added a commit that referenced this pull request Aug 1, 2025
List workflows under a top level `workflows` key in the `nextstrain-pathogen.yaml`
file instead of listing workflows under `compatibility['nextstrain run']`. 

Suggested by @victorlin 
<#462 (comment)>
joverlee521 added a commit that referenced this pull request Aug 4, 2025
List compatible workflows under a top level `nextstrain_run_workflows` key in 
the  `nextstrain-pathogen.yaml` file instead of listing workflows under 
`compatibility['nextstrain run']`. The `compatibility['nextstrain run']` 
boolean is still supported for backwards compatibility. 

From discussion with @victorlin in
<#462 (comment)>
joverlee521 added a commit that referenced this pull request Aug 8, 2025
Because of the order in which Snakemake merges configs, we must also provide the 
user's config from their analysis directory to override workflow configs
provided via the `--configfile` option.

See detailed discussion in 
<#462 (comment)>
joverlee521 added a commit that referenced this pull request Aug 8, 2025
List workflows under a top level `workflows` key in the `nextstrain-pathogen.yaml` 
file where each workflow has it's own `compatibility` key:

```yaml
workflows:
  ingest:
    compatibility:
      nextstrain run: True
  phylogenetic/all-clades:
    compatibility:
      nextstrain run: True
    snakefile: phylogenetic/Snakefile
    configfile: phylogenetic/defaults/mpxv/config.yaml
```

The top level `compatibility['nextstrain run']` boolean is still supported 
for backwards compatibility.

From discussion with @victorlin in
<#462 (comment)>
@joverlee521 joverlee521 force-pushed the register-workflow-files branch from a216165 to f04761d Compare August 8, 2025 22:59
## Improvements

* `nextstrain run` supports running workflows with defined Snakefiles and
configfiles in the `nextstrain-pathogen.yaml` file. This is mainly relevant
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(non-blocking) There's no documentation for the nextstrain-pathogen.yaml schema yet, but it might be good to add soon.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For sure, I was going to add the schema next. I'm thinking I'll stick with jsonschema that we already use for Augur.

@tsibley
Copy link
Contributor

tsibley commented Sep 9, 2025

I think this specifying of the Snakefile and configfile (the "entrypoint") for each workflow in nextstrain-pathogen.yaml isn't the right direction. Defining the entrypoint with a directory (containing a Snakefile and a configfile) seems to me to be easier to use, discover, and understand without having to know how nextstrain run works internally. Using a directory allows the pathogen workflows to have lots of flexibility with implementation while still presenting a consistent interface (the directory name) at all levels (source tree, snakemake, nextstrain build, nextstrain run) instead of presenting one interface to nextstrain run and another to the other levels.

I'm not sure I've grokked all the hurdles encountered in nextstrain/mpox#327, but my understanding is they pushed you towards nextstrain/mpox#314 and this solution instead? Perhaps we can discuss the hurdles and options around them?

@tsibley
Copy link
Contributor

tsibley commented Sep 9, 2025

That said, +1 for supporting per-workflow registration metadata declared in nextstrain-pathogen.yaml as:

---
workflows:
  ingest:
    compatibility:
      nextstrain run: True
  phylogenetic/all-clades:
    compatibility:
      nextstrain run: True
    other stuff: 

@joverlee521
Copy link
Contributor Author

That said, +1 for supporting per-workflow registration metadata declared in nextstrain-pathogen.yaml as:

👍 I'll pull this out as a separate change as a follow up to #461.

I think this specifying of the Snakefile and configfile (the "entrypoint") for each workflow in nextstrain-pathogen.yaml isn't the right direction. Defining the entrypoint with a directory (containing a Snakefile and a configfile) seems to me to be easier to use, discover, and understand without having to know how nextstrain run works internally. Using a directory allows the pathogen workflows to have lots of flexibility with implementation while still presenting a consistent interface (the directory name) at all levels (source tree, snakemake, nextstrain build, nextstrain run) instead of presenting one interface to nextstrain run and another to the other levels.

I'm not sure I've grokked all the hurdles encountered in nextstrain/mpox#327, but my understanding is they pushed you towards nextstrain/mpox#314 and this solution instead? Perhaps we can discuss the hurdles and options around them?

I think the directory structure makes sense for workflows that are completely separate workflows (e.g. ingest vs phylo), but not as clear for workflows that essentially running the exact same Snakemake workflow with different configs like mpox phylo. Basically, I was not a fan of the nested directory structure that I landed on in nextstrain/mpox#327, especially when trying to figure out the colocation/separation of config files. Maybe this is more of reflection on the structure of the mpox phylo workflow rather than the nextstrain run command, i.e. this would not be an issue if the mpox phylo config used the multi-build schema.

@tsibley
Copy link
Contributor

tsibley commented Sep 10, 2025

I think the directory structure makes sense for workflows that are completely separate workflows (e.g. ingest vs phylo), but not as clear for workflows that essentially running the exact same Snakemake workflow with different configs like mpox phylo. […] Maybe this is more of reflection on the structure of the mpox phylo workflow rather than the nextstrain run command, i.e. this would not be an issue if the mpox phylo config used the multi-build schema.

Maybe this is the crucial point: I think the pattern we should use for our workflows is that each workflow has a name (ingest, phylo, phylo/foo, etc.) and a single default/stock config and set of default data files. If we want an additional default/stock config or sets of default data, then we give that a first-class name as conceptually a new workflow (e.g. phylo/bar), even if under the hood it reuses/shares bits of code/data. A workflow may still be multi-build, i.e. produce multiple distinct output sets from a single config/input data set, and this may be more appropriate for some situations than others. Choosing between multiple workflows vs. a single multi-build workflow is left up to the pathogen repo authors.

The idea here is that providing multiple default/stock configs and associated data quickly becomes tangled and there's not a single name to refer to that collection of things. It's more explicit and conceptually cleaner (if more verbose at times) to give that default config a name and then collect everything it needs under that name. This keeps the interface simpler. In my view, the use of multiple default/stock configs like in mpox was an attempt to give each variant of the workflow a name and we should follow thru with that more comprehensively.

(Perhaps mpox wants to transition to a single multi-build workflow if we feel like we have a better grasp on the interface for those now, e.g. with globbing in config keys, but it wouldn't have to.)

Basically, I was not a fan of the nested directory structure that I landed on in nextstrain/mpox#327, especially when trying to figure out the colocation/separation of config files.

If we want to keep mpox as not a multi-build workflow, then I think we can work on ways to improve the directory structure there. I'd suggest starting out by sharing less, e.g. doing away with direct config/workflow access to phylogenetic/defaults/ and instead consistently expecting files in phylogenetic/foo/defaults/. We can symlink if we want to share files.

joverlee521 added a commit that referenced this pull request Sep 10, 2025
List workflows under a top level `workflows` key in the `nextstrain-pathogen.yaml` 
file where each workflow has it's own `compatibility` key:

```yaml
workflows:
  ingest:
    compatibility:
      nextstrain run: True
  phylogenetic:
    compatibility:
      nextstrain run: True
```

The top level `compatibility['nextstrain run']` boolean is still supported 
for backwards compatibility.

From discussion with @victorlin in
<#462 (comment)>
List workflows under a top level `workflows` key in the `nextstrain-pathogen.yaml` 
file where each workflow has it's own `compatibility` key:

```yaml
workflows:
  ingest:
    compatibility:
      nextstrain run: True
  phylogenetic:
    compatibility:
      nextstrain run: True
```

The top level `compatibility['nextstrain run']` boolean is still supported 
for backwards compatibility.

From discussion with @victorlin in
<#462 (comment)>
Allow maintainers to define paths to each workflow's Snakefile and configfile 
in the nextstrain-pathogen.yaml. Using mpox as an example, this allows the named 
workflows to use the shared Snakefile and their own custom configfiles without 
needing to create separate Snakefiles for each workflow.
```
---
compatibility:
  nextstrain run:
    ingest: ~
    phylogenetic/all-clades:
        snakefile: phylogenetic/Snakefile
        configfile: phylogenetic/defaults/mpxv/config.yaml
    phylogenetic/clade-I:
        snakefile: phylogenetic/Snakefile
        configfile: phylogenetic/defaults/clade-i/config.yaml
    phylogenetic/clade-IIb:
        snakefile: phylogenetic/Snakefile
        configfile: phylogenetic/defaults/hmpxv1/config.yaml
    phylogenetic/lineage-B.1:
        snakefile: phylogenetic/Snakefile
        configfile: phylogenetic/deafults/hmpxv1_big/config.yaml
```
Because of the order in which Snakemake merges configs, we must also provide the 
user's config from their analysis directory to override workflow configs
provided via the `--configfile` option.

See detailed discussion in 
<#462 (comment)>
@joverlee521 joverlee521 force-pushed the register-workflow-files branch from f04761d to 1e4a963 Compare September 15, 2025 22:14
@joverlee521 joverlee521 changed the base branch from master to workflow-compatibility September 15, 2025 22:14
@joverlee521 joverlee521 marked this pull request as draft September 15, 2025 22:14
tsibley added a commit that referenced this pull request Sep 24, 2025
…run` compatibility

Workflows are now registered under a top level "workflows" key in the
nextstrain-pathogen.yaml file, and each workflow has its own
"compatibility" declarations.  The only current "compatibility"
declaration remains "nextstrain run".

This allows for more meta information about each workflow to be
registered in one place not under the banner of compatibility.

The nextstrain-pathogen.yaml schema changes to a workflow-first
structure like:

  workflows:
    ingest:
      compatibility:
        nextstrain run: yes
    phylogenetic:
      compatibility:
        nextstrain run: no

instead of a compatibility-first structure like:

  compatibility:
    nextstrain run:
      ingest: yes
      phylogenetic: no

as previously introduced¹ (but not yet released).  As it was never
released, don't bother supporting the latter schema at all anymore.

Change first arose out of discussion started by @victorlin in a related
PR.²  Subsequent discussion with @tsibley³ led to dropping support for
the unreleased schema.

¹ <#461>
² <#462 (comment)>
³ <#472 (comment)>

Co-authored-by: Thomas Sibley <tsibley@fredhutch.org>
@tsibley tsibley force-pushed the workflow-compatibility branch 2 times, most recently from fad921c to 8e98a95 Compare September 24, 2025 21:16
tsibley added a commit that referenced this pull request Sep 24, 2025
…run` compatibility

Workflows are now registered under a top level "workflows" key in the
nextstrain-pathogen.yaml file, and each workflow has its own
"compatibility" declarations.  The only current "compatibility"
declaration remains "nextstrain run".

This allows for more meta information about each workflow to be
registered in one place not under the banner of compatibility.

The nextstrain-pathogen.yaml schema changes to a workflow-first
structure like:

  workflows:
    ingest:
      compatibility:
        nextstrain run: yes
    phylogenetic:
      compatibility:
        nextstrain run: no

instead of a compatibility-first structure like:

  compatibility:
    nextstrain run:
      ingest: yes
      phylogenetic: no

as previously introduced¹ (but not yet released).  As it was never
released, don't bother supporting the latter schema at all anymore.

Change first arose out of discussion started by @victorlin in a related
PR.²  Subsequent discussion with @tsibley³ led to dropping support for
the unreleased schema.

¹ <#461>
² <#462 (comment)>
³ <#472 (comment)>

Co-authored-by: Thomas Sibley <tsibley@fredhutch.org>
tsibley added a commit that referenced this pull request Sep 24, 2025
…run` compatibility

Workflows are now registered under a top level "workflows" key in the
nextstrain-pathogen.yaml file, and each workflow has its own
"compatibility" declarations.  The only current "compatibility"
declaration remains "nextstrain run".

This allows for more meta information about each workflow to be
registered in one place not under the banner of compatibility.

The nextstrain-pathogen.yaml schema changes to a workflow-first
structure like:

  workflows:
    ingest:
      compatibility:
        nextstrain run: yes
    phylogenetic:
      compatibility:
        nextstrain run: no

instead of a compatibility-first structure like:

  compatibility:
    nextstrain run:
      ingest: yes
      phylogenetic: no

as previously introduced¹ (but not yet released).  As it was never
released, don't bother supporting the latter schema at all anymore.

Change first arose out of discussion started by @victorlin in a related
PR.²  Subsequent discussion with @tsibley³ led to dropping support for
the unreleased schema.

¹ <#461>
² <#462 (comment)>
³ <#472 (comment)>

Co-authored-by: Thomas Sibley <tsibley@fredhutch.org>
@tsibley tsibley force-pushed the workflow-compatibility branch from 8e98a95 to 20c99d4 Compare September 24, 2025 22:20
Base automatically changed from workflow-compatibility to master September 25, 2025 16:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants