-
Notifications
You must be signed in to change notification settings - Fork 20
Register workflow files in nextstrain-pathogen.yaml
#462
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
| *(["--configfile=%s" % (resolved_configfile)] | ||
| if resolved_configfile else []), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OH, just realized this overrides the custom config in the analysis directory...will think on this...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I remember it took me a bit of digging figure out snakemake's config handling. I was alluding to this here, but a much more thorough comment is this one from the original avian-flu PR. It all makes sense once you figure it out, but it's confusing before that because configfile: does more than you (I!) thought it does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put another way, I think --configfile base overlay or
configfile: base
configfile: overlay
or configfile: base + --configfile overlay will all work; thinking through how the third example actually works, and what that says about what configfile: is doing behind the scenes was the 💡 moment for me. (The second example is how we run external analysis directories with a default config.yaml overlay, and the third is the documented way to run external analysis directories with a differently-named overlay YAML.) But --configfile base plus configfile: overlay doesn't work, and I don't think it can be made to work in a nice way. (I think that's what this PR is trying to do, but maybe not?)
As a thought experiment, I think the runner (via nextstrain-pathogen.yaml) could define --config base_configfile: base and then in the snakefile we leverage that to do something like
if 'base_configfile' in config:
configfile: path.join(repo_directory, config['base_configfile' ])
but I'm sure there's a nicer way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah thanks for the pointers @jameshadfield!
I was reading through the Snakemake config docs again and noticed that there is also an API method for overwriting configs (snakemake.utils.update_config). As another experiment, I thought we could replace how we incorporate the overlay config:
diff --git a/phylogenetic/Snakefile b/phylogenetic/Snakefile
index 57a9da9474..399e6eef9f 100644
--- a/phylogenetic/Snakefile
+++ b/phylogenetic/Snakefile
@@ -16,7 +16,9 @@
# Use custom configuration from analysis directory (i.e. working dir), if any.
if os.path.exists("config.yaml"):
- configfile: "config.yaml"
+ with open("config.yaml", "r", encoding = "utf-8") as f:
+ custom_config = yaml.safe_load(f)
+ snakemake.utils.update_config(config, custom_config)Unfortunately, this means the overlay also overrides any --config, which is not an issue for nextstrain run, but is unexpected behavior for directly running snakemake or nextstrain build...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a thought experiment, I think the runner (via
nextstrain-pathogen.yaml) could define--config base_configfile: baseand then in the snakefile we leverage that to do something likeif 'base_configfile' in config: configfile: path.join(repo_directory, config['base_configfile' ])
Tried this out locally and it works as expected if it's defined before configfile: overlay. So for our workflows the config merging order will be
--config base_configfile=... < configfile: overlay < --configfile < --config
Not sure how I feel about that special case for base_configfile yet...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that seems... fine. I think! It's not nextstrain run specific per-se as you can run via snakemake --configfile base overlay pretty easily.
And this should work nicely with custom-named config overlays as you can add --configfile overlay.yaml (if nextstrain run allows you to add your own snakemake args?!?)
Would you follow this up with the removal of lines such as these in measles?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not nextstrain run specific per-se as you can run via snakemake --configfile base overlay pretty easily.
Ah, I meant implemented specifically for nextstrain run here since this change shouldn't affect how workflows run for snakemake or nextstrain build calls.
And this should work nicely with custom-named config overlays as you can add --configfile overlay.yaml (if nextstrain run allows you to add your own snakemake args?!?)
As of right now, nextstrain run does not allow you to pass in additional snakemake CLI args (maybe it would work with a default Snakemake profile?)
Would you follow this up with the removal of lines such as these in measles?
I think so? Unless we expect the config.yaml in the analysis directory to get automatically used when running with snakemake or nextstrain build.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless we expect the
config.yamlin the analysis directory to get automatically used when running withsnakemakeornextstrain build.
We have two different scenarios:
- There's clearly a default configfile (e.g. measles, avian-flu)
- There's no one default, you have to select it somehow (e.g. mpox)
Outside of nextstrain run I'm not sure of the best behaviour for (1) - should you be forced to supply the default --configfile yourself, or can the Snakefile (or config.smk etc) do that for you as it currently does. (Note that you already have to supply --snakefile ....) I guess consistency across (1) and (2) would be nice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, probably best to write out how all of the different methods of running the workflow will look...
For (1) - a pathogen that does have a Snakemake code to load default config and automatically use the analysis config overlay such as measles, the commands are pretty simple:
snakemake \
--snakefile phylogenetic/Snakefile \
--directory /tmp/measles-phylo
# Hypothetical custom build config within phylogenetic/build-configs/overlay
nextstrain build phylogenetic \
--directory build-configs/overlay/ \
The nextstrain-pathogen.yaml will be simple for nextstrain run since it does not need to define the snakefile or config files:
---
compatibility:
nextstrain run:
ingest: ~
phylogenetic: ~
nextclade: ~For (2) - a pathogen without a default config, the automatic use of analysis config overlay does not work as expected. There will be no Snakemake code to define configfile, so we must always specify both the base config and overlay via the --configfile option:
snakemake \
--snakefile phylogenetic/Snakefile \
--directory /tmp/mpox-phylo-clade-i \
--configfile phylogenetic/defaults/clade-i/config.yaml \
--configfile /tmp/mpox-phylo-clade-i/config.yaml
# Hypothetical custom build config within phylogenetic/build-configs/overlay
nextstrain build phylogenetic \
--directory build-configs/overlay/ \
--configfile defaults/clade-i/config.yaml \
--configfile build-configs/overlay/config.yaml
The nextstrain-pathogen.yaml can will define the snakefile and configfiles for nextstrain run:
---
compatibility:
nextstrain run:
ingest: ~
phylogenetic/all-clades:
snakefile: phylogenetic/Snakefile
configfile: phylogenetic/defaults/mpxv/config.yaml
phylogenetic/clade-I:
snakefile: phylogenetic/Snakefile
configfile: phylogenetic/defaults/clade-i/config.yaml
phylogenetic/clade-IIb:
snakefile: phylogenetic/Snakefile
configfile: phylogenetic/defaults/hmpxv1/config.yaml
phylogenetic/lineage-B.1:
snakefile: phylogenetic/Snakefile
configfile: phylogenetic/defaults/hmpxv1_big/config.yamlIf we remove the default base config and automatic use of analysis config overlays, then the commands for measles would be more verbose and consistent with mpox. However, after writing this out, I'm not sure that consistency is worth it, especially since the --configfile for the base config will always be the same for measles...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Can you add to that comment what the nextstrain-pathogen.yaml would look like and if the workflow uses snakemake code to load the default/config overlays. I'm guessing measles has a YAML which doesn't define any configfiles and does have snakemake code like this, and (2) is the opposite?
…config Because of the order in which Snakemake merges configs, we must also provide the user's config from their analysis directory to override workflow configs provided via the `--configfile` option. See detailed discussion in <#462 (comment)>
Use the `nextstrain-pathogen.yaml` to list the available phylogenetic workflows for `nextstrain run`. Requires changes from <nextstrain/cli#462> to properly use the correct Snakefile and configfile for each workflow.
Because of the order in which Snakemake merges configs, we must also provide the user's config from their analysis directory to override workflow configs provided via the `--configfile` option. See detailed discussion in <#462 (comment)>
b14e591 to
81f6365
Compare
Because of the order in which Snakemake merges configs, we must also provide the user's config from their analysis directory to override workflow configs provided via the `--configfile` option. See detailed discussion in <#462 (comment)>
81f6365 to
a216165
Compare
List workflows under a top level `workflows` key in the `nextstrain-pathogen.yaml` file instead of listing workflows under `compatibility['nextstrain run']`. Suggested by @victorlin <#462 (comment)>
List compatible workflows under a top level `nextstrain_run_workflows` key in the `nextstrain-pathogen.yaml` file instead of listing workflows under `compatibility['nextstrain run']`. The `compatibility['nextstrain run']` boolean is still supported for backwards compatibility. From discussion with @victorlin in <#462 (comment)>
Because of the order in which Snakemake merges configs, we must also provide the user's config from their analysis directory to override workflow configs provided via the `--configfile` option. See detailed discussion in <#462 (comment)>
List workflows under a top level `workflows` key in the `nextstrain-pathogen.yaml`
file where each workflow has it's own `compatibility` key:
```yaml
workflows:
ingest:
compatibility:
nextstrain run: True
phylogenetic/all-clades:
compatibility:
nextstrain run: True
snakefile: phylogenetic/Snakefile
configfile: phylogenetic/defaults/mpxv/config.yaml
```
The top level `compatibility['nextstrain run']` boolean is still supported
for backwards compatibility.
From discussion with @victorlin in
<#462 (comment)>
a216165 to
f04761d
Compare
| ## Improvements | ||
|
|
||
| * `nextstrain run` supports running workflows with defined Snakefiles and | ||
| configfiles in the `nextstrain-pathogen.yaml` file. This is mainly relevant |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(non-blocking) There's no documentation for the nextstrain-pathogen.yaml schema yet, but it might be good to add soon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For sure, I was going to add the schema next. I'm thinking I'll stick with jsonschema that we already use for Augur.
|
I think this specifying of the Snakefile and configfile (the "entrypoint") for each workflow in I'm not sure I've grokked all the hurdles encountered in nextstrain/mpox#327, but my understanding is they pushed you towards nextstrain/mpox#314 and this solution instead? Perhaps we can discuss the hurdles and options around them? |
|
That said, +1 for supporting per-workflow registration metadata declared in ---
workflows:
ingest:
compatibility:
nextstrain run: True
phylogenetic/all-clades:
compatibility:
nextstrain run: True
other stuff: … |
👍 I'll pull this out as a separate change as a follow up to #461.
I think the directory structure makes sense for workflows that are completely separate workflows (e.g. ingest vs phylo), but not as clear for workflows that essentially running the exact same Snakemake workflow with different configs like mpox phylo. Basically, I was not a fan of the nested directory structure that I landed on in nextstrain/mpox#327, especially when trying to figure out the colocation/separation of config files. Maybe this is more of reflection on the structure of the mpox phylo workflow rather than the |
Maybe this is the crucial point: I think the pattern we should use for our workflows is that each workflow has a name ( The idea here is that providing multiple default/stock configs and associated data quickly becomes tangled and there's not a single name to refer to that collection of things. It's more explicit and conceptually cleaner (if more verbose at times) to give that default config a name and then collect everything it needs under that name. This keeps the interface simpler. In my view, the use of multiple default/stock configs like in mpox was an attempt to give each variant of the workflow a name and we should follow thru with that more comprehensively. (Perhaps mpox wants to transition to a single multi-build workflow if we feel like we have a better grasp on the interface for those now, e.g. with globbing in config keys, but it wouldn't have to.)
If we want to keep mpox as not a multi-build workflow, then I think we can work on ways to improve the directory structure there. I'd suggest starting out by sharing less, e.g. doing away with direct config/workflow access to |
List workflows under a top level `workflows` key in the `nextstrain-pathogen.yaml`
file where each workflow has it's own `compatibility` key:
```yaml
workflows:
ingest:
compatibility:
nextstrain run: True
phylogenetic:
compatibility:
nextstrain run: True
```
The top level `compatibility['nextstrain run']` boolean is still supported
for backwards compatibility.
From discussion with @victorlin in
<#462 (comment)>
List workflows under a top level `workflows` key in the `nextstrain-pathogen.yaml`
file where each workflow has it's own `compatibility` key:
```yaml
workflows:
ingest:
compatibility:
nextstrain run: True
phylogenetic:
compatibility:
nextstrain run: True
```
The top level `compatibility['nextstrain run']` boolean is still supported
for backwards compatibility.
From discussion with @victorlin in
<#462 (comment)>
Allow maintainers to define paths to each workflow's Snakefile and configfile
in the nextstrain-pathogen.yaml. Using mpox as an example, this allows the named
workflows to use the shared Snakefile and their own custom configfiles without
needing to create separate Snakefiles for each workflow.
```
---
compatibility:
nextstrain run:
ingest: ~
phylogenetic/all-clades:
snakefile: phylogenetic/Snakefile
configfile: phylogenetic/defaults/mpxv/config.yaml
phylogenetic/clade-I:
snakefile: phylogenetic/Snakefile
configfile: phylogenetic/defaults/clade-i/config.yaml
phylogenetic/clade-IIb:
snakefile: phylogenetic/Snakefile
configfile: phylogenetic/defaults/hmpxv1/config.yaml
phylogenetic/lineage-B.1:
snakefile: phylogenetic/Snakefile
configfile: phylogenetic/deafults/hmpxv1_big/config.yaml
```
Because of the order in which Snakemake merges configs, we must also provide the user's config from their analysis directory to override workflow configs provided via the `--configfile` option. See detailed discussion in <#462 (comment)>
f04761d to
1e4a963
Compare
…run` compatibility
Workflows are now registered under a top level "workflows" key in the
nextstrain-pathogen.yaml file, and each workflow has its own
"compatibility" declarations. The only current "compatibility"
declaration remains "nextstrain run".
This allows for more meta information about each workflow to be
registered in one place not under the banner of compatibility.
The nextstrain-pathogen.yaml schema changes to a workflow-first
structure like:
workflows:
ingest:
compatibility:
nextstrain run: yes
phylogenetic:
compatibility:
nextstrain run: no
instead of a compatibility-first structure like:
compatibility:
nextstrain run:
ingest: yes
phylogenetic: no
as previously introduced¹ (but not yet released). As it was never
released, don't bother supporting the latter schema at all anymore.
Change first arose out of discussion started by @victorlin in a related
PR.² Subsequent discussion with @tsibley³ led to dropping support for
the unreleased schema.
¹ <#461>
² <#462 (comment)>
³ <#472 (comment)>
Co-authored-by: Thomas Sibley <tsibley@fredhutch.org>
fad921c to
8e98a95
Compare
…run` compatibility
Workflows are now registered under a top level "workflows" key in the
nextstrain-pathogen.yaml file, and each workflow has its own
"compatibility" declarations. The only current "compatibility"
declaration remains "nextstrain run".
This allows for more meta information about each workflow to be
registered in one place not under the banner of compatibility.
The nextstrain-pathogen.yaml schema changes to a workflow-first
structure like:
workflows:
ingest:
compatibility:
nextstrain run: yes
phylogenetic:
compatibility:
nextstrain run: no
instead of a compatibility-first structure like:
compatibility:
nextstrain run:
ingest: yes
phylogenetic: no
as previously introduced¹ (but not yet released). As it was never
released, don't bother supporting the latter schema at all anymore.
Change first arose out of discussion started by @victorlin in a related
PR.² Subsequent discussion with @tsibley³ led to dropping support for
the unreleased schema.
¹ <#461>
² <#462 (comment)>
³ <#472 (comment)>
Co-authored-by: Thomas Sibley <tsibley@fredhutch.org>
…run` compatibility
Workflows are now registered under a top level "workflows" key in the
nextstrain-pathogen.yaml file, and each workflow has its own
"compatibility" declarations. The only current "compatibility"
declaration remains "nextstrain run".
This allows for more meta information about each workflow to be
registered in one place not under the banner of compatibility.
The nextstrain-pathogen.yaml schema changes to a workflow-first
structure like:
workflows:
ingest:
compatibility:
nextstrain run: yes
phylogenetic:
compatibility:
nextstrain run: no
instead of a compatibility-first structure like:
compatibility:
nextstrain run:
ingest: yes
phylogenetic: no
as previously introduced¹ (but not yet released). As it was never
released, don't bother supporting the latter schema at all anymore.
Change first arose out of discussion started by @victorlin in a related
PR.² Subsequent discussion with @tsibley³ led to dropping support for
the unreleased schema.
¹ <#461>
² <#462 (comment)>
³ <#472 (comment)>
Co-authored-by: Thomas Sibley <tsibley@fredhutch.org>
8e98a95 to
20c99d4
Compare
Description of proposed changes
Allow maintainers to define paths to each workflow's Snakefile and configfile in the nextstrain-pathogen.yaml. Using mpox as an example, this allows the named workflows to use the shared Snakefile and their own custom configfiles without needing to create separate Snakefiles for each workflow.
Try testing this with
Related issue(s)
Related to #454
Related to nextstrain/mpox#330
Checklist