diff --git a/CHANGELOG.md b/CHANGELOG.md
index 36613b99..33c35d60 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -11,6 +11,8 @@ releases are available on [PyPI](https://pypi.org/project/pytask) and
   default pickle protocol.
 - {pull}`???` adapts the interactive debugger integration to Python 3.14's updated
   `pdb` behaviour and keeps pytest-style capturing intact.
+- {pull}`???` updates the comparison-to-other-tools documentation and adds a section on
+  the Common Workflow Language (CWL) and WorkflowHub.
 
 ## 0.5.7 - 2025-11-22
 
diff --git a/docs/source/explanations/comparison_to_other_tools.md b/docs/source/explanations/comparison_to_other_tools.md
index 2d8131f6..2847c8d1 100644
--- a/docs/source/explanations/comparison_to_other_tools.md
+++ b/docs/source/explanations/comparison_to_other_tools.md
@@ -10,124 +10,111 @@ in other WMFs.
 
 ## [snakemake](https://github.com/snakemake/snakemake)
 
-Pros
-
-- Very mature library and probably the most adapted library in the realm of scientific
-  workflow software.
-- Can scale to clusters and use Docker images.
-- Supports Python and R.
-- Automatic test case generation.
-
-Cons
-
-- Need to learn snakemake's syntax which is a mixture of Make and Python.
-- No debug mode.
-- Seems to have no plugin system.
+Snakemake is one of the most widely adopted workflow systems in scientific computing.
+It scales from local execution to clusters and the cloud, with built-in support for
+containers and conda environments. Workflows are defined in a DSL that combines
+Make-style rules with Python and can be exported to CWL for portability.
 
 ## [ploomber](https://github.com/ploomber/ploomber)
 
-General
-
-- Strong focus on machine learning pipelines, training, and deployment.
-- Integration with tools such as MLflow, Docker, AWS Batch.
-- Tasks can be defined in yaml, python files, Jupyter notebooks or SQL.
-
-Pros
-
-- Conversion from Jupyter notebooks to tasks via
-  [soorgeon](https://github.com/ploomber/soorgeon).
-
-Cons
-
-- Programming in Jupyter notebooks increases the risk of coding errors (e.g.
-  side-effects).
-- Supports parametrizations in form of cartesian products in `yaml` files, but not more
-  powerful parametrizations.
+Ploomber focuses on machine learning pipelines and integrates with tools such as
+MLflow, Docker, and AWS Batch. Tasks can be defined in YAML, Python files, Jupyter
+notebooks, or SQL, and notebooks can be converted into pipeline tasks.
 
 ## [Waf](https://waf.io)
 
-Pros
-
-- Mature library.
-- Can be extended.
-
-Cons
-
-- Focus on compiling binaries, not research projects.
-- Bus factor of 1.
+Waf is a mature build system primarily designed for compiling software projects. It
+handles complex build dependencies and can be extended with Python.
 
 ## [nextflow](https://github.com/nextflow-io/nextflow)
 
-- Tasks are scripted using Groovy which is a superset of Java.
-- Supports AWS, Google, Azure.
-- Supports Docker, Shifter, Podman, etc.
+Nextflow is a workflow system popular in bioinformatics that runs on AWS, Google
+Cloud, and Azure. It uses Groovy (a JVM language) for scripting and has strong support
+for containers, including Docker, Singularity, and Podman.
 
 ## [Kedro](https://github.com/kedro-org/kedro)
 
-Pros
-
-- Mature library, used by some institutions and companies. Created inside McKinsey.
-- Provides the full package: templates, pipelines, deployment
+Kedro is a mature workflow framework developed at McKinsey that provides project
+templates, data catalogs, and deployment tooling. It is designed for production
+machine learning pipelines with a focus on software engineering best practices.
 
 ## [pydoit](https://github.com/pydoit/doit)
 
-General
-
-- A general task runner which focuses on command line tools.
-- You can think of it as an replacement for make.
-- Powers Nikola, a static site generator.
+pydoit is a general-purpose task runner that serves as a Python replacement for Make.
+It focuses on executing command-line tools and powers projects like Nikola, a static
+site generator.
 
 ## [Luigi](https://github.com/spotify/luigi)
 
-General
-
-- A build system written by Spotify.
-- Designed for any kind of long-running batch processes.
-- Integrates with many other tools like databases, Hadoop, Spark, etc..
-
-Cons
-
-- Very complex interface and a lot of stuff you probably don't need.
-- [Development](https://github.com/spotify/luigi/graphs/contributors) seems to stall.
+Luigi is a workflow system built by Spotify for long-running batch processes. It
+integrates with Hadoop, Spark, and various databases for large-scale data pipelines.
+Development has slowed in recent years.
 
 ## [sciluigi](https://github.com/pharmbio/sciluigi)
 
-sciluigi aims to be a lightweight wrapper around luigi.
-
-Cons
-
-- [Development](https://github.com/pharmbio/sciluigi/graphs/contributors) has basically
-  stalled since 2018.
-- Not very popular compared to its lifetime.
+sciluigi is a lightweight wrapper around Luigi aimed at simplifying scientific
+workflow development. It reduces some of Luigi's boilerplate for research use cases.
+Development has stalled since 2018.
 
 ## [scipipe](https://github.com/scipipe/scipipe)
 
-Cons
-
-- [Development](https://github.com/scipipe/scipipe/graphs/contributors) slowed down.
-- Written in Go.
+SciPipe is a workflow library written in Go for building robust, flexible pipelines
+using Flow-Based Programming principles. It compiles workflows to fast binaries and is
+designed for bioinformatics and cheminformatics applications involving command-line
+tools.
 
-## [Scons](https://github.com/SCons/scons)
-
-Pros
-
-- Mature library.
-
-Cons
-
-- Seems to have no plugin system.
+## [SCons](https://github.com/SCons/scons)
+
+SCons is a mature, cross-platform software construction tool that serves as an improved
+substitute for Make. It uses Python scripts for configuration, supports C, C++, Java,
+and Fortran, and performs automatic dependency analysis.
 
 ## [pypyr](https://github.com/pypyr/pypyr)
 
-General
-
-- A general task-runner with task defined in yaml files.
+pypyr is a task runner for automation pipelines defined in YAML. It provides built-in
+steps for common operations like loops, conditionals, retries, and error handling
+without requiring custom code, and is often used for CI/CD and DevOps automation.
+
+## [ZenML](https://github.com/zenml-io/zenml)
+
+ZenML is an MLOps framework for building portable ML pipelines that can run on various
+orchestrators including Kubernetes, AWS SageMaker, GCP Vertex AI, Kubeflow, and Airflow.
+It focuses on productionizing ML workflows with features like automatic
+containerization, artifact tracking, and native caching.
 
-## [zenml](https://github.com/zenml-io/zenml)
+## [Flyte](https://github.com/flyteorg/flyte)
 
-## [flyte](https://github.com/flyteorg/flyte)
+Flyte is a Kubernetes-native workflow orchestration platform for building
+production-grade data and ML pipelines. It provides automatic retries, checkpointing,
+and failure recovery, and scales dynamically across cloud providers including AWS,
+GCP, and Azure.
 
 ## [pipefunc](https://github.com/pipefunc/pipefunc)
 
-A tool for executing graphs made out of functions. More focused on computational
-compared to workflow graphs.
+pipefunc is a lightweight library for creating function pipelines as directed acyclic
+graphs (DAGs) in pure Python. It automatically handles execution order and supports
+map-reduce operations, parallel execution, and resource profiling.
+
+## [Common Workflow Language (CWL)](https://www.commonwl.org/)
+
+CWL is an open standard for describing data analysis workflows in a portable,
+language-agnostic format. Its primary goal is to let workflows be written once and
+executed without modification across different computing environments, from local
+workstations to clusters, cloud, and HPC systems. Workflows described in CWL can be
+registered on [WorkflowHub](https://workflowhub.eu/) for sharing and discovery
+following FAIR (Findable, Accessible, Interoperable, Reusable) principles.
+
+CWL is particularly prevalent in bioinformatics and the life sciences, where
+reproducibility across institutions is critical. Tools that support CWL include
+[cwltool](https://github.com/common-workflow-language/cwltool) (the reference
+implementation), [Toil](https://github.com/DataBiosphere/toil),
+[Arvados](https://arvados.org/), and [REANA](https://reanahub.io/). Some workflow
+systems, such as Snakemake and Nextflow, can export workflows to CWL.
+
+pytask is not a CWL-compliant tool because it operates on a fundamentally different
+model. CWL describes workflows as graphs of command-line tool invocations where data
+flows between tools via files. pytask, in contrast, orchestrates Python functions that
+can execute arbitrary code, manipulate data in memory, call APIs, or perform any
+operation available in Python. This Python-native approach enables features like
+interactive debugging but means pytask workflows cannot be represented in CWL's
+command-line-centric specification.
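+
+For illustration, the smallest unit in pytask is an ordinary Python function whose
+product is declared through a type annotation. The sketch below uses pytask's
+`Annotated`/`Product` interface; the file name and function name are only placeholders.
+
+```python
+from pathlib import Path
+from typing import Annotated
+
+from pytask import Product
+
+
+def task_write_file(path: Annotated[Path, Product] = Path("hello.txt")) -> None:
+    # The body may run arbitrary Python code: here it writes a file, but it could
+    # just as well transform data in memory or call an external API.
+    path.write_text("Hello, world!")
+```
+
+Running `pytask` collects such functions from `task_*.py` modules and builds the DAG
+from their annotations, whereas a CWL document would have to describe an equivalent
+step as a command-line invocation with declared file inputs and outputs.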