This tutorial provides an example of a full integration workflow, from data download to downstream analysis, using the scAtlasTb toolbox.
There are two workflows defined under `configs/` that use the Hrovatin et al. (2023) datasets:

- `configs/qc.yaml`: Quality control workflow including doublet detection.
- `configs/integration_benchmark.yaml`: Full integration benchmark workflow using multiple integration methods and evaluating integration performance with a variety of metrics.
The workflow has been built for Linux distributions and relies on Conda for environment management. Please ensure you have either Miniforge, Conda or Miniconda installed. The toolbox has also been tested on macOS, but hardware acceleration may not be available.
Clone this repository as well as the scAtlasTb repository, and make sure that both are in the same parent directory:
```bash
git clone https://github.com/lueckenlab/scAtlasTb_Tutorial.git
git clone https://github.com/HCA-integration/scAtlasTb.git
```

Note: This can take some time, so make sure you prepare this ahead of time. If you have already set up scAtlasTb and have the conda environments installed, you can skip this step.
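After cloning, the two repositories should sit side by side in the same parent directory (here `parent_directory` is just a placeholder name):

```
parent_directory/
├── scAtlasTb/
└── scAtlasTb_Tutorial/
```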
Set up the conda environments from scAtlasTb/envs as described in the scAtlasTb documentation.
For this tutorial we recommend you use the local environment mode.
You will need the following environments (see the sketch after this list for one way to create them):

- `snakemake`: for running the workflow
- `qc`: for the doublet computation and quality control workflow
- `scvi-tools`: for integration with scVI-based methods
- `harmony_cpu`: for Harmony integration (or `rapids_singlecell` or `harmony_pytorch` for GPU-based Harmony)
- `drvi` (optional): for DRVI integration
- `scarches` (optional): for scPoli integration
- `rapids_singlecell` (optional): for GPU-accelerated scanpy operations
- `scib`: for computing integration metrics
- `funkyheatmap`: for advanced metrics visualizations (check out the documentation on how to use it with Apple Silicon)
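As a rough sketch of the setup (assuming each environment has a correspondingly named `<name>.yaml` file under `scAtlasTb/envs`; verify the actual file names and the local environment mode instructions in the scAtlasTb documentation), the core environments could be created in a loop:

```bash
# Sketch: create the core conda environments one by one.
# Assumes each environment has a matching YAML file under scAtlasTb/envs;
# check the actual file names in that directory before running.
for env in snakemake qc scvi-tools harmony_cpu scib funkyheatmap; do
    conda env create -f "scAtlasTb/envs/${env}.yaml"
done
```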
The only exception is the `scanpy` environment, which you should install from this repository (`scAtlasTb_Tutorial`) to ensure compatibility with the downstream analysis example. You can find the environment file under `scAtlasTb_Tutorial/envs/scanpy.yaml`, as well as instructions under `scAtlasTb_Tutorial/envs/README.md`.
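For example, from the `scAtlasTb_Tutorial` directory:

```bash
# Create the tutorial's scanpy environment from this repository
conda env create -f envs/scanpy.yaml
```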
If you don't have Jupyter Lab installed yet, you can set it up in a separate conda environment:

```bash
conda env create -f envs/jupyterlab.yaml
```

Please refer to the README in the `envs` folder for more details.
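Once created, activate it and start Jupyter Lab from the repository root (assuming the environment is named `jupyterlab` in the YAML file):

```bash
conda activate jupyterlab
jupyter lab
```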
The tutorial uses publicly available datasets from Hrovatin et al. (2023) as example data for the integration benchmark.
Use the provided notebook under notebooks/Hrovatin_2023.ipynb to download and prepare the data.
The dataset used by the tutorial will be stored in data/Hrovatin_2023.zarr.
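As a quick, optional sanity check, you can confirm that the store loads. This is a sketch assuming the `scanpy` environment from above provides the `anndata` package (which ships `read_zarr`):

```bash
# Sketch: open the zarr store and print a summary of the AnnData object.
# Assumes the scanpy environment provides the anndata package.
conda activate scanpy
python -c "import anndata as ad; print(ad.read_zarr('data/Hrovatin_2023.zarr'))"
```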
Since this tutorial is already set up with functioning configuration files, you can directly run the workflow after setting up the conda environments.
Activate the snakemake environment before running the workflow:
```bash
conda activate snakemake
```

Then, you can run the workflow with the following command:

```bash
bash run.sh <target> -nq
```

The target can be any target defined by the pipeline and is passed as input to the `run.sh` script. The flag `-n` enables dry-run mode, which lets you see which jobs would be executed without actually running them, while `-q` suppresses Snakemake's output to only show a summary of the workflow.
You should always run the workflow first in dry-run mode to ensure everything is set up correctly.
Note: Please refer to the documentation for more details on available targets and how to run the workflow with different options.
To run the QC workflow, use the following command:
```bash
bash run.sh qc_all -nq
```

which should give you the following output:
```
Building DAG of jobs...
Job stats:
job                       count
----------------------  -------
doublets_collect              9
doublets_split_batches        9
qc_all                        1
qc_autoqc                     9
qc_get_thresholds             9
qc_merge_thresholds           1
qc_plot_joint                 9
qc_plot_removed               9
split_data_link               9
split_data_split              1
total                        66
```

Note: If you instead see an error like "Directory cannot be locked", another Snakemake instance is probably running on this directory, or a previous run exited unexpectedly; see the Snakemake documentation on unlocking the working directory.
You can ignore any warnings that appear before the yellow Snakemake log.
Inspect the config file under `configs/qc.yaml` to see which steps are included in the workflow and how they match the dry-run output. Consider adjusting the workflow in the config, e.g. if you have limited resources and want to simplify it.
If the dry-run works as expected, you can run the actual workflow with multiple cores:
```bash
bash run.sh qc_all -c3
```

Be mindful of your computational resources and avoid using all cores available on your machine, especially if you have limited memory. When in doubt, use a single core (`-c1`).
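For example, to run the QC workflow on a single core:

```bash
bash run.sh qc_all -c1
```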
Once the workflow has finished, you can inspect the output under data/images/qc/.
Refer to the scAtlasTb documentation for more details on the output files.
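To get a quick overview of the generated files:

```bash
ls data/images/qc/
```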
The integration benchmark workflow is a lot more complex than the QC workflow and may take a long time to run depending on your hardware.
Look into the config file under `configs/integration_benchmark.yaml` to see which steps are included in the workflow, and consider adjusting it, e.g. by removing some integration methods if resources are limited. Since the workflow contains many more options that are re-used by different steps, many of the defaults are configured under `configs/defaults.yaml`.
Read about defaults in the scAtlasTb documentation.
Check the dry-run output first:
```bash
bash run.sh integration_all metrics_all -nq
```

You can specify any target that is defined in one of the input maps in the config files.
Call the actual integration workflow with multiple cores:
```bash
bash run.sh integration_all -c3
```

If any methods fail, or the workflow takes too long and you just want a proof of concept, consider adjusting the workflow in the config, e.g. by removing some integration methods.
If the workflow has finished successfully, you can inspect the integration UMAPs under data/images/integration/umap/.
Continue with the metrics to complete the benchmark:
```bash
bash run.sh metrics_all -c3
```

The metrics plots will be stored under `data/images/metrics/`.
There are additional post-processing steps defined in the benchmark workflow: splitting by cell type, clustering, label transfer, and marker gene computation. The `collect` step gathers the different integration outputs and combines them into a single AnnData file for easier downstream analysis. Finally, the `majority_voting` step computes consensus labels based on the label transfer results from the different integration methods.
```bash
bash run.sh clustering_all majority_voting_all -c3
```

The final output will be stored under `data/pipeline/majority_voting/dataset~integration_benchmark_beta/`.
Follow the notebook under notebooks/Evaluate_integrations.ipynb for an example of downstream analysis using the integrated data and consensus labels from the benchmark workflow.
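One way to open it, assuming the Jupyter Lab environment from the setup section (the environment name may differ depending on your YAML file):

```bash
conda activate jupyterlab
jupyter lab notebooks/Evaluate_integrations.ipynb
```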