BADGER (Benchmark Ancient DNA GEnetic Relatedness) is an automated snakemake pipeline designed to jointly benchmark the classification performance and accuracy of several previously published ancient DNA genetic relatedness estimation methods. To generate its input test data, BADGER leverages high-definition pedigree simulations, followed by the simulation of raw ancient DNA `.fastq` sequences, through extensive use of the ped-sim and gargammel software, respectively.
*Figure 1: A summarized diagram view of the BADGER workflow*
```bash
git clone --recursive https://github.com/MaelLefeuvre/BADGER.git
cd ./BADGER
```

Important

Note that all subsequent commands described in this README are executed from the root directory of this repository.
BADGER is generally designed to operate in high-performance computing environments, and will benefit from as many CPU cores, and as much memory and disk space, as possible. The following specifications are the minimum system requirements for running BADGER:
- Processor: 24 CPU-cores
- Memory: 32GB of RAM
- Storage: 128GB of available disk space
The installation of BADGER is currently only supported on GNU/Linux operating systems (e.g. Ubuntu, Debian, ArchLinux). Although other UNIX-like operating systems, such as macOS, FreeBSD and Solaris, are technically compatible, they are not currently tested as part of the software's continuous integration. If you experience any compatibility issues when installing BADGER on an unsupported UNIX-like system, feel free to submit an issue in this GitHub repository to help us improve the software's interoperability.
BADGER currently supports two alternative procedures for installing the badger and badger-plots command line programs. These two procedures are mutually exclusive; choose whichever suits your personal preference, your familiarity with either environment manager, and/or your workstation's current setup:
- Using miniconda. This procedure should be considered the default installation method. See the chapter conda installation.
- Using pixi. This procedure is currently experimental, but may become the default in future updates of badger. See the chapter pixi installation.
BADGER is written using the snakemake workflow management system and relies extensively on the conda environment manager to ensure both interoperability and reproducibility. Hence, a prerequisite for using BADGER is that users first install conda on their system, since the badger command line interface is itself designed to be embedded within a conda environment.
To install conda, check the documentation of miniconda3 and review the detailed installation instructions beforehand.
On x86_64-Linux architectures, a default installation may be achieved using the following commands:
```bash
MINICONDA_URL="https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh"
wget $MINICONDA_URL && bash $(basename $MINICONDA_URL)
```

Once conda is installed, BADGER itself can be installed using the provided installation script:

```bash
bash ./badger/install.sh
```

This command should seamlessly create a dedicated conda environment for BADGER, called badger-0.5.2. This environment should contain the following programs and dependencies:
| software | version |
|---|---|
| python | 3.11.0 |
| R | >=4.1.2 |
| snakemake | 7.20.0 |
| badger | >=0.5.2 |
| badger-plots | >=0.5.2 |
A very basic test suite can be run using the following command. It ensures every program and dependency can be found within the PATH, and runs several integration tests:

```bash
./badger/install.sh test
```

To access and run the badger and badger-plots command line programs using conda, simply activate the conda environment from the root of this repository:
```bash
conda activate badger-0.5.2 # activate the environment
badger --help               # run the badger CLI
badger-plots --help         # run the badger-plots CLI
```

Tip
For users wishing to manually install BADGER within a custom environment, a detailed step-by-step procedure may be found here: manual installation
Alternatively, badger can also be fully embedded into a pixi environment, to ensure complete isolation of all of its dependencies (including conda).
Should this installation procedure be your preferred option, you will need a working installation of the pixi package manager on your system. To install pixi, check the installation instructions on their website beforehand.
On x86_64-Linux architectures, a default installation may be achieved using the following commands:
```bash
wget -qO- https://pixi.sh/install.sh | sh
```

```bash
pixi install       # install the default environment
pixi run configure # install and configure badger-plots (this is a one-time operation)
```

These commands should seamlessly create a dedicated default pixi environment containing both badger and badger-plots, called badger. This environment should contain the following programs and dependencies:
| software | version |
|---|---|
| python | 3.11.0 |
| R | >=4.1.2 |
| snakemake | 7.20.0 |
| badger | >=0.5.2 |
| badger-plots | >=0.5.2 |
A very basic test suite can be run using the following command. It ensures every program and dependency can be found within the PATH, and runs several integration tests:

```bash
pixi run test
```

To access and run the badger and badger-plots command line programs using pixi, one can either start a pixi shell in the root of this repository:

```bash
pixi shell # Badger setup...
```

...or use the dedicated pixi tasks:

```bash
pixi run badger --help       # run the badger CLI in a pixi environment
pixi run badger-plots --help # run the badger-plots CLI in a pixi environment
```

Note that every subsequent command in this README assumes it is run directly from the pixi shell, but all of these commands will of course also work using the dedicated pixi tasks method, by simply prepending "pixi run" to the given command.
BADGER will require several datasets in order to run properly, mainly:
- The 1000g-phase3 genotype dataset, version v5b-20130502
- The Allen Ancient DNA Resource "1240K" compendium dataset, version v52.2 (Mallick et al. 2024)
- The GRCh37 Reference genome assembly (release-113)
- The HapMapII genetic recombination map
- The refined sex-specific genetic map from Bhérer et al. 2017
- The sex-specific cross-over interference map from Campbell et al. 2015
- The supporting dataset and files of TKGWV2 (Fernandes et al. 2021)
Here, BADGER is bundled with a workflow specifically tailored to download these datasets and place them in their correct locations for future runs:
```bash
badger setup
```

This module will:

- Download and place all the datasets required for future runs in a `./data` directory.
- Pre-create all the required conda environments for future runs. (Note that these environments are local to this specific workflow, and will be located under the hidden folder `.snakemake/conda`.)
Warning
By default, badger will run snakemake using half of the available cores and memory. This behaviour can be modified using the --cores and --mem-mb arguments.
Tip
If you wish to instead manually download some or all of the listed datasets required to run BADGER, detailed download instructions may be found here: Manual download instructions
The simulation parameters and general behaviour of BADGER can be configured by modifying the config/config.yml configuration file. Note that this file generally follows the YAML format specifications and is provided as-is to snakemake when running the pipeline.
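Because the file is passed as-is to snakemake, a malformed edit will fail the whole run; a quick syntax check with any YAML parser can catch this early. The sketch below uses Python with PyYAML (assumed to be available, since python ships in the badger environment) on a mock file, so it is runnable anywhere; point it at config/config.yml instead:

```shell
# Validate a YAML file by simply parsing it (prints OK on success).
cat > demo-config.yml <<'EOF'
gargammel:
  coverage: 0.05
EOF
python3 -c 'import yaml, sys; yaml.safe_load(open(sys.argv[1])); print("OK")' demo-config.yml
```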
Important
An extensive explanation of every keyword may be found here: config parameters reference.
Note that sensible defaults are provided for all parameters in this file, and BADGER is expected to run smoothly without modification. However, we recommend that users at least adjust the parameters of gargammel and ped-sim, in order to tailor BADGER's benchmarking results to their own use-cases:
Example gargammel configuration:
```yaml
gargammel:
  coverage: 0.05        # average sequencing depth (X) [0-Inf]
  comp-endo: 0.98       # proportion of endogenous sequences [0-1]
  comp-cont: 0.02       # proportion of human contaminant sequences [0-1]
  comp-bact: 0.05       # proportion of bacterial contamination
  pmd-model: "briggs"   # See (Briggs et al. 2007)
  briggs:
    nick-frequency: 0.024    # per-base nick frequency rate (nu)
    overhang-length: 0.36    # single-stranded overhang rate (lambda)
    ds-deaminations: 0.0097  # double-stranded DNA deamination rate (delta)
    ss-deaminations: 0.68    # single-stranded DNA deamination rate (delta_ss)
```

For more advanced use, the documentation of gargammel-1.1.4 may be found here: gargammel: simulations of ancient DNA datasets
Example ped-sim configuration
```yaml
ped-sim:
  replicates: 1 # Number of pedigree replicates for this run.
  data:
    codes: "resources/ped-sim/ped-definition/outbred/pedigree_codes.txt"
    definition: "resources/ped-sim/ped-definition/outbred/pedigree.def"
  params:
    pop: "TSI" # 1000g-phase3 (super-)population code
```

Here, the files provided in data['codes'] and data['definition'] should provide a good starting point for benchmarking the available kinship estimation methods up to the third degree. Advanced configurations of BADGER for specific use-cases might however require designing custom files. When such is the case, detailed explanations on the format and purpose of these files may be found by following this link: advanced-ped-sim-configuration
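For orientation, a minimal custom pedigree definition might look like the sketch below, which follows ped-sim's def-file syntax: a `def` header naming the pedigree, its number of copies and number of generations, followed by one line per generation to print. The names and numbers here are purely illustrative; refer to the advanced-ped-sim-configuration page for the authoritative format:

```
# Illustrative pedigree.def: 3 replicate pedigrees spanning 2 generations,
# printing the 2 full siblings found in generation 2.
def full-sibs 3 2
2 2
```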
The basic workflow of BADGER is divided into several steps, each with their dedicated command line program and module:
```mermaid
---
config:
  look: handDrawn
  theme: dark
---
graph TD;
    setup@{ shape: sm-circ, label="setup" };
    setup -.-> run;
    subgraph badger
        run@{ shape: rounded };
        archive@{ shape: rounded };
        run --> archive;
        archive --> run;
    end;
    subgraph badger-plots
        archive --> make-input;
        archive --> template;
        make-input --> plot;
        template --> plot;
        make-input@{ shape: rounded };
        template@{ shape: rounded };
        plot@{ shape: rounded };
    end;
    view@{ shape: fr-circ };
    plot --> view;
```
- Run and archive multiple benchmark replicates using the `badger` helper command line program:
  - Running a benchmark replicate is mainly done by the `badger run` module.
  - As the uncompressed output of BADGER may use up a lot of disk space, the `badger archive` module should be applied to efficiently store results in a dedicated directory.
  - As running hundreds of benchmark replicates in parallel using whole genome sequences can be computationally intensive, we recommend the use of `badger loop-pipeline` to subdivide the replicates into more manageable chunks. These chunks can act as data checkpoints and will be automatically archived by the software.
- Plot and summarize the benchmarking results using `badger-plots`. Plotting these results will require the use of two yaml files:
  - A first yaml file details which archived results should be used for plotting. This file can easily be generated using the `badger-plots make-input` module.
  - A second yaml file provides the plotting parameters. A preconfigured template for this parameter file can be obtained by querying the `badger-plots template` module.
  - Plotting can then be achieved using `badger-plots plot`, once in possession of these two yaml files.
Note
This quick start example assumes you have correctly installed and set up BADGER, and that its corresponding conda environment has been activated. See the dedicated sections Installing BADGER and Configuring BADGER if you have not already done so.
Once properly configured, a single run of BADGER can be executed using the following command:

```bash
badger run --cores 32 --mem-mb 64000 --verbose
```

This example will start the snakemake pipeline with sensible defaults, requesting 32 cores and 64 gigabytes of memory as computational resources, and using the parameters specified in config/config.yml.
Once completed, all generated results should be contained within a ./results directory at the root of BADGER's repository.
Once an iteration of badger run has completed, the simulation results may be stored and archived in a compressed form, in a separate directory. This directory may be specified by providing a valid path in the config/config.yml file:
```yaml
archive:
  archive-dir: "/path/to/an/archive/folder"
  compress-level: 9
```

```bash
badger archive
```

This command will generate a subdirectory within the specified archive directory, called run-<runid>, and copy the final results of the previous BADGER run in an archived form. Note that BADGER does not use any proprietary compression format: results are compressed using only the .cram, .xz and .tar.gz formats, and all of the resulting files may still be manually reviewed using tar, xz and samtools. Here, BADGER provides a dedicated module to conveniently decompress these archived results. See the dedicated chapter Unpacking archived runs.
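Because only standard formats are involved, these bundles can be reviewed with stock command line tools. The sketch below first builds a mock .tar.xz bundle (all file names are illustrative), so that the inspection commands are runnable as-is:

```shell
# Illustrative only: build a mock result bundle, then inspect it the way one
# might inspect a .tar.xz file taken from a badger archive.
mkdir -p results && echo "relatedness table" > results/ped1.txt
tar -cJf ped1.tar.xz results/ped1.txt      # create a .tar.xz bundle
tar -tJf ped1.tar.xz                       # list its contents without extracting
tar -xJf ped1.tar.xz -O results/ped1.txt   # stream a single member to stdout
```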
Click here for a detailed summary of the structure of a BADGER archive:
```text
archive/
├── archive-metadata.yml        # Global list of md5 checksums for every archived file
├── run-000/                    # Archived results of a first BADGER run
│   ├── config                  # Stored config/config.yml file.
│   └── results
│       ├── 01-gargammel
│       │   └── contaminants    # 1000g sample ids of contaminating individuals used during this run
│       ├── 02-preprocess
│       │   ├── 05-dedup        # Per-pedigree processed binary alignment files (.cram format)
│       │   └── 06-mapdamage
│       ├── 03-variant-calling
│       │   └── 00-panel        # SNP positions targeted during this run (.bed.xz format)
│       ├── 04-kinship          # Per-pedigree kinship estimation results of each method for this run
│       │   ├── correctKin
│       │   ├── GRUPS
│       │   ├── KIN
│       │   ├── READ
│       │   ├── READv2
│       │   └── TKGWV2
│       └── meta                # Selected seed, and git version hash of the software used during this run.
│
└── run-001/                    # Archived results of a second BADGER run
    ...
```

Note that some of these files (most notably those located under run-<runid>/results/02-preprocess) are not needed to estimate classification performance, but may serve as a useful checkpoint to re-apply BADGER from already processed bam files.
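The md5 checksums listed in archive-metadata.yml can in principle be re-verified with standard tools. The exact layout of archive-metadata.yml is not described here, so the sketch below simply mocks up a manifest in md5sum's own format to show the principle:

```shell
# Sketch: re-verifying archived files against an md5 manifest.
# File names and manifest layout are illustrative, not badger's actual format.
mkdir -p demo-archive && echo "result" > demo-archive/ped1.txt
( cd demo-archive && md5sum ped1.txt > manifest.txt && md5sum -c manifest.txt )
```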
As the process of running and archiving hundreds of pedigree simulation replicates in parallel may grow the number of snakemake jobs to unmanageable levels for the available computational resources, the badger loop-pipeline module can be used to sequentially run and archive several rounds of BADGER simulations. Hence, the following command:

```bash
badger loop-pipeline --iterations 10 -c 32 -m 48000
```

...is expected to sequentially run the badger run and badger archive modules for a total of ten loops, leveraging 32 cores and 48GB of memory during every run, and using the same config/config.yml configuration file.
Following this example, and upon the completion of the command, users should thus expect a total of 10 archived runs in their specified archive directory:
Click here for a detailed view of the expected directory structure after running the previous command:
```text
archive/
├── archive-metadata.yml
├── run-000/
│   └── ...
├── run-001/
├── run-002/
├── run-003/
├── run-004/
├── run-005/
├── run-006/
├── run-007/
├── run-008/
└── run-009/
```
The badger unpack module provides a set of utilities for users wishing to review previously archived results in a decompressed form. Hence, the following command:

```bash
badger unpack all --archive-dir ./path/to/archive/dir --output-dir ./path/to/output/dir
```

...will target all of the files found in the specified BADGER archive directory (--archive-dir), and create a decompressed copy of these files in the specified output directory (--output-dir).
Note that specific filetypes may also be requested. Thus, the following command:

```bash
badger unpack READ READv2 KIN -a /path/to/archive/dir -o /path/to/output/dir
```

...will instead only decompress the kinship estimation results of READ, READv2 and KIN. A complete list of all available filetypes and arguments may be obtained by running the help command `badger unpack -h`.
Users wishing to re-apply a previously archived round of BADGER simulations may do so using the badger rerun module. Doing so may be helpful to evaluate the impact of modifying variant calling and/or kinship estimation parameters on classification performance, using a constant dataset.
```bash
badger rerun --input-archive-dir /path/to/input/archive/dir
```

This command will sequentially run badger unpack, badger run and badger archive on the specified --input-archive-dir. Note that the program will automatically archive the updated results in the directory specified through the archive-dir: keyword, within the usual config/config.yml.
Summarizing the output of multiple BADGER runs through statistical analysis and plotting is handled by the badger-plots command line program. Using badger-plots will generally imply that you have applied BADGER in multiple replicates on several parameter sets, each representing a given biological condition (e.g. applying BADGER on several simulated average sequencing depths, by modifying the value of gargammel['coverage'] in the usual config/config.yml file).
Important
Note that each simulated biological condition is expected to be stored in a separate archive directory for badger-plots to work, as the program generally relies on the assumption that the data was compressed and structured using badger archive.
Throughout this section, we'll assume that a user has previously generated multiple replicates of BADGER to estimate the impact of single-stranded deaminations (range: [0, 30]%), at a set sequencing depth of 0.02X.
Here, the user has collected every corresponding set of archives into a single directory, and now wishes to evaluate the classification performance of READv2 and KIN on this parameter space.
An illustration of this use-case can be seen by clicking here
```text
badger-archived-runs-ssdeamination-0.02X-TSI/
├── 0percent     # badger loop-pipeline -i 20 -- --config gargammel='{pmd-model: briggs, briggs: {ss-deaminations: 0.0}}'
│   ├── archive-metadata.yml
│   ├── run-000/
│   ├── run-001/
│   ...
│   └── run-020/
├── 10percent    # badger loop-pipeline -i 20 -- --config gargammel='{pmd-model: briggs, briggs: {ss-deaminations: 0.1}}'
│   ├── archive-metadata.yml
│   ...
│   └── run-020/
├── 20percent    # badger loop-pipeline -i 20 -- --config gargammel='{pmd-model: briggs, briggs: {ss-deaminations: 0.2}}'
│   ...
│   └── run-020/
└── 30percent    # badger loop-pipeline -i 20 -- --config gargammel='{pmd-model: briggs, briggs: {ss-deaminations: 0.3}}'
    ...
    └── run-020/
```

badger-plots will first require two input yaml files to summarize these simulation results:
- An `input.yml` file, which tells the software what data should be used as input to test the methods, as well as how that data is structured.
- A `params.yml` file, which provides the software with plotting parameters.
A yaml input definition file can be generated using the badger-plots make-input module:
```bash
mkdir plots
badger-plots make-input \
    --archive-dir badger-archived-runs-ssdeamination-0.02X-TSI/ \
    --subdirs 0percent,10percent,20percent,30percent \
    --methods READv2,KIN > plots/input.yml
```

- `--archive-dir` should specify a main directory, where all of the desired BADGER archive directories are located.
- `--subdirs` is used to provide a comma-separated list of biological conditions, each corresponding to a subdirectory found within `--archive-dir`. (Note that this argument can also accept PCRE regular expressions for pattern matching.)
- `--methods` is used to provide a comma-separated list of kinship estimation methods to test.
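As an aside, since --subdirs accepts PCRE patterns, the four conditions listed above could also be selected with a single pattern such as '\d+percent'. How badger-plots anchors the pattern internally is an assumption here, but the matching itself can be previewed with GNU grep:

```shell
# Preview which subdirectory names a PCRE pattern would select
# (a fifth, non-matching name is included to show the filtering).
printf '%s\n' 0percent 10percent 20percent 30percent misc | grep -P '^\d+percent$'
```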
The previous command should thus generate a yaml file that follows the following base structure, where each input file is hierarchically linked to a given method and biological condition:
```yaml
0percent:
  KIN:
    - badger-archived-runs-ssdeamination-0.02X-TSI/0percent/run-000/results/04-kinship/KIN/ped1.tar.xz
    - ...
  READv2:
    - badger-archived-runs-ssdeamination-0.02X-TSI/0percent/run-000/results/04-kinship/READv2/ped1.tar.xz
    - ...
10percent:
  KIN: ...
  READv2: ...
20percent:
  KIN: ...
  READv2: ...
30percent:
  KIN: ...
  READv2: ...
```

A template params.yml file containing default values for plotting can be generated with the template module. Here, the previously generated input.yml and the pedigree-codes definition used throughout the simulations can be directly provided:
```bash
badger-plots template \
    --input plots/input.yml \
    --pedigree-codes resources/ped-sim/ped-definition/outbred/pedigree_codes.txt \
    --output-dir plots > plots/params.yml
```

The resulting plots/params.yml file is a simple yaml configuration file containing default plotting parameters. These default values can of course be modified at leisure.
Important
A detailed summary of every parameter can be found here: badger-plots plotting parameters reference
Note that almost all parameters come with sensible defaults. Thus, input and pedigree-codes are the only two parameters that need to be explicitly set by users.
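As a purely illustrative sketch, a minimally edited params.yml might therefore only pin down these two entries. The exact key names and their position in the file are assumptions here; the generated template remains the authoritative reference:

```yaml
# Minimal illustrative sketch of a hand-edited params.yml.
# The key names below are assumed; keep every other value from the template.
input: "plots/input.yml"
pedigree-codes: "resources/ped-sim/ped-definition/outbred/pedigree_codes.txt"
```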
Once in possession of an input.yml and a properly configured params.yml file, plotting is simply achieved using the plot module:

```bash
badger-plots plot --yaml plots/params.yml --threads 32
```

If your workstation is running a version of conda <4.7.0 and you are unable to update it, the installation process of BADGER may be sluggish, as these older versions of conda are single-threaded and make use of a suboptimal solver. In such cases, we recommend installing mamba within your base environment. The CONDA_EXE environment variable may then be used to redirect the conda frontend to mamba when running the installation:

```bash
CONDA_EXE="mamba" ./badger/install.sh
```

When setting up badger, the mamba frontend may also be specified as a snakemake option, like so:

```bash
badger setup -- --conda-frontend mamba
```

For any questions or issues related to the use of BADGER, please directly submit an issue on the GitHub page of this repository. Suggestions, feature requests and code contributions are also welcome.
