A Nextflow pipeline for phasing unphased genotype data using Beagle with reference panels from the 1000 Genomes Project.
This pipeline phases unphased VCF files with Beagle, a tool for haplotype phasing and genotype imputation. It preprocesses both the unphased data and the reference panels, phases each chromosome separately, and postprocesses the results.
- Preprocessing: Quality control and preparation of unphased VCF files
- Reference Panel Processing: Indexing and preparation of 1000 Genomes reference panels
- Phasing: Chromosome-by-chromosome phasing using Beagle
- Postprocessing: Concatenation and final processing of phased results
- Containerized: Uses Singularity containers for reproducibility
- Scalable: Supports SLURM cluster execution
- Nextflow (>= 22.04.0)
- Singularity (for containerized execution)
- SLURM (for cluster execution, optional)
- Clone the repository:

      git clone <repository-url>
      cd Phasing

- Configure parameters: edit `params/params_beagle.yml` with your input files:

      vcf_unphased: ./data/your_unphased.vcf.gz
      refcsv: ./params/files_beagle.csv
      outdir: ./results/
      output_prefix: phased_output

- Run the pipeline:

      # Local execution
      nextflow run main.nf -params-file params/params_beagle.yml -profile local

      # SLURM cluster execution
      nextflow run main.nf -params-file params/params_beagle.yml -profile kutral
- Unphased VCF: A VCF/BCF file containing unphased genotypes (must be indexed)
- Reference CSV: A CSV file with columns:
  - `chr`: Chromosome identifier
  - `ref_vcf`: Path to reference VCF file for that chromosome
  - `ref_vcf_index`: Path to reference VCF index file
  - `gmap`: Path to genetic map file
chr,ref_vcf,ref_vcf_index,gmap
1,/path/to/chr1_ref.vcf.gz,/path/to/chr1_ref.vcf.gz.csi,/path/to/chr1.gmap
2,/path/to/chr2_ref.vcf.gz,/path/to/chr2_ref.vcf.gz.csi,/path/to/chr2.gmap
...
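Before launching a run, it can help to check that the reference CSV has the four columns listed above and one row per chromosome. The following is a minimal sketch using only the Python standard library; the function name `read_reference_csv` and the duplicate-chromosome check are illustrative assumptions, not part of the pipeline.

```python
import csv
import io

REQUIRED_COLUMNS = ["chr", "ref_vcf", "ref_vcf_index", "gmap"]


def read_reference_csv(handle):
    """Parse the reference CSV and return one dict per chromosome row."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        raise ValueError(f"missing columns: {', '.join(missing)}")

    rows, seen = [], set()
    for row in reader:
        line_no = reader.line_num  # physical line just read from the file
        empty = [c for c in REQUIRED_COLUMNS if not (row[c] or "").strip()]
        if empty:
            raise ValueError(f"line {line_no}: empty field(s): {', '.join(empty)}")
        chrom = row["chr"].strip()
        if chrom in seen:  # assumption: each chromosome appears once
            raise ValueError(f"line {line_no}: duplicate chromosome {chrom}")
        seen.add(chrom)
        rows.append({c: row[c].strip() for c in REQUIRED_COLUMNS})
    return rows


# Usage with the example rows from this README (paths are placeholders).
example = io.StringIO(
    "chr,ref_vcf,ref_vcf_index,gmap\n"
    "1,/path/to/chr1_ref.vcf.gz,/path/to/chr1_ref.vcf.gz.csi,/path/to/chr1.gmap\n"
    "2,/path/to/chr2_ref.vcf.gz,/path/to/chr2_ref.vcf.gz.csi,/path/to/chr2.gmap\n"
)
panels = read_reference_csv(example)
print([p["chr"] for p in panels])
```

The check only inspects the CSV text; it does not verify that the referenced files exist.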
- Preprocessing:
  - Index VCF files
  - Fill AC (allele count) annotations
  - Remove duplicate variants
  - Remove missing genotypes
  - Prepare reference panels
- Phasing:
  - Extract chromosome-specific regions
  - Run Beagle phasing with reference panels
  - Index phased output
- Postprocessing:
  - Concatenate chromosome-specific results
  - Generate final phased VCF
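To make the steps above concrete, here is a dry-run sketch that builds (but does not execute) one plausible command per step. It is not the pipeline's actual process code: the intermediate file names, the `beagle.jar` path, and the specific bcftools options chosen for each step are assumptions, and reference panel preparation is left out.

```python
import shlex


def plan_commands(vcf, panels, output_prefix, threads=16, beagle_jar="beagle.jar"):
    """Return argv lists for a dry run of the documented steps."""
    # Hypothetical intermediate file names, chosen for illustration only.
    tagged = "unphased.ac.vcf.gz"
    dedup = "unphased.dedup.vcf.gz"
    clean = "unphased.clean.vcf.gz"

    cmds = [
        ["bcftools", "index", "-f", vcf],                                   # index input
        ["bcftools", "+fill-tags", vcf, "-Oz", "-o", tagged, "--", "-t", "AC"],  # fill AC
        ["bcftools", "norm", "-d", "exact", "-Oz", "-o", dedup, tagged],    # drop duplicates
        ["bcftools", "view", "-g", "^miss", "-Oz", "-o", clean, dedup],     # drop sites with missing GTs
        ["bcftools", "index", "-f", clean],
    ]

    phased = []
    for panel in panels:
        chrom = panel["chr"]
        chr_vcf = f"unphased.chr{chrom}.vcf.gz"
        # Extract the chromosome, phase it against its reference panel, index the result.
        cmds.append(["bcftools", "view", "-r", chrom, "-Oz", "-o", chr_vcf, clean])
        cmds.append([
            "java", "-jar", beagle_jar,
            f"gt={chr_vcf}", f"ref={panel['ref_vcf']}", f"map={panel['gmap']}",
            f"chrom={chrom}", f"nthreads={threads}", f"out=phased_{chrom}",
        ])
        cmds.append(["bcftools", "index", "-f", f"phased_{chrom}.vcf.gz"])
        phased.append(f"phased_{chrom}.vcf.gz")

    # Concatenate per-chromosome results into the final phased VCF and index it.
    final = f"{output_prefix}.vcf.gz"
    cmds.append(["bcftools", "concat", "-Oz", "-o", final, *phased])
    cmds.append(["bcftools", "index", "-f", final])
    return cmds


demo_panels = [
    {"chr": "1", "ref_vcf": "/path/to/chr1_ref.vcf.gz", "gmap": "/path/to/chr1.gmap"},
    {"chr": "2", "ref_vcf": "/path/to/chr2_ref.vcf.gz", "gmap": "/path/to/chr2.gmap"},
]
for argv in plan_commands("./data/your_unphased.vcf.gz", demo_panels, "phased_output"):
    print(shlex.join(argv))
```

Printing the commands rather than running them keeps the sketch safe to try without bcftools or Beagle installed; in the pipeline itself, Nextflow runs the equivalent steps inside containers.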
The pipeline generates:
- Phased VCF files per chromosome: `phased_<chr>.vcf.gz`
- Concatenated phased VCF: `<output_prefix>.vcf.gz`
- Pipeline execution reports in `pipeline_info/`
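After a run, the file names above can be used to confirm that every chromosome was phased. The sketch below assumes the VCF files are written directly under `outdir`, which this README does not state; adjust the paths if the pipeline uses subdirectories. The helper names are hypothetical.

```python
from pathlib import Path


def expected_outputs(outdir, output_prefix, chromosomes):
    """List the phased VCFs the README says a run should produce."""
    out = Path(outdir)  # assumption: outputs sit directly under outdir
    per_chrom = [out / f"phased_{c}.vcf.gz" for c in chromosomes]
    return per_chrom + [out / f"{output_prefix}.vcf.gz"]


def missing_outputs(outdir, output_prefix, chromosomes):
    """Return the expected files that are not present on disk."""
    return [p for p in expected_outputs(outdir, output_prefix, chromosomes) if not p.exists()]


for path in expected_outputs("./results/", "phased_output", ["1", "2"]):
    print(path)
```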
- `local`: For local execution with Singularity
- `kutral`: For SLURM cluster execution on the `ngen-ko` queue
Default resource allocation:
- Memory: 120GB per process
- CPUs: 16 per process
Adjust in nextflow.config if needed.
Key parameters (set in params/params_beagle.yml):
| Parameter | Description | Default |
|---|---|---|
| `vcf_unphased` | Path to unphased VCF file | - |
| `refcsv` | Path to reference CSV file | - |
| `outdir` | Output directory | ./results/ |
| `output_prefix` | Prefix for output files | - |
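The table can be read as three parameters the user must supply and one with a default. The sketch below encodes that reading on an already-parsed parameter dictionary; treating a `-` default as "required" is an assumption, and `resolve_params` is a hypothetical helper rather than pipeline code.

```python
DEFAULTS = {"outdir": "./results/"}
REQUIRED = ("vcf_unphased", "refcsv", "output_prefix")  # assumption: "-" means no default


def resolve_params(user_params):
    """Apply the documented default and reject missing required parameters."""
    params = {**DEFAULTS, **user_params}
    missing = [key for key in REQUIRED if not params.get(key)]
    if missing:
        raise ValueError(f"missing required parameter(s): {', '.join(missing)}")
    return params


params = resolve_params({
    "vcf_unphased": "./data/your_unphased.vcf.gz",
    "refcsv": "./params/files_beagle.csv",
    "output_prefix": "phased_output",
})
print(params["outdir"])
```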
- Beagle: Haplotype phasing and imputation
- bcftools: VCF/BCF manipulation and indexing
- Nextflow: Workflow orchestration
If you use this pipeline, please cite:
- Beagle: Browning, B. L., & Browning, S. R. (2016). Genotype imputation with millions of reference samples. The American Journal of Human Genetics, 98(1), 116-126.
See LICENSE file for details.
Gabriel Cabas
For issues and questions, please open an issue on the repository.
