Description
When running the scE2G pipeline with a fragment file containing reads on chrM and using a standard genome file where chrX precedes chrM, the pipeline fails.
The root cause is in the frag_to_tagAlign rule, which uses the command:
sort -k 1,1V -k 2,2n -k3,3n
This command performs a GNU "version sort" on the chromosome column, which correctly places the autosomes in numerical order but then orders the non-numeric chromosomes alphabetically. As a result, chrM sorts before chrX and chrY.
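To make the ordering concrete, here is a minimal Python sketch that approximates version sort on chromosome names. It is not GNU sort's actual algorithm; `version_key` is a hypothetical helper that only compares digit runs numerically, which is enough to reproduce the ordering described above.

```python
import re

def version_key(name):
    # Split into alternating non-digit / digit runs, e.g. "chr10" -> ["chr", "10", ""].
    parts = re.split(r"(\d+)", name)
    # Compare digit runs as integers and everything else as plain strings.
    return [int(p) if p.isdigit() else p for p in parts]

chroms = ["chrX", "chr10", "chrM", "chr2", "chrY", "chr1", "chr22"]
version_sorted = sorted(chroms, key=version_key)
print(version_sorted)
# ['chr1', 'chr2', 'chr10', 'chr22', 'chrM', 'chrX', 'chrY']
```

Note that chrM lands before chrX, which is the opposite of the order in the default genome file.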
Bedtools lets the user define chromosome names and lengths in a two-column genome file. In scE2G, this file is configured through the chr_sizes field in config.yaml. The default file, ENCODE_rE2G/reference/GRCh38_EBV.no_alt.chrom.sizes.tsv, ends with:
chr22 50818468
chrX 156040895
chrY 57227415
chrM 16569
If the genome file passed to a downstream bedtools command lists chrX before chrM, bedtools detects that the version-sorted input stream does not follow the genome file's order and exits with an error:
Error: Sorted input specified, but the file - has the following record with a different sort order than the genomeFile ENCODE_rE2G/reference/GRCh38_EBV.no_alt.chrom.sizes.tsv
chrX 13589 13610 N 1000 +
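The mismatch can be illustrated with a rough Python sketch of this kind of order check. This is not bedtools' implementation; `load_chrom_order` and `first_out_of_order` are hypothetical helpers, and the chr22 and chrM coordinates are made up for illustration (only the chrX record comes from the error message above).

```python
import io

def load_chrom_order(genome_file):
    # Map each chromosome name to its rank (line position) in the genome file.
    order = {}
    for line in genome_file:
        fields = line.split()
        if fields:
            order.setdefault(fields[0], len(order))
    return order

def first_out_of_order(records, order):
    # Return the first record whose chromosome ranks lower than one already seen.
    prev_rank = -1
    for rec in records:
        rank = order[rec[0]]
        if rank < prev_rank:
            return rec
        prev_rank = rank
    return None

# Tail of the default genome file, as quoted above.
genome = io.StringIO(
    "chr22\t50818468\nchrX\t156040895\nchrY\t57227415\nchrM\t16569\n"
)
order = load_chrom_order(genome)

# Version-sorted stream: chrM comes before chrX.
records = [("chr22", 100, 120), ("chrM", 5, 25), ("chrX", 13589, 13610)]
offending = first_out_of_order(records, order)
print(offending)
# ('chrX', 13589, 13610)
```

The first chrX record is flagged because chrM, which ranks last in the genome file, has already been seen.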
You may also get a downstream error in the create_neighborhoods rule:
AssertionError: Dimension mismatch
This error is raised by the count_single_feature_for_bed() Python function and appears to be a secondary symptom of the same sort-order discrepancy.
The scE2G pipeline should support any chromosome order defined in the user-specified chr_sizes genome file.
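One possible direction, sketched in Python under the assumption that records can be sorted in memory, is to rank chromosomes by their position in the chr_sizes file instead of relying on version sort. `sort_by_genome_order` is a hypothetical helper, not part of scE2G, and the coordinates are illustrative.

```python
def sort_by_genome_order(records, chrom_names):
    # Rank each chromosome by its position in the genome file.
    rank = {name: i for i, name in enumerate(chrom_names)}
    # Sort by genome-file rank, then start, then end.
    # A chromosome missing from the genome file raises KeyError here.
    return sorted(records, key=lambda r: (rank[r[0]], r[1], r[2]))

# Chromosome order taken from the tail of the default genome file.
chrom_names = ["chr22", "chrX", "chrY", "chrM"]
records = [
    ("chrM", 5, 25),
    ("chrX", 13589, 13610),
    ("chr22", 100, 120),
    ("chrX", 200, 230),
]
sorted_records = sort_by_genome_order(records, chrom_names)
for rec in sorted_records:
    print(*rec, sep="\t")
```

With this ordering chrX precedes chrM, matching the genome file regardless of how the chromosome names compare as strings.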