scE2G does not support all valid bedtools genome file specifications #96

@kaybrand

Description

The scE2G pipeline fails when the fragment file contains reads on chrM and the genome file lists chrX before chrM, as standard genome files do.

The root cause is in the frag_to_tagAlign rule, which uses the command:

    sort -k 1,1V -k 2,2n -k3,3n

The V modifier on the first key applies a GNU “version sort” to the chromosome column. This places the autosomes in numerical order but sorts the non-numeric chromosome names alphabetically, so chrM comes before chrX and chrY.
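
The ordering described above can be reproduced with a rough Python approximation. The helper `approx_version_key` is hypothetical and only mimics GNU version sort for simple names such as these; it is not the GNU algorithm and is not part of scE2G.

```python
import re

def approx_version_key(name):
    """Rough stand-in for GNU version sort on simple names such as 'chr10'."""
    # re.split with a capture group yields alternating text/digit chunks,
    # starting with text, so odd positions are always digit runs.
    chunks = re.split(r"(\d+)", name)
    return [int(c) if i % 2 else c for i, c in enumerate(chunks)]

chroms = ["chrX", "chr10", "chrM", "chr2", "chrY", "chr1"]
version_order = sorted(chroms, key=approx_version_key)
print(version_order)
# ['chr1', 'chr2', 'chr10', 'chrM', 'chrX', 'chrY']
```

The numbered chromosomes sort numerically, but chrM lands ahead of chrX, which is the opposite of the order in the default genome file.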

bedtools lets users define chromosome names and lengths in a two-column genome file. For commands run on sorted input, the order of the lines in this file is also the chromosome order that bedtools expects. In scE2G, this file is set through the chr_sizes field in config.yaml. The default file, ENCODE_rE2G/reference/GRCh38_EBV.no_alt.chrom.sizes.tsv, ends with:

    chr22	50818468
    chrX	156040895
    chrY	57227415
    chrM	16569

Because this file lists chrX before chrM while the version-sorted stream puts chrM first, the downstream bedtools command detects that its input is not sorted according to the genome file and exits with an error:

    Error: Sorted input specified, but the file - has the following record with a different sort order than the genomeFile ENCODE_rE2G/reference/GRCh38_EBV.no_alt.chrom.sizes.tsv

    chrX    13589   13610   N       1000    +
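
The mismatch can be illustrated with a small check that walks a version-sorted stream and reports the first record whose chromosome breaks the genome-file order. This is only a sketch of the effect, not bedtools's actual implementation. The function `first_out_of_order` is hypothetical, the chr22 and chrM coordinates are made up for illustration, and only the chrX record is taken from the error above.

```python
def first_out_of_order(records, genome_order):
    """Return the first record whose chromosome breaks the genome-file order, or None."""
    rank = {chrom: i for i, chrom in enumerate(genome_order)}
    prev_rank = -1
    for rec in records:
        r = rank[rec[0]]  # raises KeyError for chromosomes missing from the genome file
        if r < prev_rank:
            return rec
        prev_rank = r
    return None

# Tail of the default genome file, in file order.
genome_order = ["chr22", "chrX", "chrY", "chrM"]

# Stream as produced by the version sort: chrM before chrX.
version_sorted = [
    ("chr22", 100, 120),     # illustrative coordinates
    ("chrM", 5, 25),         # illustrative coordinates
    ("chrX", 13589, 13610),  # record reported in the error message
]

offending = first_out_of_order(version_sorted, genome_order)
print(offending)
```

The first offending record is the chrX one, because it follows chrM in the stream while chrX precedes chrM in the genome file.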
  

You may also get a downstream error in the create_neighborhoods rule:

    AssertionError: Dimension mismatch

This error is thrown by the count_single_feature_for_bed() Python function, and appears to be a secondary symptom of the discrepancy in sorting rules.

The scE2G pipeline should support any chromosome order defined in the user-specified chr_sizes genome file.
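
One possible way to honor the genome file is to sort records by each chromosome's line position in that file rather than by name. The sketch below is an assumption about how this could look, not the pipeline's code: `read_genome_order` and `sort_by_genome` are hypothetical helpers, and the chr22 and chrM records are illustrative.

```python
import io

def read_genome_order(handle):
    """Map each chromosome name to its line position in a two-column genome file."""
    order = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[0] not in order:
            order[fields[0]] = len(order)
    return order

def sort_by_genome(records, order):
    """Sort (chrom, start, end, ...) records by genome-file rank, then start, then end."""
    return sorted(records, key=lambda r: (order[r[0]], r[1], r[2]))

# Stand-in for the tail of the chr_sizes file; a real run would open the file itself.
genome_file = io.StringIO(
    "chr22\t50818468\nchrX\t156040895\nchrY\t57227415\nchrM\t16569\n"
)
order = read_genome_order(genome_file)

records = [
    ("chrM", 5, 25),
    ("chrX", 13589, 13610),
    ("chr22", 100, 120),
    ("chrX", 200, 230),
]
sorted_records = sort_by_genome(records, order)
for rec in sorted_records:
    print(*rec, sep="\t")
```

Because the rank comes from the file itself, the same code follows whatever chromosome order the user supplies. An in-memory sort like this is only practical for small inputs; a real fragment file would likely need a streaming or external sort.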
