Skip to content

Conversation

@NagaComBio
Copy link
Member

Merging hg38 developments into the main (master) branch.

Main updates

  • Hg38 related files are in a separate analysis XML
  • Hg19 changes
    • EVS and ExAC are removed from the annotation in hg19 since they were not used for no-control filtering any changes in the high confidence somatic variants. But this changes the column length in the output file.
    • Default local control AF threshold to 0.05 from 0.01, this might increase the number of final high confidence variants.
    • Used user-defined AF thresholds for the 'SNP_germline_support' annotation instead of hard-coded thresholds, this might change the last counts.

Integrated testing

  • Performed different tests and the results are in Phabricator {T5184#87479}

@NagaComBio NagaComBio requested review from GWarsow and vinjana and removed request for GWarsow March 16, 2021 10:34
Conflicts:
	README.md
	resources/analysisTools/snvPipeline/environments/tbi-lsf-cluster.sh
	resources/analysisTools/snvPipeline/filter_PEoverlap.py
	resources/configurationFiles/analysisSNVCalling.xml
t=l.split('\t')
if (t[0][0] != "#") and isValid(t):
# Skipping the non-primary assembly variants from purity calculations
if t[0].startswith('HLA') or t[0].endswith('_alt'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

t -> fields
l -> line

BTW (2 lines up): line[0] != "#" is much clearer than fields[0][0] != "#" (not to speak of t[0][0]).

# Skipping the non-primary assembly variants from purity calculations
if t[0].startswith('HLA') or t[0].endswith('_alt'):
l=vcf.readline()
continue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is i (next line)? Maybe mapped_chromosome?

# 23.05.2016 JB: Excluded multiallelic SNVs
if ',' in split_line[header.index("ALT")]: continue

# 21.02.2023 NP: Excluded SNVs with 'N' before or after "," in context
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a short word about why you exclude these? I guess it is not obvious because it was added to the SNV workflow only after years of operation.

@@ -1,3 +1,483 @@
<<<<<<< HEAD
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merge error.

Comment on lines +131 to +132
# Reference file for CRAM files
reference_file = args.refFileName
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICS reference_file is only used once, in the CRAM branch of the if-else block below. Moving it down will make it clearer

Suggested change
# Reference file for CRAM files
reference_file = args.refFileName

samfile = pysam.Samfile(args.alignmentFile, mode)
elif args.alignmentFile.split(".")[-1] == "cram":
mode += "c"
samfile = pysam.Samfile(args.alignmentFile, mode, reference_filename = reference_file)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
samfile = pysam.Samfile(args.alignmentFile, mode, reference_filename = reference_file)
# CRAM needs a reference file.
samfile = pysam.Samfile(args.alignmentFile, mode, reference_filename = args.refFileName)

<cvalue name='CHROMOSOME_LENGTH_FILE' value='${hg38BaseDirectory}/stats/GRCh38_decoy_ebv_alt_hla_phiX.fa.chrLength.tsv' type="path" />

<cvalue name="CHROMOSOME_INDICES" value="( 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y HLA ALT )" type="bashArray" description="Chr indices for the calling"/>
<cvalue name="CHROMOSOME_HLA_CONTIGS_FILE" value='${hg38BaseDirectory}/stats/GRCh38_decoy_ebv_alt_hla_phiX.fa.HLA_contigs.bed' type="path" description="HLA contig list"/>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a description field with a short description of the file content and format? If I saw that correctly, it should be suited for samtools mpileup -R $file, right? That would already be a valuable information. Same for the ALT file.

$current_c_chr = $ctrl[0];
$ccoord = $ctrl[1];
if ($tcoord == $ccoord) # matching pair found!
if ($tcoord == $ccoord && $current_t_chr eq $current_c_chr) # matching pair found!
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I see this correctly, the algorithm takes the two position-sorted VCFs and steps forward the feature -- either tumor or control VCF -- that is expected "earlier" in the algorithm, until it finds a match. Right?

I think, here and elsewhere, it would make more sense and would make the code clearer, if the chromosome comparison came before the coordinate comparison, i.e.

Suggested change
if ($tcoord == $ccoord && $current_t_chr eq $current_c_chr) # matching pair found!
if ($current_t_chr eq $current_c_chr && $tcoord == $ccoord) # matching pair found!



chromLength = read.table(file = opt$chromLengthFile, header = F)
chromLength$V1 = gsub("chr", "", chromLength$V1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's good practice to define a header during loading of the table. At least it would add a bit of information about the structure of the table.

}
else{
$all++;
@help = split ("\t", $line);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could also rename this to @line. I think, it also has advantages if two variables that differ only by the type, but not by the content (in principle) are called the same. Some for @head above.

<cvalue name="CHROM_SIZES_FILE" value="${hg38BaseDirectory}/stats/GRCh38_decoy_ebv_alt_hla_phiX.fa.chrLenOnlyACGT_realChromosomes.tsv" type="path" />
<cvalue name='CHROMOSOME_LENGTH_FILE' value='${hg38BaseDirectory}/stats/GRCh38_decoy_ebv_alt_hla_phiX.fa.chrLength.tsv' type="path" />

<cvalue name="CHROMOSOME_INDICES" value="( 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y HLA ALT )" type="bashArray" description="Chr indices for the calling"/>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should add CHROMOSOME_INDICES_PLOTTING with a description.

* 3.0.0
* Major
* Support for hg38/GRCh38 reference genome and variant calling from ALT and HLA contigs.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any new mandatory configuration values? At least list them. Details should be in the XML.
Was the semantics of conf. values changed?

* Major
* Support for hg38/GRCh38 reference genome and variant calling from ALT and HLA contigs.
* Minor
* For hg38: Removing mappability and repeat elements' annotations from penalty calculations.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any new optional configuration values? (at least list them)

* skipREMAP: Option to remove repeat elements and mappability from confidence annotations in hg19.
* Removing EVS And ExAC AF from the annotations and no-control workflow filtering
* Support for variant calling from CRAM files
* Bug fix: Removing "quote" around the raw filter option `<RAW_SNV_FILTER_OPTIONS>`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. But this is a technical/code description. What was the effect of the bug to the user? (probably that multiple RAW_SNV_FILTER_OPTIONS could not be used).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you try limiting the (heap) memory consumption of this Python process. See here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is passing over the VCF once. Not sure how much memory it uses, but maybe it is worthwhile limiting the memory with this approach.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants