README.md

Artifact Removal by Classifier (ARC):

Goal/Purpose:
Artifact Removal by Classifier (ARC) is a supervised random forest model designed to distinguish true rare de novo variants (RDNVs) from lymphoblastoid cell line (LCL) specific genetic aberrations or other types of artifacts, such as sequencing and mapping errors.

Background/Overview:
ARC was developed and published as part of the iHART project; please reference our manuscript for full details. We trained ARC on RDNVs identified in 76 pairs of fully phase-able monozygotic (MZ) twins with whole-genome sequence (WGS) data derived from LCL DNA, using 48 features representing intrinsic genomic properties, (e.g., GC content and properties associated with de novo hotspots), sample specific properties (e.g., genome-wide number of de novo SNVs), signatures of transformation of peripheral B lymphocytes by Epstein-Barr virus (e.g., number of de novo SNVs in immunoglobulin genes), or variant properties (e.g., GATK variant metrics). We subsequently tested ARC on RDNVs identified in 17 fully phase-able whole blood (WB) and matched LCL samples with WGS data. The resulting random forest classifier achieved an area under the receiver operating characteristic (ROC) curve of 0.99 and 0.98 in the training and test set, respectively. We selected a conservative ARC score threshold (0.4) that achieved a minimum precision and recall rate of >0.9 and ~0.8, respectively, across all 10-folds of the training set cross validation; and achieved a precision and recall rate of >0.9 and >0.8, respectively, in the test set.

Pipelines:
There are two pipelines required for ARC (both contained within this repository):

Annotation
Classification

Assumptions made by the pipeline:
This pipeline assumes that you have Perl, Python, ANNOVAR, BEDTools, Gawk, and Reference Genome human_g1k_v37 available on your system. Please follow the URLs provided below to download any resources you may need. Please note that ARC was only tested using the version of each resource listed below.

Perl: Tested on Perl version 5.10.1. The latest version of Perl (5.28.0) is available at https://www.perl.org/get.html.
Python: Tested on Python versions 2.7.6+ and 3.6.4+. Multiple release versions of Python packages are available at https://www.python.org/downloads/. This pipeline assumes that you are running Python version 2.7 or later. In addition, Python libraries pandas, sklearn, scipy, and matplotlib are also required for the pipeline.
ANNOVAR: Tested on ANNOVAR version 2015Dec14. A link to the latest version of ANNOVAR (2018Apr16) is available at http://annovar.openbioinformatics.org/en/latest/user-guide/download/ (registration required).
BEDTools: Tested on BEDTools version 2.20.1. Instructions on how to install the latest version of BEDTools (2.25.0) are available at https://bedtools.readthedocs.io/en/latest/content/installation.html.
Gawk: Tested on Gawk version 4.1.4. Instructions on how to get and extract the latest version of the gawk distribution (4.2.1) are available at https://www.gnu.org/software/gawk/manual/html_node/Getting.html#Getting.
Reference Genome: Reference Genome hs37d5 is available at http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/, under filename human_g1k_v37.fasta.gz.

The variants were identified using GATK (v3.2.2) and raw RDNVs were identified as described in our manuscript.

Setup/How to get started:

Step 1: Get the code

Click on the green "Clone or Download" button on the top right hand corner.

Option 1: Download

Select "Download Zip".
Move this zipped folder to your desired folder.

Option 2: Clone

Copy the URL provided.
Navigate to the directory where you would like to copy the repository using the command line.
Clone the repository using the following command:

git clone <URL>

Step 2: Unzip Genome in a Bottle and Replication Timing Annotation Source files

Navigate to the Annotation_Source_Files directory

cd Annotation_Source_Files

Unzip Genome in a Bottle file:

gunzip Genome_in_a_bottle_problematic_regions.bed.gz

Unzip Replication Timing file:

gunzip hg19_ReplicationTiming.txt.gz

Step 3: Download and Modify Additional Annotation Source files
Download Signal from ENCODE/Caltech GM12878 RNA-seq

Navigate to the UCSC Genome Browser directory that contains the downloadable files associated with this ENCODE composite track: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCaltechRnaSeq/
Download filename wgEncodeCaltechRnaSeqGm12878R2x75Th1014Il200SigRep1V4.bigWig (“ENCODE file”)
Download UCSCs BigWigToWig Tool:
a. Navigate to the UCSC Genome Browser Software directory: http://hgdownload.soe.ucsc.edu/admin/exe/
b. Select your proper machine type
c. Download filename bigWigToWig
Use the following command to convert the ENCODE file from bigWig format to wig format:

./bigWigToWig wgEncodeCaltechRnaSeqGm12878R2x75Th1014Il200SigRep1V4.bigWig wgEncodeCaltechRnaSeqGm12878R2x75Th1014Il200SigRep1V4.wig

Note: You may need to make bigWigToWig executable by using the following command:

chmod +x ./bigWigToWig

Covert ENCODE file into a format that is appropriate for the pipeline
Usage:

bash Make_EncodeFile.sh <input: ENCODE file> <output_dir: Output dir>

Move the hg19_EncodeCaltechRnaSeqGm12878R2x75.txt file produced by the Make_EncodeFile.sh script to the Annotation_Source_Files directory

Download RepeatMasker sequence

Navigate to the UCSC genome annotation database for the hg19 Human Reference Genome: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/
Download filename rmsk.txt.gz (“RMSK file”)
Unzip the rmsk.txt.gz file

gunzip rmsk.txt.gz

Convert the unzipped RMSK file ("rmsk.txt") into a format that is appropriate for the pipeline Usage:

bash Make_RMSK.sh <input: unzipped RMSK file> <output_dir: Output dir>

Move the hg19_rmsk.bed file produced by the Make_RMSK.sh script to the Annotation_Source_Files directory.

Overview of the practice data
We provide a small set of RDNVs (n = 25) as practice data. We recommend running both pipelines: the annotation pipeline followed by the classification pipeline, on these practice data before applying ARC to your own data. These practice data include two files: a list of RDNVs and a VCF.

A tab delimited list of RDNVs (“RDNV flat file”).
The practice RDNV flat file is: iHART_25_denovo_variants_ARC_practice_data_RDNV_flat_file.db

The RDNV file should contain all identified de novo variants in all of your samples.
The required columns are: Chr, Position, Ref, Alt, child_id, Info, genotype, genotype_subfields, and subfield_format.
- Note: The pipeline does not support periods or other non-alphanumeric characters in child_id.

Example of RDNV flat file:

Chr        Position        Endpos        Ref        Alt        esp6500siv2_all        ExAC_ALL        1000g2015aug_all        cg46        CADD        CADD_Phred        Polyphen2_HDIV_score        Polyphen2_HDIV_pred        Polyphen2_HVAR_score        Polyphen2_HVAR_pred        consequence        gene_symbol        family        child_id        is_aff        inheritance_type        Info        genotype        genotype_subfields        subfield_format        max_control_AF        HNP_AF        adjusted_hom_freq_PSP        adjusted_hom_freq_HNP        %PSP_samples_missing        %HNP_samples_missing        RDNV_category        GIAB_flag_count        HGNC_symbol        HGNC_mapping        %RVIS_ESP_0.1%        %RVIS_ExAC_0.01%        %RVIS_ExAC_0.05%popn        %RVIS_ExAC_0.1%popn        constraint_score        pLI        pRecessive        pNull        pHI        chrN:N-N        pph2_prediction        pph2_prob        
1        99034        NA        T        G        NA        NA        NA        NA        NA        NA        NA        NA        NA        NA        intron_variant&non_coding_transcript_variant        RP11-34P13.7        FamilyA        iHART1945        1        denovo        PASS;ABHet=0.694;ABHom=1;AC=2;AF=0;AN=4480;BaseQRankSum=1.49;DP=73656;ExcessHet=3.226;FS=0;InbreedingCoeff=-0.0075;MLEAC=2;MLEAF=0.0004378;MQ=52.6;MQ0=0;MQRankSum=0.225;NDA=5;QD=9.21;ReadPosRankSum=-1.091;SOR=0.523;VQSLOD=-4.97;VariantType=SNP;culprit=MQ;cytoBand=1p36.33;Func=ncRNA_intronic;Gene=RP11-34P13.7;genomicSuperDups=0.992727,chr1:235525;FATHMM_c=0.08231;FATHMM_nc=0.00333;CSQ=G|intron_variant&non_coding_transcript_variant|MODIFIER|RP11-34P13.7|ENSG00000238009|Transcript|ENST00000466430|lincRNA||2/3|ENST00000466430.1:n.264-6794A>C|||||||||-1|SNV|Clone_based_vega_gene||YES|||||||||||||||||||||||||||||||||||||||||        0/1        0.67:10,5:.:99:180,0,518        GT:AB:AD:DP:GQ:PL        0        0.0        NA        0.0        NA        2.81995661605        somatic        0        NA        NA        NA        NA        NA        NA        NA        NA        NA        NA        NA        NA        NA        NA        
1        12878589        NA        GT        G        NA        NA        NA        NA        NA        NA        NA        NA        NA        NA        downstream_gene_variant        RP5-845O24.8        FamilyB        iHART2634        1        denovo        PASS;AC=1;AF=0;AN=4600;BaseQRankSum=-0.169;DP=66478;ExcessHet=3.0103;FS=2.937;InbreedingCoeff=-0.0001;MLEAC=1;MLEAF=0.0002133;MQ=58.02;MQ0=0;MQRankSum=-0.659;NDA=1;QD=7.51;ReadPosRankSum=0.751;SOR=1.197;VQSLOD=-1.155;VariantType=DELETION.NumRepetitions_6.EventLength_1.RepeatExpansion_T;culprit=FS;cytoBand=1p36.21;Func=intergenic;Gene=PRAMEF1,RP5-845O24.8;GeneDetail=21813,3959;genomicSuperDups=0.955807,chr1:13210222;CSQ=-|downstream_gene_variant|MODIFIER|RP5-845O24.8|ENSG00000228338|Transcript|ENST00000438401|lincRNA|||||||||||3959|-1|deletion|Clone_based_vega_gene||YES|||||||||||||||||||||||||||||||||||||||||        0/1        21,16:.:99:341,0,523        GT:AD:DP:GQ:PL        0        0.0        NA        0.0        NA        0.108459869848        somatic        0        NA        NA        NA        NA        NA        NA        NA        NA        NA        NA        NA        NA        NA        NA        
3        194314892        NA        C        CTGTCTG        NA        NA        NA        NA        NA        NA        NA        NA        NA        NA        intron_variant        TMEM44        FamilyC        iHART1417        1        denovo        PASS;AC=1;AF=0;AN=4582;BaseQRankSum=-0.413;DP=88579;ExcessHet=4.2252;FS=3.88;InbreedingCoeff=-0.0038;MLEAC=1;MLEAF=0.0002141;MQ=62.8;MQ0=0;MQRankSum=-0.945;NDA=2;NEGATIVE_TRAIN_SITE;QD=11.41;ReadPosRankSum=-0.083;SOR=1.508;VQSLOD=-0.9498;VariantType=INSERTION.NOVEL_6;culprit=FS;cytoBand=3q29;Func=intronic;Gene=TMEM44;CSQ=TGTCTG|intron_variant|MODIFIER|TMEM44|ENSG00000145014|Transcript|ENST00000392432|protein_coding||10/10|ENST00000392432.2:c.1318-5525_1318-5524insCAGACA|||||||||-1|insertion|HGNC|25120|YES|||CCDS54699.1|ENSP00000376227|TMM44_HUMAN|Q96I73_HUMAN|UPI00015E0940||||||||||||||||||||||||||||||||||        0/1        9,5:.:99:222,0,402        GT:AD:DP:GQ:PL        0        0.0        NA        0.0        NA        0.867678958785        somatic        0        TMEM44        NA        96.84        68.34        76.25        34.77        -2.472364196        3.67048384751758e-05        0.567721303337738        0.432241991823787        0.0212399169818057        chr3:194308402-194354418        NA        NA

A VCF file.
The practice VCF file is: iHART_25_denovo_variants_ARC_practice_data.vcf

This VCF should contain variants from all of your samples (a multi-sample jointly genotyped VCF).
This VCF should include variants from all chromosomes.
- Again, please note that ARC expects hg19/b37 coordinates and that this VCF will include GATK variant information.

Example of VCF file:

*Header lines excluded*
#CHROM        POS        ID        REF        ALT        QUAL        FILTER        INFO        FORMAT        iHART1327        iHART1010        iHART1121        iHART1265        iHART1718        iHART1672        iHART1417        iHART2055        iHART1945        iHART2158        iHART2253        iHART2268        iHART2292        iHART2116        iHART1970        iHART2341        iHART2344        iHART2424        iHART2634        iHART2690        iHART2951        iHART2726        iHART3069        iHART3081        iHART1015
1        99034        .        T        G        303.89        PASS        ABHet=0.694;ABHom=1;AC=2;AF=0;AN=4480;BaseQRankSum=1.49;DP=73656;ExcessHet=3.226;FS=0;InbreedingCoeff=-0.0075;MLEAC=2;MLEAF=0.0004378;MQ=52.6;MQ0=0;MQRankSum=0.225;NDA=5;QD=9.21;ReadPosRankSum=-1.091;SOR=0.523;VQSLOD=-4.97;VariantType=SNP;culprit=MQ;cytoBand=1p36.33;Func=ncRNA_intronic;Gene=RP11-34P13.7;genomicSuperDups=0.992727,chr1:235525;FATHMM_c=0.08231;FATHMM_nc=0.00333;CSQ=G|intron_variant&non_coding_transcript_variant|MODIFIER|RP11-34P13.7|ENSG00000238009|Transcript|ENST00000466430|lincRNA||2/3|ENST00000466430.1:n.264-6794A>C|||||||||-1|SNV|Clone_based_vega_gene||YES|||||||||||||||||||||||||||||||||||||||||        GT:AB:AD:DP:GQ:PL        0/0:.:14,0:14:32:0,32,405        0/0:.:9,0:9:24:0,24,360        0/0:.:9,0:9:24:0,24,360        0/0:.:34,0:34:69:0,69,1035        0/0:.:16,0:16:22:0,22,471        0/0:.:9,0:9:21:0,21,267        0/0:.:13,0:13:21:0,21,442        0/0:.:10,0:10:15:0,15,287        0/1:0.67:10,5:.:99:180,0,518        0/0:.:20,0:20:21:0,21,519        0/0:.:9,0:9:24:0,24,360        0/0:.:15,0:15:42:0,42,630        0/0:.:25,0:25:12:0,12,693        0/0:.:12,0:12:25:0,25,366        0/0:.:23,0:23:45:0,45,644        0/0:.:14,0:14:39:0,39,405        0/0:.:11,0:11:30:0,30,450        0/0:.:28,0:28:69:0,69,1035        0/0:.:25,0:25:60:0,60,900        0/0:.:26,0:26:66:0,66,990        0/0:.:37,0:37:45:0,45,1108        0/0:.:29,0:29:40:0,40,874        0/0:.:15,0:15:36:0,36,540        0/0:.:28,0:28:63:0,63,945        0/0:.:34,0:34:93:0,93,1395
1        12878589        .        GT        G        277.84        PASS        AC=1;AF=0;AN=4600;BaseQRankSum=-0.169;DP=66478;ExcessHet=3.0103;FS=2.937;InbreedingCoeff=-0.0001;MLEAC=1;MLEAF=0.0002133;MQ=58.02;MQ0=0;MQRankSum=-0.659;NDA=1;QD=7.51;ReadPosRankSum=0.751;SOR=1.197;VQSLOD=-1.155;VariantType=DELETION.NumRepetitions_6.EventLength_1.RepeatExpansion_T;culprit=FS;cytoBand=1p36.21;Func=intergenic;Gene=PRAMEF1,RP5-845O24.8;GeneDetail=21813,3959;genomicSuperDups=0.955807,chr1:13210222;CSQ=-|downstream_gene_variant|MODIFIER|RP5-845O24.8|ENSG00000228338|Transcript|ENST00000438401|lincRNA|||||||||||3959|-1|deletion|Clone_based_vega_gene||YES|||||||||||||||||||||||||||||||||||||||||        GT:AD:DP:GQ:PL        0/0:43,0:43:70:0,70,1329        0/0:32,0:32:69:0,69,972        0/0:27,0:27:61:0,61,823        0/0:29,0:29:62:0,62,868        0/0:29,0:29:63:0,63,882        0/0:22,0:22:60:0,60,659        0/0:30,0:30:63:0,63,839        0/0:29,0:29:60:0,60,906        0/0:30,0:30:75:0,75,949        0/0:26,0:26:61:0,61,829        0/0:21,0:21:60:0,60,900        0/0:33,0:33:72:0,72,994        0/0:30,0:30:64:0,64,904        0/0:30,0:30:64:0,64,970        0/0:26,0:26:63:0,63,817        0/0:33,0:33:77:0,77,999        0/0:24,0:24:60:0,60,921        0/0:36,0:36:70:0,70,1046        0/1:21,16:.:99:341,0,523        0/0:32,0:32:71:0,71,945        0/0:21,0:21:60:0,60,900        0/0:26,0:26:62:0,62,840        0/0:39,0:39:89:0,89,1164        0/0:27,0:27:63:0,63,853        0/0:29,0:29:69:0,69,905
3        194314892        .        C        CTGTCTG        159.72        PASS        AC=1;AF=0;AN=4582;BaseQRankSum=-0.413;DP=88579;ExcessHet=4.2252;FS=3.88;InbreedingCoeff=-0.0038;MLEAC=1;MLEAF=0.0002141;MQ=62.8;MQ0=0;MQRankSum=-0.945;NDA=2;NEGATIVE_TRAIN_SITE;QD=11.41;ReadPosRankSum=-0.083;SOR=1.508;VQSLOD=-0.9498;VariantType=INSERTION.NOVEL_6;culprit=FS;cytoBand=3q29;Func=intronic;Gene=TMEM44;CSQ=TGTCTG|intron_variant|MODIFIER|TMEM44|ENSG00000145014|Transcript|ENST00000392432|protein_coding||10/10|ENST00000392432.2:c.1318-5525_1318-5524insCAGACA|||||||||-1|insertion|HGNC|25120|YES|||CCDS54699.1|ENSP00000376227|TMM44_HUMAN|Q96I73_HUMAN|UPI00015E0940||||||||||||||||||||||||||||||||||        GT:AD:DP:GQ:PL        0/0:49,0:49:99:0,120,1800        0/0:41,0:41:99:0,111,1665        0/0:30,0:30:61:0,61,917        0/0:42,0:42:99:0,103,1321        0/0:37,0:37:69:0,69,1035        0/0:32,0:32:87:0,87,1305        0/1:9,5:.:99:222,0,402        0/0:33,0:33:65:0,65,963        0/0:26,0:26:61:0,61,819        0/0:27,0:27:60:0,60,811        0/0:22,0:22:60:0,60,900        0/0:33,0:33:81:0,81,1044        0/0:39,0:39:99:0,99,1258        0/0:48,0:48:99:0,108,1620        0/0:34,0:34:39:0,39,965        0/0:40,0:40:75:0,75,1125        0/0:29,0:29:52:0,52,853        0/0:36,0:36:64:0,64,1095        0/0:35,0:35:81:0,81,1215        0/0:50,0:50:99:0,108,1620        0/0:53,0:53:99:0,102,1530        0/0:40,0:40:84:0,84,1260        0/0:29,0:29:60:0,60,900        0/0:34,0:34:87:0,87,1305        0/0:39,0:39:87:0,87,1305

Recalculate ABHet for all RDNVs:

Purpose of ABHet recalculation:
The ABHet annotation is not currently provided for indels by GATK. Using the ABHet formula (below), we manually calculate the ABHet value for all SNVs and indels. Critically, if your dataset includes duplicates or monozygotic twins, you should exclude one of the duplicates by providing a list of duplicate samples when recalculating ABHet.

Usage:

python calculate_abhet_collapsing_duplicates.py --vcf <VCF from which to calculate abhet> --output_dir <Output dir> [Optional: --variants, --samples_to_exclude_when_true] (The pipeline expects this to be "unrenamed PAR", i.e., sex chromosomes should be "X" and "Y".)

Options:

--variants: provide a tab separated: chrom, pos, ref, alt, comma-separated list of TRUE variants. If empty, will use all variants in VCF.

--samples_to_exclude_when_true: list of samples (e.g. WB or MZ_twin) to exclude when concordant = TRUE). Variants found in these samples will be excluded.

Note: manual_abhet will match adjusted_abhet for SNVs when no samples to exclude are listed.

Note: This script takes in one multi-sample VCF file with all chromosomes included.

Output (VCF_file_name_with_abhet_recalculation.txt):

chr        pos        snp_id        ref        alt        gatk_abhet        manual_abhet        samples_excluded        adjusted_abhet
1        99034        .        T        G        0.694        0.667        NA        0.667
1        12878589        .        GT        G        None        0.568        NA        0.568
3        194314892        .        C        CTGTCTG        None        0.643        NA        0.643

Run the annotation pipeline: Steps 1-4 prepare your input files by reformatting to pipeline compatible formats.

We have provided an "Output" directory to help organize the output from the annotation and classification pipeline.
If you run into issues at any point in the annotation pipeline, please feel free to check out our Issues.md page for solutions.

Step 1: Make a version of the RDNV flat file that the pipeline understands.

Part 1: Specify Gawk directory in makeOwnFlatDb.sh (provided in the Scripts directory)

vi makeOwnFlatDb.sh

Set AWK variable (makeOwnFlatDb.sh, Line 3) to the absolute path of the Gawk software directory.

AWK="" #Insert absolute path to Gawk

Part 2: Run the makeOwnFlatDb.sh script

bash makeOwnFlatDb.sh <input: RDNV flat file> <output_file: desired output file name and directory for pipeline-specific flat file>

Output:

1        99034        T        G        iHART1945        GT:AB:AD:DP:GQ:PL        0/1:0.67:10,5:.:99:180,0,518        PASS;ABHet=0.694;ABHom=1;AC=2;AF=0;AN=4480;BaseQRankSum=1.49;DP=73656;ExcessHet=3.226;FS=0;InbreedingCoeff=-0.0075;MLEAC=2;MLEAF=0.0004378;MQ=52.6;MQ0=0;MQRankSum=0.225;NDA=5;QD=9.21;ReadPosRankSum=-1.091;SOR=0.523;VQSLOD=-4.97;VariantType=SNP;culprit=MQ;cytoBand=1p36.33;Func=ncRNA_intronic;Gene=RP11-34P13.7;genomicSuperDups=0.992727,chr1:235525;FATHMM_c=0.08231;FATHMM_nc=0.00333;CSQ=G|intron_variant&non_coding_transcript_variant|MODIFIER|RP11-34P13.7|ENSG00000238009|Transcript|ENST00000466430|lincRNA||2/3|ENST00000466430.1:n.264-6794A>C|||||||||-1|SNV|Clone_based_vega_gene||YES|||||||||||||||||||||||||||||||||||||||||
1        12878589        GT        G        iHART2634        GT:AD:DP:GQ:PL        0/1:21,16:.:99:341,0,523        PASS;AC=1;AF=0;AN=4600;BaseQRankSum=-0.169;DP=66478;ExcessHet=3.0103;FS=2.937;InbreedingCoeff=-0.0001;MLEAC=1;MLEAF=0.0002133;MQ=58.02;MQ0=0;MQRankSum=-0.659;NDA=1;QD=7.51;ReadPosRankSum=0.751;SOR=1.197;VQSLOD=-1.155;VariantType=DELETION.NumRepetitions_6.EventLength_1.RepeatExpansion_T;culprit=FS;cytoBand=1p36.21;Func=intergenic;Gene=PRAMEF1,RP5-845O24.8;GeneDetail=21813,3959;genomicSuperDups=0.955807,chr1:13210222;CSQ=-|downstream_gene_variant|MODIFIER|RP5-845O24.8|ENSG00000228338|Transcript|ENST00000438401|lincRNA|||||||||||3959|-1|deletion|Clone_based_vega_gene||YES|||||||||||||||||||||||||||||||||||||||||
3        194314892        C        CTGTCTG        iHART1417        GT:AD:DP:GQ:PL        0/1:9,5:.:99:222,0,402        PASS;AC=1;AF=0;AN=4582;BaseQRankSum=-0.413;DP=88579;ExcessHet=4.2252;FS=3.88;InbreedingCoeff=-0.0038;MLEAC=1;MLEAF=0.0002141;MQ=62.8;MQ0=0;MQRankSum=-0.945;NDA=2;NEGATIVE_TRAIN_SITE;QD=11.41;ReadPosRankSum=-0.083;SOR=1.508;VQSLOD=-0.9498;VariantType=INSERTION.NOVEL_6;culprit=FS;cytoBand=3q29;Func=intronic;Gene=TMEM44;CSQ=TGTCTG|intron_variant|MODIFIER|TMEM44|ENSG00000145014|Transcript|ENST00000392432|protein_coding||10/10|ENST00000392432.2:c.1318-5525_1318-5524insCAGACA|||||||||-1|insertion|HGNC|25120|YES|||CCDS54699.1|ENSP00000376227|TMM44_HUMAN|Q96I73_HUMAN|UPI00015E0940||||||||||||||||||||||||||||||||||

Description: This script creates a database from the RDNV flat file. The first step in creating an RDNV file that the annotation pipeline understands, is to remove the PAR annotation from the X & Y chromosome variants if these variants are labeled with ‘PAR’. This script then extracts the required columns from the RDNV flat file (Columns: Chr, Position, Ref, Alt, child_id, subfield_format, genotype: genotype_subfields, Info) and outputs a tab-delimited file with 8 columns. Additional columns present in the RDNV flat file are ignored, and the required columns are organized into the proper format for the pipeline. This output file will be referenced by the data_config.sh script in Step 4.

Step 2: Create a Variant_ID for each variant.
Usage:

bash makeAnnotationInputFileFromOwnFlatDb.sh <input: pipeline-specific flat file> <output_file: desired output file name and directory for annotation input file>

Output:

1.99034.T.G.iHART1945        TRUE
1.12878589.GT.G.iHART2634        FALSE
3.194314892.C.CTGTCTG.iHART1417        FALSE

Note: The Classification column is filled with FALSE everywhere except the first row in order to prevent the AUC calculation from crashing.

Description: This script outputs a tab delimited file with 2 columns, Variant_ID and Classification. The Variant_ID is made up of the first 5 columns from the output of Step 1 delimited by periods rather than a tab (e.g., Chr.Pos.Ref.Alt.Child). The Classification column is ignored when running the existing ARC model on new data; when developing the ARC model this column was used to assign true vs. false variant status in the training and the test set. This output file (“Variant ID file”) will be used as the input for the getAnnotation.sh script in Step 5.

Step 3: Make a version of the ABHet file that the pipeline understands.
Usage:

bash makeOwnABHetFile.sh <input: ABHet file with 9 columns and header> <output_file: desired output file name and directory for pipeline-specific ABHet
file>

Output:

1        99034        T        G        0.667
1        12878589        GT        G        0.568
3        194314892        C        CTGTCTG        0.643

Description: This script reformats the recalculated ABHet file with 9 columns into an ABHet file with no header and 5 columns (chr, pos, ref, alt, and adjusted_abhet) that are tab delimited. This output file will be used in the data_config.sh script in Step 4.

Step 4: Manually input the paths for your pipeline-specific RDNV flat file (output of Step 1) and your pipeline-specific ABHet file (output of Step 3) into the data_config.sh script.

vi data_config.sh

FLAT= # Absolute path to pipeline-specific RDNV flat file (output of Step 1)
ABHET= # Absolute path to pipeline-specific ABHet file (output of Step 3)

Description: The purpose of this step is to change the data_config.sh script to have absolute paths to the pipeline formatted RDNV flat file and the pipeline formatted ABHet file. The data_config.sh script is called within the setAnnotationEnv.sh script which is part of the getAnnotation.sh script in Step 6.

Step 5: Manually assign variables in setAnnotationEnv.sh script, lines 12-20 with the absolute paths specified.

vi setAnnotationEnv.sh

IG_LIST="" #Absolute path to Human_immunoglobulin_genes_HGNClist.txt file provided in the Annotation_Source_Files directory
GIAB="" #Absolute path to Genome_in_a_bottle_problematic_regions.bed file provided in the Annotation_Source_Files directory

ANNOVAR="" #Absolute path path to your preferred perl installation followed by an absolute path to the table_annovar.pl script from ANNOVAR, separated by a space
ANV_DIR=""; #Absolute path to Annotation_Source_Files directory
AWK="" #Absolute path to Gawk software directory
BED="" #Absolute path to BEDTools toolset directory 

REFG="" # Absolute path to Human Reference Genome hs37d5 file

Description: The purpose of this step is to change the setAnnotationEnv.sh script to have absolute paths to the necessary directories, tools, and files for the getAnnotation.sh script in Step 6.

Step 6: Run the annotation pipeline.
Usage:

bash getAnnotation.sh <input: annotation input file (output of Step 2 above)> <output_dir: output dir for annotation .out and intermediate files>

Note: Be sure to provide absolute path to the annotation input file, since this absolute path is required by the getAnnotation.sh script.
Note: This produces many intermediate files as well as a final .out file in the output directory.

Output (annotation .out file):

Variant_ID        CLASSIFICATION        AbpartsTCR        DnaseUwdukeGm12878UniPk        EncodeCaltechRnaSeqGm12878R2x75        RecombRate        ReplicationTiming        SimpleRepeat        Rmsk        SegDups        Top10multiallelic        SAMPLE_AD_REF        SAMPLE_AD_ALT        SAMPLE_DP        SAMPLE_GQ        SAMPLE_ADDP        DP        ABHet        ABHom        AN        BaseQRankSum        FS        InbreedingCoeff        MQ        MQRankSum        NDA        NEGATIVE_TRAIN_SITE        OND        POSITIVE_TRAIN_SITE        QD        ReadPosRankSum        SOR        VQSLOD        IG        Num_vars_in_IG        Num_vars_in_TCR_abparts        Trinucleotide_context        PCT_GC_content_100bp        All.simplerepeatsnocov        Decoy.bed        Superdupsmerged_all_sort        Systematic.Sequencing.Errors        VQSRv2.18_filterABQD_sorted        VQSRv2.18_filterAlign_sorted        VQSRv2.18_filterConflicting        VQSRv2.18_filterlt2Datasets        VQSRv2.18_filterMap        VQSRv2.18_filterSSE        Num_uncertain_regions        isIndel        isDnaseNA        isEncodeNA        isRecombRateNA        isReplicationTimingNA        isABHetNA        isABHomNA        isBaseQRankSumNA        isInbreedingNA        isMQRankSumNA        isReadPosRankSumNA        Num_SNV
1.99034.T.G.iHART1945        TRUE        0        0        0        1.11575        NA        0.78        1        1        0        10        5        15        99        0.333333        73656        0.667        1        4480        1.49        0        -0.0075        52.6        0.225        5        0        0        0        9.21        -1.091        0.523        -4.97        0        0        0        AAT        0.148515        0        0        0        0        0        0        0        0        0        0        0        0        0        0        0        1        0        0        0        0        0        0        1
1.12878589.GT.G.iHART2634        FALSE        0        0        0        0.998319        NA        0        1        1        0        21        16        37        99        0.432432        66478        0.568        NA        4600        -0.169        2.937        -0.0001        58.02        -0.659        1        0        0        0        7.51        0.751        1.197        -1.155        0        0        0        ACA        0.405941        0        0        0        0        0        0        0        0        0        0        0        1        0        0        0        1        0        1        0        0        0        0        0
3.194314892.C.CTGTCTG.iHART1417        FALSE        0        0        1        2.45749        1.000989        1.49        1        0        0        9        5        14        99        0.357143        88579        0.643        NA        4582        -0.413        3.88        -0.0038        62.8        -0.945        2        1        0        0        11.41        -0.083        1.508        -0.9498        0        0        0        AGA        0.316832        0        0        0        0        0        0        0        0        0        0        0        1        0        0        0        0        0        1        0        0        0        0        0

Description: This script annotates the Variant ID file with all ARC features and outputs the final .out file along with other intermediate files in the output directory. The getAnnotation.sh script calls five other scripts: (1) setAnnotationEnv.sh, (2) getAnnotation.annovar.sh, (3) getAnnotation.gatk.sh, (4) getAnnotation.other.sh, and (5) getAnnotation.adjust.perSampleCounts.sh. It also recalculates the per-sample count columns as needed. This output file is the Annotated Variant ID file and is the input for Step 7.

Run the classification pipeline:

Step 7: Run testRF.py.
Usage:

python testRF.py <input: Annotated Variant ID file> <output_file: desired output file name and directory for RDNV classification>

NB: Running this script on a large number of variants (we estimate >100,000 variants) can potentially crash your computer. If you think this might be an issue for you, use the preimputation.py script first and then run the above command with the skip_imputation argument.
python preimputation.py --input_annotation <input: Annotated Variant ID file (from Step 5)> <output_file: desired output file name and directory for preimputed .out file from annotation pipeline>

Output:

        ARC_Score        Classification
0        0.089        1.0
1        0.505        0.0
2        0.076        0.0
3        0.318        0.0

Description: The script calculates the ARC score, column name "ARC_Score", for each variant and generates a single indexed list of ARC scores. We recommend selecting a threshold of 0.4. We consider variants with an ARC score of <0.4 likely to be sequencing or cell line artifacts and a score of ≥0.4 to be true de novo variants. This script generates your “classification output file” which is used in Step 8.

Step 8: Recombine the ARC scores from your RDNV Classification output file (output of Step 7) with your Variant ID file (output of Step 6).
Usage:

paste <(awk '{print $1}' <annotation .out file>) <(awk 'BEGIN{OFS = '\t'; print "ARC_Score"} (NR != 1) {print $2}' <classification output file>) > <output_file: desired output file name and directory for Classification output file with Variant ID>

Output:

Variant_ID  ARC_Score
1.99034.T.G.iHART1945 0.089
1.12878589.GT.G.iHART2634 0.505
3.194314892.C.CTGTCTG.iHART1417 0.076

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

README.md

Artifact Removal by Classifier (ARC):

Setup/How to get started:

Recalculate ABHet for all RDNVs:

Run the annotation pipeline: Steps 1-4 prepare your input files by reformatting to pipeline compatible formats.

Run the classification pipeline:

About

Uh oh!

Releases

Packages

Contributors 3

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 124 Commits
Annotation_Source_Files		Annotation_Source_Files
Output		Output
Practice_Data		Practice_Data
Scripts		Scripts
Issues.md		Issues.md
README.md		README.md

walllab/iHART-ARC

Folders and files

Latest commit

History

Repository files navigation

README.md

Artifact Removal by Classifier (ARC):

Setup/How to get started:

Recalculate ABHet for all RDNVs:

Run the annotation pipeline: Steps 1-4 prepare your input files by reformatting to pipeline compatible formats.

Run the classification pipeline:

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Uh oh!

Languages

Packages