Sophie Watts¹, Zoë Migicovsky¹, Sean Myles*¹
¹Department of Plant, Food, and Environmental Sciences, Faculty of Agriculture, Dalhousie University
This repository contains the data and scripts used to reproduce the analyses in the manuscript "Genome-wide association studies in Canada's apple biodiversity collection".
Apple fruit quality traits such as fruit texture, sugar content, and firmness retention during storage are key targets for breeders. Understanding the genetic control of fruit quality traits can enable the development of genetic markers, useful for marker-assisted breeding of new apple cultivars. We genotyped over 260,000 single nucleotide polymorphisms (SNPs) across 1,054 apple accessions from Canada’s Apple Biodiversity Collection and performed genome-wide association for 21 fruit quality and phenology traits. We identified a locus on chromosome 15 associated with phenolic content and a locus on chromosome 10 that is associated with softening. We demonstrate that the top SNP on chromosome 10 is a better predictor of softening than markers commonly used for marker-assisted breeding of this trait. In addition, we identified a single locus on chromosome 3 that is associated with numerous traits including ripening time, firmness at harvest, and firmness after storage. The top SNP at the chromosome 3 locus is a nonsynonymous mutation within the NAC18.1 transcription factor. Given the association between variation at NAC18.1 and several key traits, we propose a model for the allelic effects at NAC18.1 on apple ripening and softening.
You can download a copy of all the files in this repository by cloning the git repository:
$ git clone https://github.com/MylesLab/abc-gwas.git

The repository is organized into the following directories:

├── source
├── data
├── outputs
├── shell scripts
├── GWAS results
└── figures

- The data directory contains all the raw data that was used for this project.
- The outputs directory contains the files generated from the raw data through data curation.
- The source directory contains the scripts used for data curation and performing various analyses of this project.
- The shell scripts directory contains the shell scripts to run code over the command line.
- The figures directory contains the intermediary figures generated for this project. Final figures were assembled using Adobe Illustrator.

Contents of the source directory:

- phenotype_curation.Rmd: code for curating the phenotype table and outputting a list of phenotypes.
- pop_analyses: code for visualizing principal components analysis.
- gwas_script.Rmd: script to create shell scripts that filter phenotype and genotype data, create kinship matrices, and run PCA.
- simple_m.R: script to run Simple M to calculate the effective number of markers.
- mlmm_gwas_batch_script_final.R: code to run the MLMM GWAS.
- correlations.R: code to run and visualize correlations between phenotypes of interest.
- top_snp_genotypes.Rmd: code to extract the genotypes of the top SNP hits from the GWAS of interest.
- ripening_model.Rmd: script to create the ripening model figure.
- boxplots_manhattans.Rmd: code for plotting Manhattan plots and boxplots.
- zoom_plots: script to plot zoomed-in views of Manhattan plots with gene annotations.
- snp_variance: script to calculate the proportion of variance explained by the top GWAS SNPs.
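
simple_m.R estimates the effective number of independent markers. One common use of such an estimate is a Bonferroni-style genome-wide significance threshold; that use is an assumption here, not something this README states. A minimal Python sketch follows, in which the function name and the example value of 50,000 are hypothetical (the repository's own value is in simple_m.out):

```python
import math


def significance_threshold(m_eff: int, alpha: float = 0.05) -> float:
    """Bonferroni-style threshold: alpha divided by the effective number of tests."""
    if m_eff <= 0:
        raise ValueError("m_eff must be a positive integer")
    return alpha / m_eff


# Hypothetical effective number of markers, for illustration only.
threshold = significance_threshold(50_000)
print(f"p-value threshold: {threshold:.2e}")
print(f"-log10 scale: {-math.log10(threshold):.2f}")
```

Using the effective number of markers rather than the raw SNP count gives a less conservative cutoff when many SNPs are in linkage disequilibrium.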

Contents of the data directory:

- pheno_meta_data.csv: phenotype data and metadata for all accessions in the ABC from Watts et al. 2021.
- abc_combined_maf001_sort_vineland_imputed_pheno_hetero90_maf001.nosex: file with the apple IDs that have genetic data.
- abc_combined_maf001_sort_vineland_imputed_pheno_hetero90_maf001.frq: minor allele frequency file for ABC SNPs.
- abc_combined_maf001_sort_vineland_imputed_pheno_hetero90_maf001_noctgs_pruned1.txt: PC values from PCA, contig SNPs removed and LD pruned.
- abc_combined_maf001_sort_vineland_imputed_pheno_hetero90_maf001_noctgs_pruned2.txt: eigenvalues from PCA, contig SNPs removed and LD pruned.
- abc_combined_maf001_sort_vineland_imputed_pheno_hetero90_maf001_het.het: PLINK file with heterozygosity per individual.
- top_snps.txt: file with the names of the top SNP hits from the GWAS of interest.
- abc_combined_maf001_sort_vineland_imputed_pheno_hetero90_maf001_top_gwas_snps.raw: genotype file that has been subset to include only the top SNPs of interest from the GWAS.
- abc_combined_maf001_sort_vineland_imputed_pheno_hetero90_maf001_top_gwas_snps.ped: PED genotype file that has been subset to include only the top SNPs of interest from the GWAS.
- abc_combined_maf001_sort_vineland_imputed_pheno_hetero90_maf001_top_gwas_snps.map: MAP genotype file that has been subset to include only the top SNPs of interest from the GWAS.
- gene_models_20170612.gff3.gz: gene model annotations from the GDDH genome version 1.
- vineland_snps.txt: the list of SNPs that were genotyped using HRM.

Contents of the outputs directory:

- geno_pheno_meta_data.csv: filtered phenotype table that includes the 1,054 apple IDs that have both phenotype data and genotype data.
- pheno_list.txt: list with names of phenotypes from Watts et al. 2021.
- pheno_list_2017: list with names of phenotypes used for GWAS.
- contig_snps_remove.txt: list of unanchored SNPs to be removed from the SNP table.
- pc1_vs_phenos.csv: the R-squared values and p-values from the correlation of phenotypes with PC1.
- pc2_vs_phenos.csv: the R-squared values and p-values from the correlation of phenotypes with PC2.
- phenotype_sample_sizes.txt: table with sample sizes for each phenotype.
- phenotypes4gwas: folder that contains two files for each phenotype: one with a list of the apple IDs that have trait data for that phenotype, and one with the trait measurements per apple ID for that phenotype.
- simple_m.out: output from the Simple M package that calculates the effective number of markers.
- top_snp_genos.csv: file with the genotypes of the top SNP hits from the GWAS.
- gene_annotations: folder containing the files with gene annotations surrounding the top GWAS hits.
- snp_variation.csv: R-squared values from LMs with top GWAS SNPs and PCs.
- top_snps_pheno_pcs.csv: file with genotypes of the top SNPs from GWAS, phenotype data, and PCs.
- ripening_model_summary.csv: median values for trait measurements across the genotypic classes at NAC18.1.
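
A summary like ripening_model_summary.csv reports median trait values per genotypic class. The sketch below shows one way such medians could be computed in Python. It assumes genotypes are coded as 0, 1, or 2 copies of the counted allele (as in a PLINK .raw file); the function name and the example data are made up for illustration and are not taken from the repository.

```python
from collections import defaultdict
from statistics import median


def median_by_genotype(genotypes, trait_values):
    """Return the median trait value for each genotype class (0, 1, 2).

    Pairs where either the genotype or the trait value is missing (None)
    are skipped.
    """
    groups = defaultdict(list)
    for genotype, value in zip(genotypes, trait_values):
        if genotype is None or value is None:
            continue
        groups[genotype].append(value)
    return {g: median(vals) for g, vals in sorted(groups.items())}


# Made-up example data, not real measurements.
genos = [0, 0, 1, 1, 1, 2, 2, None]
firmness = [9.5, 10.1, 8.2, 7.9, None, 6.0, 6.4, 7.0]
print(median_by_genotype(genos, firmness))
```

Medians are less sensitive to outlying accessions than means, which matters when a genotype class contains few individuals.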

Contents of the shell scripts directory:

- abc_pca.sh: script to run PCA for the whole SNP set with TASSEL.
- genotype_filtering_plink.sh: script that contains PLINK commands to filter the ABC MAP and PED files so that they contain only the apple IDs for a particular phenotype, apply a MAF filter of 0.01, and output a MAP and PED file for each phenotype.
- kinship.sh: script that contains the TASSEL commands to make a kinship matrix for each phenotype.
- pca.sh: script with commands to run PCA with TASSEL for each individual phenotype file.
- geno_raw.sh: script with commands to recode PED and MAP files into .raw files for the MLMM GWAS.
- simple_m.sh: script to run simple_m.R.
- run_mlmm_gwas_batch_script_final.sh: script to run the MLMM GWAS code.
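
The per-phenotype filtering and recoding steps described for genotype_filtering_plink.sh and geno_raw.sh can be sketched as generated command lines. This Python sketch is not the repository's code: the file-naming scheme, phenotype names, and genotype prefix are hypothetical, and PLINK 1.9 syntax is assumed (`--keep`, `--maf`, `--recode`, and `--recode A` for .raw output).

```python
import shlex


def plink_commands(pheno, geno_prefix, maf=0.01):
    """Build two PLINK command lines for one phenotype.

    1. Keep only the samples listed for the phenotype, apply a MAF filter,
       and write new PED/MAP files.
    2. Recode the filtered PED/MAP files into an additive .raw file.
    """
    keep_file = f"{pheno}_ids.txt"    # hypothetical name: FID/IID list
    out_prefix = f"{pheno}_filtered"  # hypothetical output prefix
    filter_cmd = [
        "plink", "--file", geno_prefix,
        "--keep", keep_file,
        "--maf", str(maf),
        "--recode",
        "--out", out_prefix,
    ]
    raw_cmd = [
        "plink", "--file", out_prefix,
        "--recode", "A",
        "--out", out_prefix,
    ]
    return [shlex.join(filter_cmd), shlex.join(raw_cmd)]


# Hypothetical phenotype names and genotype prefix, for illustration only.
for pheno in ["firmness_harvest", "phenolic_content"]:
    for cmd in plink_commands(pheno, "abc_genotypes"):
        print(cmd)
```

Generating the commands from a phenotype list keeps the filter settings identical across traits; the printed lines could be written to a shell script and run from the command line.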

Contents of the GWAS results directory:

- mlmm_pvals: folder containing files for each phenotype with the SNP p-values from the MLMM GWAS.
- mlmm_qq: folder containing QQ plots from each MLMM GWAS.
- mlmm_manhattans: folder containing Manhattan plots from each MLMM GWAS.
- rss: folder containing files for each phenotype with the variance explained by the co-factor SNPs at each step of the MLMM GWAS.
- standard_pvals: folder containing files for each phenotype with the SNP p-values from the standard (MLM) GWAS.
- standard_qq: folder containing QQ plots from each standard (MLM) GWAS.
- standard_manhattans: folder containing Manhattan plots from each standard (MLM) GWAS.
- sbatch_command_mlmm.txt: commands to execute run_mlmm_gwas_batch_script_final.sh.