This is our analysis that applies the linguistic technique n-gram analysis with network theory to protein domain architectures, to represent the proteome as an abstracts the functional connections between proteins to describe either proteome-wide (base DANSy) or phenotype-specific changes from differential expression results (deDANSy).
How to cite: Please cite our bioRxiv paper, which contains further details on the methods and their applications. Here, we provide the code that produces the results and figures from the manuscript.
For this work, we recommend creating a local copy of this repository by inputting the following into your terminal:
git clone https://github.com/NaegleLab/DANSy_Applications
Create a virtual environment containing all the dependencies for the analysis using the following code in a terminal.
conda create env -f dansy_apps.yml
Activate the environment using conda activate dansy_apps for specific scripts or select the dansy_apps kernel for the Jupyter notebooks.
Several of the analysis provided here rely on reference files generated by CoDIAC. We have provided the reference files for the analysis conducted in our manuscript in the data folder. The copy of the reference file used for the analysis was generated on January 7th, 2025, and will be the default file used for analysis.
If you wish to generate the most up to date reference file to use for analysis, you will to take the following steps. First download the SwissProt ID list from Gencode and place in the main directory of your local copy of this repo. Then, go to the whole_proteome_reference.py file and change the reference file suffix variable to the current date. Finally, run the following code in a terminal to establish the environment that includes CoDIAC, which will query UniProt and InterPro for the domain architectures. (Note: This can take up to 2 hours after a fresh install as it will also establish a biomart sqlite database for use in other notebooks.)
conda create env -f dansy_codiac.yml
conda activate dansy_codiac
python scripts/whole_proteome_reference.py
conda deactivate
If you are going to use the new build, you will then have to tell each of the notebooks to use the current build of the reference file by changing the generateCompleteProteome.import_reference_file() command to generateCompleteProteome.import_reference_file(new_reference_file_suffix).
For more specific analysis, the following datasets are recommended to be downloaded from their source.
| Analysis | Dataset Source/Code |
|---|---|
| Fusion Gene Analysis | ChimerSeq Excel File from ChimerDB |
| TCGA Clinical Data | cBioPortal |
| PTM Systems | Provided in the Multispecies Reference Files and see the example CoDIAC script |
| Differential Gene Expression Results | Provided in the DEG_data folder and see the R Scripts README file |
We have provided several Jupyter notebooks that serve as examples of either DANSy or deDANSy analysis. Below are short summaries of their applications and the results, which are discussed in our manuscript.
For custom uses of DANSy or deDANSy, please visit our DANSy repository, which provides more general use cases and information on how to get started with the DANSy package for new datasets.
Analyzing the domain architectures across the human proteome and broadly characterizing the resulting network and information encoded by different versions related to n-gram length. Here, we find limiting n-grams to 10-grams in length will recapitulate network characteristics of the complete proteome and that specific domains such as the protein kinase, zinc finger C2H2, and EGF-like domains are n-grams that frequently lie along the shortest path and are connected to the most other n-gram nodes in the network.
Associated notebooks:
Focused on reversible post-translational modification systems (e.g. phosphorylation, methylation, acetylation) that operate under a reader-writer-eraser paradigm. Characterizing broad properties of how individual components combine in domain architectures and identifying general grammatical rules where eraser domains do not require additional reader domains to modify their activity. Meanwhile, reader domains will frequently associate as "adverbs" to modify the activity of writer domains. Further, the catalytic domains that act as "verbs" must be on separate proteins.
Associated notebooks:
Additional analysis on the phosphorylation systems that compares the network characteristics of domains associated with phosphotyrosine (pTyr) and phosphoserine/threonine (pSer/Thr) systems during the evolutionary period from yeast to humans. The pTyr system is evolutionarily younger and rapidly expanded during the transition to metazoans. Our analysis suggests that during this transition, species were sampling several configurations of the network before converging to a similar set of grammatical rules observed in other PTM systems.
Associated notebooks:
Here, we analyze how fusion genes may provide an avenue to explore new grammatical structures of domain architectures. Using network topology, we studied fusion genes reported in TCGA and find that the domain architectures of gene fusions still adhere to the same grammatical rules that were established for domain architectures within the complete proteome. We further explore kinase fusions and find that the diversity of domain combinations it is involved in both within the natural proteome and fusions reflect its involvement across biological functions by connecting distinct n-gram families.
Associated notebooks:
In these notebooks, we provide an example case of how differential expression DANSy (deDANSy) describes the coordinated behavior of gene expression/protein interaction networks. We focus on the gene expression changes induced by SOX10-deficiency through either a CRISPR-mediated or a therapeutic-induced mechanism. We demonstrate these two perturbations, despite showing similar phenotypes, achieve these through different domain language subnetworks. The RNA-sequencing results come from this study, which showed SOX10-deficient melanoma cells exhibit a dormant and invasive phenotype and sequencing results were retrieved from the NCBI Gene Omnibus repository with accession number GSE180568.
Additionally, we provide other example deDANSy results from other publicly available datasets. The differential gene expression results (csv files) are provided in the DEG_data folder, and the R scripts that use DESeq2 to obtain those results are provided in the R_Scripts folder. In the Jupyter notebooks, we collect and plot the data for each dataset. We also provide additional scripts that describe considerations taken during the development of deDANSy these are in deDANSy_development_considerations.
Through this work, we show the network separation and syntax enrichment analysis provided by deDANSy characterizes the molecular basis of cell phenotypes, and identifies novel information about the effectors which is distinct from current gene set enrichment analysis (GSEA) approaches.
Associated notebooks:

