allumik/eTAPE

eTAPE: endometrial Tissue-AdaPtive autoEncoder for accurate deconvolution and gene expression analysis

eTAPE deconvolves bulk RNA-seq data into cell-type fractions and predicts cell-type-specific gene expression, using scRNA-seq data as a reference. It focuses on endometrial tissue and additionally predicts the developmental time of each cell type.

This repository contains the code for eTAPE, a modified version of the TAPE model. Highly experimental, proceed with caution.

NB: As of 12.2025, eTAPE is no longer under development, because the TAPE approach showed insufficient predictive power for dual-task learning of time-series prediction and deconvolution.

Setup

eTAPE uses PyTorch as its deep-learning framework, so installing a PyTorch build that matches your hardware will speed up model training. We recommend installing PyTorch (>=1.8.0) for the appropriate compute platform (CUDA, CPU, or ROCm) from the official website in advance.
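As a quick sanity check of the requirement above, the following sketch tests whether an installed PyTorch meets the 1.8.0 minimum. The helper names `parse_version` and `torch_is_recent_enough` are made up for this illustration and are not part of eTAPE; only `torch.__version__` and `torch.cuda.is_available()` are actual PyTorch attributes.

```python
import importlib.util
import re

MIN_TORCH = (1, 8, 0)  # minimum version stated in this README


def parse_version(version):
    """Turn a string such as '2.1.0+cu118' into (2, 1, 0).

    Anything after '+' (the local build tag) is ignored, and only the
    leading digits of each component are used.
    """
    core = version.split("+")[0]
    parts = []
    for piece in core.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    parts += [0] * (3 - len(parts))  # pad '1.13' to (1, 13, 0)
    return tuple(parts)


def torch_is_recent_enough(version, minimum=MIN_TORCH):
    """Hypothetical helper: compare a version string against the minimum."""
    return parse_version(version) >= minimum


if importlib.util.find_spec("torch") is None:
    print("PyTorch is not installed")
else:
    import torch

    print("PyTorch", torch.__version__,
          "meets minimum:", torch_is_recent_enough(torch.__version__))
    # Reports whether a GPU backend is usable in this build.
    print("GPU backend available:", torch.cuda.is_available())
```

If the check reports that no GPU backend is available, training should still run on CPU, only more slowly.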

Usage

Required Files:

  1. single-cell reference: txt format; rows (indices) are cell types, columns are gene names
  2. bulk data: tabular format; the separator must be specified ('\t', ',' or another character); rows (indices) are sample names, columns are gene names
  3. gene length file: used to scale the expression values; it should contain the columns [Gene name, Transcript start (bp), Transcript end (bp)]. A copy is provided in the ./data/ directory.

Warning: the single-cell reference and the bulk samples should contain the same cell types.
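To illustrate the table layouts described above, here is a minimal sketch that reads the row labels and gene names from two tiny made-up tables and lists the genes they share. It assumes the first column of each file holds the index and the first header cell labels that column; the function `read_labels` and the toy data are hypothetical and not part of eTAPE, and checking gene overlap is a suggested sanity check rather than a documented requirement.

```python
import csv
import io


def read_labels(handle, sep="\t"):
    """Return (row_labels, gene_names) for a table whose first column is the index."""
    reader = csv.reader(handle, delimiter=sep)
    header = next(reader)
    genes = header[1:]  # assumption: the first header cell labels the index column
    rows = [row[0] for row in reader if row]
    return rows, genes


# Toy stand-ins for the real files (values are invented).
sc_ref_text = (
    "celltype\tGENE_A\tGENE_B\tGENE_C\n"
    "stromal\t5\t0\t2\n"
    "epithelial\t1\t7\t0\n"
)
bulk_text = (
    "sample,GENE_B,GENE_C,GENE_D\n"
    "s1,10,3,8\n"
    "s2,4,6,1\n"
)

cell_types, sc_genes = read_labels(io.StringIO(sc_ref_text), sep="\t")
samples, bulk_genes = read_labels(io.StringIO(bulk_text), sep=",")

# Genes present in both tables, in the order of the single-cell reference.
bulk_gene_set = set(bulk_genes)
shared = [gene for gene in sc_genes if gene in bulk_gene_set]

print("cell types:", cell_types)
print("samples:", samples)
print("shared genes:", shared)
```

With real files, `io.StringIO(...)` would be replaced by `open(path, newline="")`, and the `sep` argument should match the separator later passed to `Deconvolution`.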

# basic example
from eTAPE import Deconvolution
SignatureMatrix, CellFractionPrediction = \
    Deconvolution(sc_ref, bulkdata, sep='\t', scaler='mms',
                  datatype='counts', genelenfile='./GeneLength.txt',
                  mode='overall', adaptive=True, variance_threshold=0.98,
                  save_model_name=None,
                  batch_size=128, epochs=128, seed=1)

Parameters:

  1. scaler: 'mms' (min-max scaler) or 'ss' (standard scaler), used to preprocess the datasets. In the paper, all datasets were tested using 'mms'.
  2. datatype: use 'counts'. Choose the normalization method that matches your single-cell sequencing technique; if the single-cell data come from 10X Genomics, use 'counts' to keep the procedure reasonable. An explanation can be found on the webpage.
  3. mode: 'overall' or 'high-resolution'. Use 'high-resolution' if you need a signature matrix for each sample.
  4. adaptive: True or False. If False, the signature matrix is not predicted and None is returned in its place.
  5. variance_threshold: float between 0 and 1; the proportion of genes to keep, ranked by variance from high to low.
  6. batch_size: int; affects the training result. Values of 32-128 are recommended. A smaller batch_size increases training time.
  7. epochs: int; affects the training result. Typically, 5000-10000 iterations are enough for TAPE; the relation is $\text{epochs}=\frac{\text{iterations} \times \text{batch\_size}}{\text{sampling\_num}}$
  8. seed: eTAPE now supports fixing the random seed to make results reproducible.
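The relation between epochs and iterations given in the parameter list can be worked through numerically. The sketch below simply rearranges that formula; `epochs_for` is a hypothetical helper, and the sampling_num value of 5000 is an assumed figure chosen for illustration, not a documented default.

```python
def epochs_for(iterations, batch_size, sampling_num):
    """Epochs needed to reach `iterations` training steps, rounded up.

    Uses epochs = iterations * batch_size / sampling_num, with integer
    ceiling division to avoid floating-point rounding.
    """
    total = iterations * batch_size
    return (total + sampling_num - 1) // sampling_num


# Assumed figures: 5000 target iterations, batch size 128,
# and 5000 training samples (hypothetical sampling_num).
print(epochs_for(5000, 128, 5000))   # 128
print(epochs_for(10000, 32, 5000))   # 64
```

Under these assumed numbers, halving the batch size halves the epochs needed for the same number of iterations, although each epoch then takes more steps.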

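One way to read the variance_threshold description is "keep the top fraction of genes when ranked by variance". The sketch below implements that reading on invented data. It is an interpretation of the parameter description, not eTAPE's actual filtering code, and `keep_high_variance_genes` is a hypothetical name.

```python
import math
from statistics import pvariance


def keep_high_variance_genes(expression, threshold):
    """Return the top `threshold` fraction of genes, ranked by variance.

    expression: dict mapping gene name -> list of values across samples.
    threshold:  float in (0, 1], the proportion of genes to keep.
    """
    n_keep = math.ceil(threshold * len(expression))
    ranked = sorted(expression,
                    key=lambda gene: pvariance(expression[gene]),
                    reverse=True)
    return ranked[:n_keep]


# Invented expression values for four genes across four samples.
toy = {
    "GENE_A": [1, 1, 1, 1],    # variance 0
    "GENE_B": [0, 10, 0, 10],  # variance 25
    "GENE_C": [2, 4, 2, 4],    # variance 1
    "GENE_D": [5, 6, 5, 6],    # variance 0.25
}

kept = keep_high_variance_genes(toy, 0.5)
print(kept)  # ['GENE_B', 'GENE_C']
```

Under this reading, the value 0.98 from the basic example would discard only the 2% of genes with the lowest variance.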
Example

An example is placed in the Experiments directory. Please run the example to get familiar with eTAPE.

Issues

If you find a bug or run into problems while using eTAPE, feel free to open an issue.
