eTAPE: endometrial Tissue-AdaPtive autoEncoder for accurate deconvolution and gene expression analysis
eTAPE deconvolves bulk RNA-seq data into cell-type fractions and predicts cell-type-specific gene expression, using scRNA-seq data as the reference. It focuses on endometrial tissue and additionally predicts the cell development time per cell type.
This repository contains the code for eTAPE, a modified version of the TAPE model. Highly experimental, proceed with caution.
NB: As of 12.2025, eTAPE is no longer developed, because the TAPE approach showed insufficient predictive power for dual-task learning of time-series prediction and deconvolution.
eTAPE uses PyTorch as its deep-learning framework, so a PyTorch build that matches your hardware will speed up model training. We recommend installing PyTorch (>=1.8.0) for the right compute platform (CUDA, CPU, or ROCm) from its official website in advance.
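As a quick sanity check after installation, the sketch below compares an installed PyTorch version string against the 1.8.0 minimum stated above. The helper names `parse_version` and `meets_minimum` are illustrative and not part of eTAPE; the check only looks at the first three numeric components of the version string.

```python
import re

MIN_TORCH = (1, 8, 0)  # minimum version stated in this README


def parse_version(version):
    """Turn a string such as '2.1.0+cu118' into (2, 1, 0), ignoring the local build suffix."""
    numbers = re.findall(r"\d+", version.split("+")[0])
    parts = [int(n) for n in numbers[:3]]
    return tuple(parts + [0] * (3 - len(parts)))


def meets_minimum(version, minimum=MIN_TORCH):
    """Return True if the version is at least the required minimum."""
    return parse_version(version) >= minimum


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        print("PyTorch", torch.__version__, "meets minimum:", meets_minimum(torch.__version__))
        print("GPU backend available:", torch.cuda.is_available())
```

If no GPU backend is reported, training should still run on the CPU, only more slowly.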
Required Files:
- single-cell reference: txt format, indices are cell types, columns are gene names
- bulk data: tabular format; the separator must be specified ('\t', ',', or other). Indices are sample names, columns are gene names
- gene length file: used to scale the expression values; columns should contain: [Gene name, Transcript start (bp), Transcript end (bp)]. This file is provided in the ./data/ directory.
Warning: single-cell reference and bulk samples should contain the same cell types
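The layout described above can be checked before training with a few lines of pandas. This is a minimal sketch on toy in-memory data: the gene names, cell-type labels, and the helpers `load_table` and `shared_genes` are placeholders, not part of eTAPE. Whether eTAPE itself intersects the gene sets internally is not stated here, so treat the overlap check as an assumption about what is useful to verify.

```python
import io

import pandas as pd


def load_table(source, sep="\t"):
    """Read a table whose first column holds row labels and whose header holds gene names."""
    return pd.read_csv(source, sep=sep, index_col=0)


def shared_genes(sc_ref, bulk):
    """Return the gene names present in both tables, in reference order."""
    bulk_genes = set(bulk.columns)
    return [gene for gene in sc_ref.columns if gene in bulk_genes]


# Toy single-cell reference: one row per cell, indexed by its cell type (tab-separated).
SC_REF_TXT = (
    "\tGENE_A\tGENE_B\tGENE_C\n"
    "Stromal\t5\t0\t2\n"
    "Epithelial\t1\t7\t0\n"
    "Stromal\t4\t1\t3\n"
)
# Toy bulk table: one row per sample (comma-separated).
BULK_CSV = ",GENE_B,GENE_C,GENE_D\nsample_1,10,3,8\nsample_2,6,9,1\n"

sc_ref = load_table(io.StringIO(SC_REF_TXT), sep="\t")
bulk = load_table(io.StringIO(BULK_CSV), sep=",")
genes = shared_genes(sc_ref, bulk)

print("cells per type:", sc_ref.index.value_counts().to_dict())
print("genes in both tables:", genes)
```

With real data, pass file paths instead of the `io.StringIO` objects.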
    # basic example
    # sc_ref and bulkdata are the inputs described under "Required Files"
    from eTAPE import Deconvolution
    SignatureMatrix, CellFractionPrediction = \
        Deconvolution(sc_ref, bulkdata, sep='\t', scaler='mms',
                      datatype='counts', genelenfile='./GeneLength.txt',
                      mode='overall', adaptive=True, variance_threshold=0.98,
                      save_model_name=None,
                      batch_size=128, epochs=128, seed=1)

Parameters:
- scaler: use 'mms' or 'ss' scaler to preprocess datasets, 'mms' stands for min-max scaler, 'ss' stands for standard scaler. In the paper, all datasets were tested using 'mms'.
- datatype: use 'counts'. The normalization method should match your single-cell sequencing technique; if the single-cell data comes from 10X Genomics, use 'counts' to keep the procedure reasonable. An explanation can be found on the webpage.
- mode: 'overall' or 'high-resolution'. If you need a signature matrix for each sample, use 'high-resolution' mode.
- adaptive: True or False. If False, the signature matrix is not predicted and None is returned in its place.
- variance_threshold: float between 0 and 1; the proportion of genes to keep, ranked by variance from high to low.
- batch_size: int; affects the training result. Values of 32-128 are recommended. A smaller batch_size increases training time.
- epochs: int; affects the training result. Typically, 5000-10000 iterations are enough for TAPE. The relation is
$\text{epochs}=\frac{\text{iterations} \times \text{batch\_size}}{\text{sampling\_num}}$
- seed: eTAPE supports fixing the random seed to make results reproducible.
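Two of the parameters above can be made concrete with a short sketch. The first helper applies the epochs relation, where the denominator (called `sampling_num` here) is the number of training samples per epoch. The second shows one plausible reading of `variance_threshold`: keep the top fraction of genes ranked by variance. Both helper names are hypothetical, and the filtering logic is an assumption based on the description above, not a copy of eTAPE's internals.

```python
import math

import numpy as np


def epochs_for_iterations(iterations, batch_size, sampling_num):
    """epochs = iterations * batch_size / sampling_num, rounded up to a whole epoch."""
    return math.ceil(iterations * batch_size / sampling_num)


def keep_high_variance_genes(expr, variance_threshold):
    """Keep the top `variance_threshold` fraction of columns (genes) by variance.

    expr: 2-D array with samples in rows and genes in columns.
    Returns the filtered array and the kept column indices in original order.
    """
    n_keep = int(expr.shape[1] * variance_threshold)
    order = np.argsort(expr.var(axis=0))[::-1]  # column indices, high to low variance
    kept = np.sort(order[:n_keep])              # restore the original column order
    return expr[:, kept], kept


# Assumed numbers for illustration: 5000 iterations over 5000 training samples.
n_epochs = epochs_for_iterations(iterations=5000, batch_size=128, sampling_num=5000)
print("epochs:", n_epochs)

# Toy expression matrix: 3 samples x 4 genes; gene 0 is constant.
toy_expr = np.array([
    [1.0, 5.0, 0.0, 2.0],
    [1.0, 9.0, 4.0, 2.0],
    [1.0, 1.0, 2.0, 3.0],
])
filtered, kept_columns = keep_high_variance_genes(toy_expr, variance_threshold=0.5)
print("kept gene columns:", kept_columns.tolist())
```

Under these assumed numbers, a batch size of 128 corresponds to 128 epochs, which is the same value used in the basic example.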
An example is placed in the Experiments directory. Please run the example to get familiar with eTAPE.
If you find a bug or run into problems while using eTAPE, feel free to open an issue.