Accurate cell type annotation is essential but difficult in single-cell RNA sequencing (scRNA-seq).
- Manual approaches are time-consuming, subjective, and inconsistent.
- Automated classifiers are faster but often fail to generalize across tissues, conditions, or closely related cell types.
- A key limitation for both is the lack of true ground truth for benchmarking.
scTypeEval provides a ground-truth-agnostic framework to systematically assess annotation quality using internal validation metrics.
- Quantifies inter-sample label consistency.
- Identifies ambiguous or misclassified populations.
- Enables reproducible benchmarking of manual annotations, automated classifiers, or clustering results.

Key features:
- Ground-truth agnostic – Evaluate annotations without external references.
- Cross-dataset benchmarking – Assess consistency across samples and studies.
- Customizable – Works with Seurat, SCE, or matrices; supports custom gene lists and parameters.
- Robust – Sensitive to misclassification; reliable across batch effects, label granularity, and sample sizes.

Install from GitHub:

```r
# install.packages("remotes")
remotes::install_github("carmonalab/scTypeEval")
```

1. Create scTypeEval object
An scTypeEval object can be created from a count matrix (genes as rows, cells as columns) with its corresponding metadata, from a Seurat object, or from a SingleCellExperiment (SCE) object. The metadata is expected to contain annotation labels and sample identifiers.
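A small set of toy inputs can make the expected shapes concrete. The sketch below builds a hypothetical count matrix and metadata frame in base R. The column names `celltype` and `patient_id` mirror the ones used in the processing step of this README; the gene names, cell names, and the assumption that metadata row names should match the matrix column names are illustrative and not taken from the package documentation.

```r
# Toy inputs with hypothetical values, shaped as described above
set.seed(1)
genes <- paste0("gene", 1:50)
cells <- paste0("cell", 1:120)

# Count matrix: genes in rows, cells in columns
count_matrix <- matrix(rpois(length(genes) * length(cells), lambda = 2),
                       nrow = length(genes),
                       dimnames = list(genes, cells))

# Metadata: one row per cell, with annotation labels and sample identifiers
metadata <- data.frame(
  celltype   = rep(c("CD4.Treg", "CD8.Tex", "CD8.Tn"), times = 40),
  patient_id = rep(paste0("P", 1:6), each = 20),
  row.names  = cells,
  stringsAsFactors = FALSE
)

# Sanity check (assumption: cells should be in the same order in both inputs)
stopifnot(identical(colnames(count_matrix), rownames(metadata)))

# Cells per sample and cell type
table(metadata$patient_id, metadata$celltype)
```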
```r
library(scTypeEval)

# From count matrix and metadata dataframe
sceval <- create.scTypeEval(matrix = count_matrix,
                            metadata = metadata)

# From Seurat object
sceval <- create.scTypeEval(seurat_obj)

# From SCE object
sceval <- create.scTypeEval(sce)
```

2. Process data
Process and normalize data stored in an scTypeEval object.
This step aggregates, filters, and normalizes the count matrix, storing results as DataAssay objects for single-cell and pseudobulk data.
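To illustrate what aggregation, filtering, and normalization can look like, here is a minimal base-R sketch of pseudobulk construction. It is not the package's internal implementation: the toy data, the grouping key, and the choice of log1p counts-per-million normalization are all assumptions made for illustration. Only the idea of a minimum number of cells per sample-celltype group comes from the `min.cells` parameter described in this section.

```r
set.seed(42)
# Hypothetical toy data: 20 genes x 60 cells
counts <- matrix(rpois(20 * 60, lambda = 3), nrow = 20,
                 dimnames = list(paste0("gene", 1:20), paste0("cell", 1:60)))
meta <- data.frame(
  celltype   = rep(c("A", "B"), each = 30),
  patient_id = rep(rep(c("P1", "P2", "P3"), times = c(15, 12, 3)), times = 2),
  row.names  = colnames(counts),
  stringsAsFactors = FALSE
)
min.cells <- 10

# One group per sample-celltype combination
group <- paste(meta$patient_id, meta$celltype, sep = "_")

# Filter: keep only groups with at least min.cells cells
n_cells <- table(group)
keep_groups <- names(n_cells)[n_cells >= min.cells]
keep <- group %in% keep_groups

# Aggregate: rowsum() sums rows by group, so transpose to cells x genes first
pseudobulk <- t(rowsum(t(counts[, keep]), group = group[keep]))

# Normalize: scale each pseudobulk profile to counts per million, then log1p
cpm <- t(t(pseudobulk) / colSums(pseudobulk)) * 1e6
log_cpm <- log1p(cpm)

dim(log_cpm)  # genes x retained sample-celltype groups
```

In this toy data the two `P3` groups contain only three cells each, so they are dropped before aggregation.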
```r
# Run data processing on an scTypeEval object
sceval <- Run.ProcessingData(
  scTypeEval = sceval,
  ident = "celltype",      # column in metadata defining identities (e.g. cell type)
  sample = "patient_id",   # column in metadata defining sample IDs
  min.samples = 5,         # minimum samples required to retain a cell type
  min.cells = 10           # minimum cells required per sample-celltype
)
```

3. Obtain relevant features
Extract relevant features such as highly variable genes (HVGs) and cell type marker genes, or add custom gene lists. Dissimilarity and subsequently consistency will be evaluated using these features.
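As a rough intuition for feature selection, the sketch below ranks genes by their variance across cells and keeps the top few. This is a deliberate simplification and an assumption on our part: it is not the variance model used by scran, and the toy matrix and the value of `ngenes` are hypothetical.

```r
set.seed(3)
# Hypothetical log-normalized expression: 100 genes x 80 cells
logexpr <- matrix(rnorm(100 * 80), nrow = 100,
                  dimnames = list(paste0("gene", 1:100), paste0("cell", 1:80)))

# Make the first five genes markedly more variable than the rest
logexpr[1:5, ] <- logexpr[1:5, ] * 4

# Rank genes by variance across cells and keep the top ngenes
ngenes <- 10
gene_var <- apply(logexpr, 1, var)
hvg <- names(sort(gene_var, decreasing = TRUE))[seq_len(ngenes)]

hvg
```

Downstream dissimilarities would then be computed on the rows of the expression matrix listed in `hvg`, or on any custom gene list of interest.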
```r
# Identify highly variable genes (HVGs)
sceval <- Run.HVG(
  scTypeEval = sceval,
  var.method = "scran",   # method to compute HVGs
  ngenes = 2000,          # number of HVGs to retain
  sample = TRUE           # whether to perform sample-level blocking
)

# Identify marker genes per cell type
sceval <- Run.GeneMarkers(
  scTypeEval = sceval,
  method = "scran.findMarkers",   # supported: scran.findMarkers
  ngenes.celltype = 50            # max markers per cell type
)
```

Custom gene lists can also be added using add.GeneList().
Optional: Add dimensional reduction embedding
Consistency metrics can be measured directly on relevant features selected earlier.
However, for most methods, their computation in a low-dimensional space (e.g., PCA) speeds up the process while yielding very similar results.
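The trade-off can be sketched outside the package with base R's `prcomp()`. The example below is illustrative only: the toy expression matrix, the group shift, and the number of components are assumptions, and it does not reproduce what `Run.PCA()` does internally. It compares pairwise distances computed on all features with distances computed on the first few principal components.

```r
set.seed(7)
# Hypothetical normalized expression: 200 cells x 40 selected features
expr <- matrix(rnorm(200 * 40), nrow = 200,
               dimnames = list(paste0("cell", 1:200), paste0("gene", 1:40)))

# Add structure: the first 100 cells over-express the first 10 genes
expr[1:100, 1:10] <- expr[1:100, 1:10] + 3

# PCA on centered and scaled features, keeping ndim components
ndim <- 10
pca <- prcomp(expr, center = TRUE, scale. = TRUE, rank. = ndim)
embedding <- pca$x   # cells x ndim matrix of PC scores

# Pairwise Euclidean distances in the full and the reduced space
d_full <- dist(scale(expr))
d_red  <- dist(embedding)

# Correlation between the two sets of pairwise distances
dist_cor <- cor(as.vector(d_full), as.vector(d_red))
dist_cor
```

A correlation close to 1 would indicate that the reduced space preserves the distance structure of the full feature space for that dataset.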
```r
sceval <- Run.PCA(
  scTypeEval = sceval,
  ndim = 30   # number of PCs
)
```

Alternatively, you can insert pre-computed embeddings (e.g., PCA, UMAP, t-SNE) using add.DimReduction().
4. Compute dissimilarity across cell types and samples
The function Run.Dissimilarity() computes pairwise dissimilarities between cell types across samples stored in a scTypeEval object.
You can choose among several strategies depending on whether you want to compare cell type pseudobulk profiles, cell type single-cell distributions, or classification-based matches.
Available methods include:
- `Pseudobulk:<distance>` – computes distances between pseudobulk gene expression profiles. Supported distances: euclidean, cosine, pearson.
- `WasserStein` – computes Wasserstein distances between distributions of cells.
- `RecipClassif:<method>` – matches cells across groups using a classifier and computes dissimilarities. Supported methods: Match, Score.
By default, if reduction = TRUE, dissimilarity is computed on dimensional reduction embeddings (e.g. PCA).
Set reduction = FALSE to instead compute dissimilarities on processed expression data.
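For intuition about the pseudobulk distances, the following base-R sketch computes Euclidean, cosine, and Pearson dissimilarities between a few hypothetical pseudobulk profiles. The profiles are made up, and defining the cosine and Pearson dissimilarities as one minus the similarity is a common convention that we assume here; the package may define them differently.

```r
# Hypothetical pseudobulk profiles: rows = sample-celltype groups, columns = genes
pb <- rbind(
  P1_Tcell = c(10, 8, 1, 0),
  P2_Tcell = c(9, 9, 2, 1),
  P1_Bcell = c(1, 0, 12, 9)
)

# Euclidean distance between profiles
euclid <- as.matrix(dist(pb, method = "euclidean"))

# Cosine dissimilarity: 1 - cosine similarity between row vectors
unit <- pb / sqrt(rowSums(pb^2))
cosine <- 1 - unit %*% t(unit)

# Pearson dissimilarity: 1 - correlation between profiles
# (cor() correlates columns, so transpose to put profiles in columns)
pearson <- 1 - cor(t(pb))

round(euclid, 2)
round(cosine, 3)
round(pearson, 3)
```

Under all three measures, the two T cell profiles from different samples are closer to each other than either is to the B cell profile, which is the pattern a consistent annotation is expected to produce.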
```r
# Euclidean distance based on pseudobulk aggregation
sceval <- Run.Dissimilarity(sceval,
                            method = "Pseudobulk:Euclidean",
                            reduction = FALSE) # whether to compute dissimilarities in low dimensional space

# Wasserstein distance on embeddings
sceval <- Run.Dissimilarity(sceval,
                            method = "WasserStein",
                            reduction = TRUE)

# Reciprocal Classification Match using SingleR classifier
sceval <- Run.Dissimilarity(sceval,
                            method = "RecipClassif:Match",
                            ReciprocalClassifier = "SingleR")
```

Visualize dissimilarity matrix
The function plot.Heatmap() visualizes dissimilarity matrices stored in a scTypeEval object as annotated heatmaps.
This produces a ggplot2 heatmap with cell types grouped and optionally ordered by similarity or consistency.
```r
plot.Heatmap(sceval,
             dissimilarity.slot = "RecipClassif:Match",
             sort.consistency = "silhouette",
             sort.similarity = "Pseudobulk:Euclidean")
```

5. Compute consistency metrics
Evaluate inter-sample label consistency.
Consistency Metrics
scTypeEval supports a range of internal validation metrics to evaluate cell type annotation quality without external ground truth:
- silhouette – standard cohesion/separation score per cell type
- 2label.silhouette – silhouette variant comparing "own type" vs. all others
- NeighborhoodPurity – fraction of K nearest neighbors sharing the same cell type
- ward.PropMatch – proportion of a cell type in its dominant cluster (Ward-based)
- Orbital.medoid – fraction of cells closer to their medoid than any other type’s medoid
- Average.similarity – within-cell type similarity relative to other types
Higher values indicate stronger internal consistency. Metrics can be computed per dissimilarity assay for downstream comparison across cell types, metrics, and methods.
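To make the silhouette metric concrete, here is a minimal base-R sketch that computes the standard silhouette width, s = (b - a) / max(a, b), from a dissimilarity matrix and averages it per label. The helper `silhouette_by_label()` and the toy points are hypothetical and written for illustration; they are not part of scTypeEval, and the package's own computation may differ in its details.

```r
# Mean silhouette width per label, given a dissimilarity object and labels.
# a = mean dissimilarity to other observations with the same label
# b = smallest mean dissimilarity to the observations of any other label
silhouette_by_label <- function(d, labels) {
  d <- as.matrix(d)
  labels <- as.character(labels)
  widths <- vapply(seq_len(nrow(d)), function(i) {
    own <- labels == labels[i]
    own[i] <- FALSE                      # exclude the observation itself
    if (!any(own)) return(0)             # singleton label: width set to 0
    a <- mean(d[i, own])
    others <- setdiff(unique(labels), labels[i])
    b <- min(vapply(others, function(l) mean(d[i, labels == l]), numeric(1)))
    (b - a) / max(a, b)
  }, numeric(1))
  tapply(widths, labels, mean)
}

# Toy example: label "A" contains one point lying between the two groups
pts <- rbind(c(0, 0), c(0, 1), c(1, 0), c(10, 10), c(10, 11), c(5, 5))
labels <- c("A", "A", "A", "B", "B", "A")

sil <- silhouette_by_label(dist(pts), labels)
round(sil, 3)
```

The stray point at (5, 5) pulls down the mean silhouette of label "A" relative to the compact label "B", which is the kind of signal used to spot ambiguous or misassigned populations.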
```r
consis <- get.Consistency(sceval,
                          dissimilarity.slot = "Pseudobulk:Euclidean", # dissimilarity space in which to compute the metric
                          Consistency.metric = "silhouette"            # consistency metric to compute
)
```

Example of results table:

| celltype | measure | consistency.metric | dissimilarity_method | ident |
|---|---|---|---|---|
| CD4.Tn | -0.03104529 | silhouette | Pseudobulk:Euclidean | celltype |
| CD4.Tstr | -0.01739486 | silhouette | Pseudobulk:Euclidean | celltype |
| CD4.Tfh | 0.01741703 | silhouette | Pseudobulk:Euclidean | celltype |
| CD4.Tcm | 0.04147180 | silhouette | Pseudobulk:Euclidean | celltype |
| CD8.t.Teff | 0.04487724 | silhouette | Pseudobulk:Euclidean | celltype |
| CD4.Tctl | 0.10953912 | silhouette | Pseudobulk:Euclidean | celltype |
| CD8.Teff | 0.09775360 | silhouette | Pseudobulk:Euclidean | celltype |
| CD8.Tcm | 0.10451983 | silhouette | Pseudobulk:Euclidean | celltype |
| CD8.Tn | 0.13723062 | silhouette | Pseudobulk:Euclidean | celltype |
| CD8.Tstr | 0.15740846 | silhouette | Pseudobulk:Euclidean | celltype |
| CD4.Th17 | 0.24654894 | silhouette | Pseudobulk:Euclidean | celltype |
| CD8.Tex | 0.27483708 | silhouette | Pseudobulk:Euclidean | celltype |
| CD4.Treg | 0.26537602 | silhouette | Pseudobulk:Euclidean | celltype |
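
One way to work with such a table is to rank cell types by their score and flag those that fall below a cutoff. The sketch below rebuilds the `celltype` and `measure` columns from the table above as a plain data frame; the cutoff of 0 is a hypothetical choice for illustration (a negative silhouette means a group is, on average, closer to another cell type than to its own), not a threshold recommended by the package.

```r
# Values copied from the results table above
consis <- data.frame(
  celltype = c("CD4.Tn", "CD4.Tstr", "CD4.Tfh", "CD4.Tcm", "CD8.t.Teff",
               "CD4.Tctl", "CD8.Teff", "CD8.Tcm", "CD8.Tn", "CD8.Tstr",
               "CD4.Th17", "CD8.Tex", "CD4.Treg"),
  measure = c(-0.03104529, -0.01739486, 0.01741703, 0.04147180, 0.04487724,
              0.10953912, 0.09775360, 0.10451983, 0.13723062, 0.15740846,
              0.24654894, 0.27483708, 0.26537602),
  stringsAsFactors = FALSE
)

# Rank cell types from least to most consistent
ranked <- consis[order(consis$measure), ]

# Flag cell types below a hypothetical cutoff
cutoff <- 0
flagged <- consis[consis$measure < cutoff, ]

ranked
flagged$celltype
```

Flagged populations are candidates for closer inspection, for example by checking whether they should be merged with a neighboring type or re-annotated.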
The manuscript describing these methods is currently in preparation. A citation will be provided once the paper is published.

