MIRROR: Multi-Modal Pathological Self-Supervised Representation Learning via Modality Alignment and Retention
Tianyi Wang1, Jianan Fan1, Dingxin Zhang1, Dongnan Liu1, Yong Xia2, Heng Huang3, and Weidong Cai1
1The University of Sydney 2Northwestern Polytechnical University 3University of Maryland
🎉 MIRROR has been accepted for publication in IEEE Transactions on Medical Imaging (TMI).
Please give us a 🌟 if you find our work useful!
This is the official implementation of "MIRROR: Multi-Modal Pathological Self-Supervised Representation Learning via Modality Alignment and Retention". MIRROR is a novel multi-modal representation learning method designed to foster both modality alignment and retention. The detailed architecture is shown below.
We hope this repository will serve as a foundational framework for future research in computational pathology. It is fully equipped with a comprehensive suite of tools and all the code necessary for the field's typical downstream tasks.
- Release the main codebase
- Release tools
- Update README
- Document code
- Add code structure in README
```bash
git clone git@github.com:TianyiFranklinWang/MIRROR.git
```

Python version: 3.10.16 (use a compatible version to avoid any issues).

```bash
python -m pip install -r requirements.txt
lintrunner init
```

Follow these steps to prepare the histopathology data:
- Download the TCGA dataset from the GDC Data Portal.
- Organize the slides by cohort in the `./input/TCGA/[cohort]` directory.
- Use the following script to extract patches from the slides:
```bash
python ./tools/gen_patch.py --input-dir ./input/wsi/TCGA \
    --cohorts TCGA_BRCA \
              TCGA_LUAD \
              TCGA_LUSC \
              TCGA_COAD \
              TCGA_READ \
              TCGA_KIRC \
              TCGA_KIRP \
              TCGA_KICH
```
In this example, we specify the cohorts used in our manuscript. To adjust the cohorts, simply replace the names in the `--cohorts` argument with your desired cohort names, as in the example below.
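For instance, to extract patches for just one cohort (TCGA-BRCA here is an illustrative choice; any cohort directory under `./input/wsi/TCGA` works):

```bash
python ./tools/gen_patch.py --input-dir ./input/wsi/TCGA \
    --cohorts TCGA_BRCA
```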
- Use `./tools/feature_generation/gen_patch_feature.py` to generate features from the patches. We provide ResNet-50 and Phikon as backbone models. Note: configuration is currently managed through the `Config` class; we plan to switch to `argparse` for argument parsing in a future update.
```bash
python ./tools/feature_generation/gen_patch_feature.py
```

We provide the processed transcriptomics data on Kaggle, Hugging Face, and Zenodo for TCGA-BRCA, TCGA-NSCLC, TCGA-COADREAD, and TCGA-RCC.
For custom data preparation:
- Download the transcriptomics data `tcga_RSEM_isoform_fpkm.gz` and the mapping table `probeMap_gencode.v23.annotation.transcript.probemap` from Xena.
- After unzipping the transcriptomics file you will get a TSV file `tcga_RSEM_isoform_fpkm`. Put the extracted file and the mapping table into the `./input/raw_rna_features/` directory. Tip: we strongly recommend converting the `tcga_RSEM_isoform_fpkm` file from TSV to Apache Parquet using pandas, with the first column set as the index (see the sketch after this list); this speeds up processing and remains compatible with our script.
- Download disease-related genes from the COSMIC database and put them under `./input/raw_rna_feature/[cohort]`.
- Use `./tools/distill_rna_feature.py` to generate the pruned transcriptomics features:
```bash
python ./tools/distill_rna_feature.py --cohort [cohort] \
    --cosmic-genes [cosmic_file_name] \
    --wsi-feature-root ./input/wsi_feature/[backbone]/TCGA_FEATURE \
    --classes [class0_in_cohort] [class1_in_cohort] ...
```

You can collect the survival data from cBioPortal and place it under `./input/survival`.
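Regarding the Parquet tip above, here is a minimal pandas sketch of the conversion (the output filename is just an illustration, and writing Parquet requires `pyarrow` or `fastparquet` to be installed):

```python
import pandas as pd

# Load the Xena TSV with the first column (transcript IDs) as the index.
df = pd.read_csv(
    "./input/raw_rna_features/tcga_RSEM_isoform_fpkm",
    sep="\t",
    index_col=0,
)

# Save as Apache Parquet; subsequent loads are much faster than re-parsing the TSV.
df.to_parquet("./input/raw_rna_features/tcga_RSEM_isoform_fpkm.parquet")
```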
We provide two types of pre-training scripts: `train_pretrain.py` is the vanilla CLIP-like training script, and `train_mirror.py` trains our method.
Both scripts adopt a YAML-based configuration system; you can find configuration templates in `./configs` and modify them to suit your needs.
We also provide two ways to launch these scripts. One is through the bash scripts in the `./scripts` directory:
```bash
./scripts/run_train_pretrain.sh <nnodes> <nproc_per_node> <rdzv_backend> <rdzv_endpoint> <config_file> <fold_nb> [additional_args...]
./scripts/run_train_mirror.sh <nnodes> <nproc_per_node> <rdzv_backend> <rdzv_endpoint> <config_file> <fold_nb> [additional_args...]
```

`nnodes`, `nproc_per_node`, `rdzv_backend`, and `rdzv_endpoint` are the distributed-training parameters; you can find more details in the torchrun documentation. `config_file` is the path to the configuration file, `fold_nb` is the fold number for cross-validation, and `additional_args` are additional keyword arguments for the training scripts.
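For example, a single-node run on 4 GPUs with the `c10d` rendezvous backend might look like the following (the endpoint, config path, and fold number are illustrative placeholders):

```bash
./scripts/run_train_mirror.sh 1 4 c10d localhost:29500 ./configs/[config_file].yaml 0
```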
The other way is through a simple job-manager Python script, `./tools/pretrain_job_launcher.py`, which can automatically collect pre-training jobs and manage GPU resources:
```bash
python ./tools/pretrain_job_launcher.py --gpu-count [number_of_gpus] \
    --virtual-gpu-count [number_of_jobs_per_gpu] \
    --pretrain-launch-script [pretrain_launch_script] \
    --pretrain-config [config_file]
```

The downstream task evaluation scripts also adopt the YAML-based configuration system; you can find the templates in `./configs`.
Again, we provide two ways to launch them: through the bash scripts or through a job launcher.

```bash
./scripts/run_train_subtyping.sh <nnodes> <nproc_per_node> <rdzv_backend> <rdzv_endpoint> <config_file> <fold_nb> [checkpoint] [additional_args...]
./scripts/run_train_survival.sh <nnodes> <nproc_per_node> <rdzv_backend> <rdzv_endpoint> <config_file> <fold_nb> [checkpoint] [additional_args...]
```

`checkpoint` is an optional positional argument specifying the pre-trained weights to load, as in the example below.
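As an illustration, a single-GPU subtyping run that loads pre-trained weights might look like this (the config and checkpoint paths are placeholders):

```bash
./scripts/run_train_subtyping.sh 1 1 c10d localhost:29500 ./configs/[subtyping_config].yaml 0 [path_to_checkpoint]
```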
The job launcher `./tools/downstream_tasks_evaluator.py` can simultaneously manage both subtyping and survival analysis tasks:
```bash
python ./tools/downstream_tasks_evaluator.py --gpu-count [number_of_gpus] \
    --virtual-gpu-count [number_of_jobs_per_gpu] \
    --result-dir [pretrain_result_directory] \
    --checkpoint-file [pretrain_checkpoint_file_name] \
    --subtyping-linprob-config [subtyping_linear_probing_config_file] \
    --subtyping-10shot-config [subtyping_10shot_config_file] \
    --survival-linprob-config [survival_linear_probing_config_file] \
    --survival-10shot-config [survival_10shot_config_file]
```

`--result-dir` and `--checkpoint-file` are optional; if specified, the pre-trained weights will be loaded automatically.
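For instance, evaluating a pre-training run on 2 GPUs with two jobs per GPU might look like the following (all paths are illustrative placeholders):

```bash
python ./tools/downstream_tasks_evaluator.py --gpu-count 2 \
    --virtual-gpu-count 2 \
    --result-dir ./output/[pretrain_run] \
    --checkpoint-file [checkpoint_name].pth \
    --subtyping-linprob-config ./configs/[subtyping_linprob_config].yaml \
    --subtyping-10shot-config ./configs/[subtyping_10shot_config].yaml \
    --survival-linprob-config ./configs/[survival_linprob_config].yaml \
    --survival-10shot-config ./configs/[survival_10shot_config].yaml
```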
We also provide a number of miscellaneous tools to support your workflows:
- `./tools/gen_splits.py` is used to generate 5-fold cross-validation splits for each cohort:
```bash
python ./tools/gen_splits.py --root [path_to_wsi_features] \
    --class-name [TCGA_class_name]
```

- `./tools/gen_few_shot_files.py` is used to generate 5-fold cross-validation few-shot splits:
```bash
python ./tools/gen_few_shot_files.py --class-name [TCGA_class_name] \
    --survival-wsi-feature-dir [path_to_wsi_features_with_cohort] \
    --subyping-wsi-feature-dir [path_to_wsi_features_without_cohort] \
    --subyping-classes [class0_in_cohort] [class1_in_cohort] ... \
    --rna-feature-csv [path_to_pruned_rna_features] \
    --survival-csv [path_to_survival_data] \
    --split-dir [path_to_5foldcv_splits]
```

- `./tools/split_subtypes.py` is used to split the features by subtype within a cohort:
```bash
python ./tools/split_subtypes.py --input-folder [path_to_wsi_features] \
    --oncotree-code-csv [path_to_survival_data] \
    --target-oncotree-codes [subtype1_oncotree_code] [subtype2_oncotree_code] ...
```

- `./tools/split_weights.py` is used to split the pre-trained weights into histopathology and transcriptomics parts:
```bash
python ./tools/split_weights.py --result-dir [path_to_pretrain_result] \
    --weight-file [weight_file_name]
```

We use lintrunner to check our code. You can run the following command to check code quality:
```bash
lintrunner --all-files -a
```

For any inquiries or if you encounter issues, please feel free to contact us or open an issue.
This project is released under the GNU General Public License v3.0. Please see the LICENSE file for more information.
If you find our work useful, please cite it using the following BibTeX entry:
```bibtex
@article{wang2025mirror,
  author={Wang, Tianyi and Fan, Jianan and Zhang, Dingxin and Liu, Dongnan and Xia, Yong and Huang, Heng and Cai, Weidong},
  journal={IEEE Transactions on Medical Imaging},
  title={MIRROR: Multi-Modal Pathological Self-Supervised Representation Learning via Modality Alignment and Retention},
  year={2025},
  volume={},
  number={},
  pages={1-1},
  doi={10.1109/TMI.2025.3632555}
}
```

Developed with ❤️ by Tianyi Wang @ The University of Sydney
