MIRROR: Multi-Modal Pathological Self-Supervised Representation Learning via Modality Alignment and Retention
Tianyi Wang1, Jianan Fan1, Dingxin Zhang1, Dongnan Liu1, Yong Xia2, Heng Huang3, and Weidong Cai1
1The University of Sydney 2Northwestern Polytechnical University 3University of Maryland
🎉 MIRROR has been accepted for publication in IEEE Transactions on Medical Imaging (TMI).
Please give us a 🌟 if you find our work useful!
This is the official implementation of "MIRROR: Multi-Modal Pathological Self-Supervised Representation Learning via Modality Alignment and Retention". MIRROR is a novel multi-modal representation learning method designed to foster both modality alignment and retention. The detailed architecture is shown below.
We hope this repository will serve as a foundational framework for future research in computational pathology. It is fully equipped with a comprehensive suite of tools and all the code necessary for the field's typical downstream tasks.
- Release the main codebase
- Release tools
- Update README
- Document code
- Add code structure in README
```bash
git clone git@github.com:TianyiFranklinWang/MIRROR.git
```

Python version: 3.10.16 (use a compatible version to avoid any issues).

```bash
python -m pip install -r requirements.txt
lintrunner init
```

Follow these steps to prepare the histopathology data:
- Download the TCGA dataset from the GDC Data Portal.
- Organize the slides by cohort in the `./input/TCGA/[cohort]` directory.
- Use the following script to extract patches from the slides:
```bash
python ./tools/gen_patch.py --input-dir ./input/wsi/TCGA \
    --cohorts TCGA_BRCA \
              TCGA_LUAD \
              TCGA_LUSC \
              TCGA_COAD \
              TCGA_READ \
              TCGA_KIRC \
              TCGA_KIRP \
              TCGA_KICH
```
In this example, we specify the cohorts used in our manuscript. To adjust the cohorts, simply replace the names in the `--cohorts` argument with your desired cohort names, as in the example below.
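For instance, to extract patches for just one cohort (TCGA-BRCA here is an illustrative choice; any cohort directory under `./input/wsi/TCGA` works):

```bash
python ./tools/gen_patch.py --input-dir ./input/wsi/TCGA \
    --cohorts TCGA_BRCA
```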
- Use `./tools/feature_generation/gen_patch_feature.py` to generate features from the patches. We provide ResNet-50 and Phikon as backbone models. Note: configuration is currently managed through the `Config` class; we plan to switch to `argparse` for argument parsing in a future update.
```bash
python ./tools/feature_generation/gen_patch_feature.py
```

We provide the processed transcriptomics data on Kaggle, Hugging Face, and Zenodo for TCGA-BRCA, TCGA-NSCLC, TCGA-COADREAD, and TCGA-RCC.
For custom data preparation:
- Download the transcriptomics data `tcga_RSEM_isoform_fpkm.gz` and the mapping table `probeMap_gencode.v23.annotation.transcript.probemap` from Xena.
- After unzipping the transcriptomics file you will get a TSV file `tcga_RSEM_isoform_fpkm`. Put the extracted file and the mapping table into the `./input/raw_rna_features/` directory. Tip: we strongly recommend converting the `tcga_RSEM_isoform_fpkm` file from TSV to Apache Parquet using pandas, with the first column set as the index (see the sketch after this list); this speeds up processing and remains compatible with our script.
- Download disease-related genes from the COSMIC database and put them under `./input/raw_rna_feature/[cohort]`.
- Use `./tools/distill_rna_feature.py` to generate the pruned transcriptomics features:
```bash
python ./tools/distill_rna_feature.py --cohort [cohort] \
    --cosmic-genes [cosmic_file_name] \
    --wsi-feature-root ./input/wsi_feature/[backbone]/TCGA_FEATURE \
    --classes [class0_in_cohort] [class1_in_cohort] ...
```

You can collect the survival data from cBioPortal and place it under `./input/survival`.
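Regarding the Parquet tip above, here is a minimal pandas sketch of the conversion (the output filename is just an illustration, and writing Parquet requires `pyarrow` or `fastparquet` to be installed):

```python
import pandas as pd

# Load the Xena TSV with the first column (transcript IDs) as the index.
df = pd.read_csv(
    "./input/raw_rna_features/tcga_RSEM_isoform_fpkm",
    sep="\t",
    index_col=0,
)

# Save as Apache Parquet; subsequent loads are much faster than re-parsing the TSV.
df.to_parquet("./input/raw_rna_features/tcga_RSEM_isoform_fpkm.parquet")
```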
We provide two types of pre-training scripts: `train_pretrain.py` is the vanilla CLIP-like training script, and `train_mirror.py` trains our method.
Both scripts adopt a YAML-based configuration system; you can find configuration templates in `./configs` and modify them to suit your needs.
We also provide two ways to launch these scripts. One is through the bash scripts in the `./scripts` directory:
```bash
./scripts/run_train_pretrain.sh <nnodes> <nproc_per_node> <rdzv_backend> <rdzv_endpoint> <config_file> <fold_nb> [additional_args...]
./scripts/run_train_mirror.sh <nnodes> <nproc_per_node> <rdzv_backend> <rdzv_endpoint> <config_file> <fold_nb> [additional_args...]
```

`nnodes`, `nproc_per_node`, `rdzv_backend`, and `rdzv_endpoint` are the distributed-training parameters; you can find more details in the torchrun documentation. `config_file` is the path to the configuration file, `fold_nb` is the fold number for cross-validation, and `additional_args` are additional keyword arguments for the training scripts.
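For example, a single-node run on 4 GPUs with the `c10d` rendezvous backend might look like the following (the endpoint, config path, and fold number are illustrative placeholders):

```bash
./scripts/run_train_mirror.sh 1 4 c10d localhost:29500 ./configs/[config_file].yaml 0
```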
The other way is through a simple job-manager Python script, `./tools/pretrain_job_launcher.py`, which can automatically collect pre-training jobs and manage GPU resources:
```bash
python ./tools/pretrain_job_launcher.py --gpu-count [number_of_gpus] \
    --virtual-gpu-count [number_of_jobs_per_gpu] \
    --pretrain-launch-script [pretrain_launch_script] \
    --pretrain-config [config_file]
```

The downstream task evaluation scripts also adopt the YAML-based configuration system; you can find the templates in `./configs`.
Again, we provide two ways to launch them: through the bash scripts or through a job launcher.

```bash
./scripts/run_train_subtyping.sh <nnodes> <nproc_per_node> <rdzv_backend> <rdzv_endpoint> <config_file> <fold_nb> [checkpoint] [additional_args...]
./scripts/run_train_survival.sh <nnodes> <nproc_per_node> <rdzv_backend> <rdzv_endpoint> <config_file> <fold_nb> [checkpoint] [additional_args...]
```

`checkpoint` is an optional positional argument specifying the pre-trained weights to load, as in the example below.
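As an illustration, a single-GPU subtyping run that loads pre-trained weights might look like this (the config and checkpoint paths are placeholders):

```bash
./scripts/run_train_subtyping.sh 1 1 c10d localhost:29500 ./configs/[subtyping_config].yaml 0 [path_to_checkpoint]
```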
The job launcher `./tools/downstream_tasks_evaluator.py` can simultaneously manage both subtyping and survival analysis tasks:
```bash
python ./tools/downstream_tasks_evaluator.py --gpu-count [number_of_gpus] \
    --virtual-gpu-count [number_of_jobs_per_gpu] \
    --result-dir [pretrain_result_directory] \
    --checkpoint-file [pretrain_checkpoint_file_name] \
    --subtyping-linprob-config [subtyping_linear_probing_config_file] \
    --subtyping-10shot-config [subtyping_10shot_config_file] \
    --survival-linprob-config [survival_linear_probing_config_file] \
    --survival-10shot-config [survival_10shot_config_file]
```

`--result-dir` and `--checkpoint-file` are optional; if specified, the pre-trained weights will be loaded automatically.
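For instance, evaluating a pre-training run on 2 GPUs with two jobs per GPU might look like the following (all paths are illustrative placeholders):

```bash
python ./tools/downstream_tasks_evaluator.py --gpu-count 2 \
    --virtual-gpu-count 2 \
    --result-dir ./output/[pretrain_run] \
    --checkpoint-file [checkpoint_name].pth \
    --subtyping-linprob-config ./configs/[subtyping_linprob_config].yaml \
    --subtyping-10shot-config ./configs/[subtyping_10shot_config].yaml \
    --survival-linprob-config ./configs/[survival_linprob_config].yaml \
    --survival-10shot-config ./configs/[survival_10shot_config].yaml
```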
We also provide a number of miscellaneous tools to support your workflows:
- `./tools/gen_splits.py` is used to generate 5-fold cross-validation splits for each cohort:
```bash
python ./tools/gen_splits.py --root [path_to_wsi_features] \
    --class-name [TCGA_class_name]
```

- `./tools/gen_few_shot_files.py` is used to generate 5-fold cross-validation few-shot splits:
```bash
python ./tools/gen_few_shot_files.py --class-name [TCGA_class_name] \
    --survival-wsi-feature-dir [path_to_wsi_features_with_cohort] \
    --subyping-wsi-feature-dir [path_to_wsi_features_without_cohort] \
    --subyping-classes [class0_in_cohort] [class1_in_cohort] ... \
    --rna-feature-csv [path_to_pruned_rna_features] \
    --survival-csv [path_to_survival_data] \
    --split-dir [path_to_5foldcv_splits]
```

- `./tools/split_subtypes.py` is used to split the features by subtype within a cohort:
```bash
python ./tools/split_subtypes.py --input-folder [path_to_wsi_features] \
    --oncotree-code-csv [path_to_survival_data] \
    --target-oncotree-codes [subtype1_oncotree_code] [subtype2_oncotree_code] ...
```

- `./tools/split_weights.py` is used to split the pre-trained weights into histopathology and transcriptomics parts:
```bash
python ./tools/split_weights.py --result-dir [path_to_pretrain_result] \
    --weight-file [weight_file_name]
```

We use lintrunner to check our code. You can run the following command to check code quality:
```bash
lintrunner --all-files -a
```

For any inquiries or if you encounter issues, please feel free to contact us or open an issue.
This project is released under the GNU General Public License v3.0. Please see the LICENSE file for more information.
If you find our work useful, please cite it using the following BibTeX entry:
```bibtex
@article{wang2025mirror,
  author={Wang, Tianyi and Fan, Jianan and Zhang, Dingxin and Liu, Dongnan and Xia, Yong and Huang, Heng and Cai, Weidong},
  journal={IEEE Transactions on Medical Imaging},
  title={MIRROR: Multi-Modal Pathological Self-Supervised Representation Learning via Modality Alignment and Retention},
  year={2025},
  volume={},
  number={},
  pages={1-1},
  doi={10.1109/TMI.2025.3632555}
}
```

Developed with ❤️ by Tianyi Wang @ The University of Sydney
