
MIRROR: Multi-Modal Pathological Self-Supervised Representation Learning via Modality Alignment and Retention

Tianyi Wang1, Jianan Fan1, Dingxin Zhang1, Dongnan Liu1, Yong Xia2, Heng Huang3, and Weidong Cai1

1The University of Sydney     2Northwestern Polytechnical University     3University of Maryland

IEEE TMI Early Access · arXiv · Code


🎉 MIRROR has been accepted for publication in IEEE Transactions on Medical Imaging (TMI).


Please give us a 🌟 if you find our work useful!

Introduction

This is the official implementation of "MIRROR: Multi-Modal Pathological Self-Supervised Representation Learning via Modality Alignment and Retention". MIRROR is a novel multi-modal representation learning method designed to foster both modality alignment and retention. The detailed architecture is shown below.

We hope this repository can serve as a foundational framework for future research in computational pathology. It ships with a comprehensive suite of tools and all the code needed for the typical downstream tasks in this field.

[Figure: MIRROR architecture overview]

Roadmap

  • Release the main codebase
  • Release tools
  • Update README
  • Document code
  • Add code structure in README

Getting Started

Clone the Repository

git clone git@github.com:TianyiFranklinWang/MIRROR.git

Setup Environment

Python version: 3.10.16 (use a compatible version to avoid issues)

python -m pip install -r requirements.txt
lintrunner init

Prepare Data

Histopathology Data

Follow these steps to prepare the histopathology data:

  • Download the TCGA dataset from the GDC Data Portal.
  • Organize the slides by cohort in the ./input/wsi/TCGA/[cohort] directory.
  • Use the following script to extract patches from the slides:
python ./tools/gen_patch.py --input-dir ./input/wsi/TCGA \
 --cohorts TCGA_BRCA \
 TCGA_LUAD \
 TCGA_LUSC \
 TCGA_COAD \
 TCGA_READ \
 TCGA_KIRC \
 TCGA_KIRP \
 TCGA_KICH

In this example, we specify the cohorts used in our manuscript. To adjust the cohorts, simply replace the names in the --cohorts argument with your desired cohort names.

  • Use ./tools/feature_generation/gen_patch_feature.py to generate features from the patches. We provide ResNet-50 and Phikon as backbone models. Note: configuration is currently managed through the Config class; we plan to switch to argparse for argument parsing in a future update. A minimal sketch of the feature-extraction idea follows the command below.
python ./tools/feature_generation/gen_patch_feature.py
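
For orientation, here is a minimal, self-contained sketch of the feature-extraction idea with a ResNet-50 backbone. It is not the repository script: the patch folder layout and output format below are illustrative assumptions, and the real pipeline is configured via the Config class in gen_patch_feature.py.

# Illustrative sketch only -- the actual pipeline lives in
# ./tools/feature_generation/gen_patch_feature.py.
# The patch directory and output layout are assumptions for demonstration.
import torch
import torch.nn as nn
from torchvision import models, transforms
from torchvision.models import ResNet50_Weights
from PIL import Image
from pathlib import Path

device = "cuda" if torch.cuda.is_available() else "cpu"

# ImageNet-pretrained ResNet-50 with the classifier removed -> 2048-d features
backbone = models.resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = nn.Identity()
backbone.eval().to(device)

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

patch_dir = Path("./input/patch/TCGA_BRCA/slide_0")  # hypothetical patch folder
with torch.no_grad():
    feats = torch.stack([
        backbone(preprocess(Image.open(p).convert("RGB")).unsqueeze(0).to(device)).squeeze(0)
        for p in sorted(patch_dir.glob("*.png"))
    ])
torch.save(feats.cpu(), patch_dir.with_suffix(".pt"))  # one [N, 2048] tensor per slide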

Transcriptomics Data

We provide the processed transcriptomics data for TCGA-BRCA, TCGA-NSCLC, TCGA-COADREAD, and TCGA-RCC on Kaggle, Hugging Face, and Zenodo.

For custom data preparation:

  • Download the transcriptomics data tcga_RSEM_isoform_fpkm.gz and mapping table probeMap_gencode.v23.annotation.transcript.probemap from Xena.
  • After unzipping the transcriptomics file, you will get a TSV file named tcga_RSEM_isoform_fpkm. Put the extracted file and the mapping table into the ./input/raw_rna_features/ directory. Tip: we strongly recommend converting tcga_RSEM_isoform_fpkm from TSV to Apache Parquet with pandas, setting the first column as the index; this speeds up processing and is compatible with our script (see the snippet after this list).
  • Download the disease-related genes from the COSMIC database and put them under ./input/raw_rna_features/[cohort].
  • Use ./tools/distill_rna_feature.py to generate the pruned transcriptomics features.
python ./tools/distill_rna_feature.py --cohort [cohort] \
 --cosmic-genes [cosmic_file_name] \
 --wsi-feature-root ./input/wsi_feature/[backbone]/TCGA_FEATURE \
 --classes [class0_in_cohort] [class1_in_cohort] ...
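
The Parquet conversion tip above takes only a couple of lines of pandas (the output file name here is our assumption; pyarrow or fastparquet must be installed for to_parquet):

import pandas as pd

# Load the Xena TSV with the first column (transcript IDs) as the index,
# then write it back out as Apache Parquet for much faster loading.
df = pd.read_csv("./input/raw_rna_features/tcga_RSEM_isoform_fpkm", sep="\t", index_col=0)
df.to_parquet("./input/raw_rna_features/tcga_RSEM_isoform_fpkm.parquet")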

Survival Analysis Data

You can collect the data from cBioPortal and place it under ./input/survival.

Pre-Training

We provide two types of pre-training scripts: train_pretrain.py is a vanilla CLIP-style training script, and train_mirror.py trains our method. Both scripts adopt a YAML-based configuration system; configuration templates live in ./configs, and you can modify them to suit your needs.

We also provide two ways to launch these scripts. One is via the bash scripts in the ./scripts directory.

./scripts/run_train_pretrain.sh <nnodes> <nproc_per_node> <rdzv_backend> <rdzv_endpoint> <config_file> <fold_nb> [additional_args...]
./scripts/run_train_mirror.sh <nnodes> <nproc_per_node> <rdzv_backend> <rdzv_endpoint> <config_file> <fold_nb> [additional_args...]

nnodes, nproc_per_node, rdzv_backend, and rdzv_endpoint are the standard torchrun parameters for distributed training; see the torchrun documentation for details. config_file is the path to the configuration file, fold_nb is the fold number for cross-validation, and additional_args are additional keyword arguments passed to the training scripts.
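
For example, a single-node run on four GPUs with a local rendezvous endpoint might look like the following (the endpoint, config path, and fold number are placeholders):

./scripts/run_train_mirror.sh 1 4 c10d localhost:29500 ./configs/[config_file].yaml 0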

Another way is through a simple job-manager Python script, ./tools/pretrain_job_launcher.py, which automatically collects pre-training jobs and manages GPU resources.

python ./tools/pretrain_job_launcher.py --gpu-count [number_of_gpus] \
 --virtual-gpu-count [number_of_jobs_per_gpu] \
 --pretrain-launch-script [pretrain_launch_script] \
 --pretrain-config [config_file]
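
For reference, the scheduling idea behind such a launcher is a simple GPU-slot job queue. The sketch below is a generic illustration of that pattern (the job list and script arguments are hypothetical), not the launcher's actual implementation:

import os
import subprocess
from queue import Empty, Queue
from threading import Thread

GPU_COUNT = 2      # analogous to --gpu-count
JOBS_PER_GPU = 1   # analogous to --virtual-gpu-count

# Each job is a shell command; here, one pre-training run per cross-validation fold.
jobs: Queue[str] = Queue()
for fold in range(5):
    jobs.put(f"./scripts/run_train_mirror.sh 1 1 c10d localhost:29500 ./configs/mirror.yaml {fold}")

def worker(gpu_id: int) -> None:
    """Drain the queue, pinning each job to one GPU via CUDA_VISIBLE_DEVICES."""
    while True:
        try:
            cmd = jobs.get_nowait()
        except Empty:
            return
        env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu_id)}
        subprocess.run(cmd, shell=True, env=env)
        jobs.task_done()

# One worker thread per (virtual) GPU slot.
threads = [Thread(target=worker, args=(slot % GPU_COUNT,))
           for slot in range(GPU_COUNT * JOBS_PER_GPU)]
for t in threads:
    t.start()
for t in threads:
    t.join()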

Downstream Tasks Evaluation

The downstream task evaluation scripts also adopt a YAML-based configuration system; you can find the templates in ./configs. As with pre-training, there are two ways to launch them: via the bash scripts or via the job launcher.

./scripts/run_train_subtyping.sh <nnodes> <nproc_per_node> <rdzv_backend> <rdzv_endpoint> <config_file> <fold_nb> [checkpoint] [additional_args...]
./scripts/run_train_survival.sh <nnodes> <nproc_per_node> <rdzv_backend> <rdzv_endpoint> <config_file> <fold_nb> [checkpoint] [additional_args...]

checkpoint is an optional positional argument specifying the pre-trained weights to load.
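
For example, a hypothetical single-GPU subtyping run that loads pre-trained weights (all paths are placeholders):

./scripts/run_train_subtyping.sh 1 1 c10d localhost:29500 ./configs/[subtyping_config].yaml 0 [path_to_checkpoint]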

The job launcher ./tools/downstream_tasks_evaluator.py can simultaneously manage both subtyping and survival analysis tasks.

python ./tools/downstream_tasks_evaluator.py --gpu-count [number_of_gpus] \
 --virtual-gpu-count [number_of_jobs_per_gpu] \
 --result-dir [pretrain_result_directory] \
 --checkpoint-file [pretrain_checkpoint_file_name] \
 --subtyping-linprob-config [subtyping_linear_probing_config_file] \
 --subtyping-10shot-config [subtyping_10shot_config_file] \
 --survival-linprob-config [survival_linear_probing_config_file] \
 --survival-10shot-config [survival_10shot_config_file]

--result-dir and --checkpoint-file are optional; if specified, the pre-trained weights are loaded automatically.

Miscellaneous Tools

We also provide a number of miscellaneous tools to support your workflow.

  • ./tools/gen_splits.py is used to generate 5-fold cross-validation splits for each cohort.
python ./tools/gen_splits.py --root [path_to_wsi_features] \
 --class-name [TCGA_class_name]
  • ./tools/gen_few_shot_files.py is used to generate 5-fold cross-validation few-shot splits.
python ./tools/gen_few_shot_files.py --class-name [TCGA_class_name] \
 --survival-wsi-feature-dir [path_to_wsi_features_with_cohort] \
 --subyping-wsi-feature-dir [path_to_wsi_features_without_cohort] \
 --subyping-classes [class0_in_cohort] [class1_in_cohort] ... \
 --rna-feature-csv [path_to_pruned_rna_features] \
 --survival-csv [path_to_survival_data] \
 --split-dir [path_to_5foldcv_splits]
  • ./tools/split_subtypes.py is used to split the features by subtypes within a cohort.
python ./tools/split_subtypes.py --input-folder [path_to_wsi_features] \
 --oncotree-code-csv [path_to_survival_data] \
 --target-oncotree-codes [subtype1_oncotree_code] [subtype2_oncotree_code] ...
  • ./tools/split_weights.py is used to split the pre-trained weights into histopathology and transcriptomics parts.
python ./tools/split_weights.py --result-dir [path_to_pretrain_result] \
 --weight-file [weight_file_name]

Linting Code

We use lintrunner to check our code. Run the following command to lint all files and apply suggested fixes:

lintrunner --all-files -a

Contact

For any inquiries or if you encounter issues, please feel free to contact us or open an issue.

License

This project is released under the GNU General Public License v3.0. Please see the LICENSE file for more information.

Citation

If you find our work useful, please cite it using the following BibTeX entry:

@article{wang2025mirror,
  author={Wang, Tianyi and Fan, Jianan and Zhang, Dingxin and Liu, Dongnan and Xia, Yong and Huang, Heng and Cai, Weidong},
  journal={IEEE Transactions on Medical Imaging}, 
  title={MIRROR: Multi-Modal Pathological Self-Supervised Representation Learning via Modality Alignment and Retention}, 
  year={2025},
  volume={},
  number={},
  pages={1-1},
  doi={10.1109/TMI.2025.3632555}}

Developed with ❤️ by Tianyi Wang @ The University of Sydney
