DOI Latest release License: MIT

Computational Notebooks for "Morphology-Aware Profiling of Highly Multiplexed Tissue Images using Variational Autoencoders"

Gregory J. Baker1,2,3,&,*,#, Edward Novikov1,4,*, Shannon Coy1,2,5, Yu-An Chen1,2, Clemens B. Hug1, Zergham Ahmed1,4, Sebastián A. Cajas Ordóñez4, Siyu Huang4,%, Clarence Yapp1, Gaurav N. Joshi6, Fumiki Yanagawa6, Artem Sokolov1, Hanspeter Pfister4, Peter K. Sorger1,2,3,#

1 Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA
2 Ludwig Center for Cancer Research at Harvard, Harvard Medical School, Boston, MA
3 Department of Systems Biology, Harvard Medical School, Boston, MA
4 Harvard John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, MA
5 Department of Pathology, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA
6 Nikon Instruments, Lexington, MA

& Current affiliation: Division of Oncological Sciences, Knight Cancer Institute, Oregon Health & Science University, Portland, OR
% Current affiliation: Visual Computing Division, School of Computing, Clemson University, Clemson, SC

*Co-first Authors: G.J.B., E.N.
#Corresponding Authors: bakergr@ohsu.edu (G.J.B.), peter_sorger@hms.harvard.edu (P.K.S.)

Abstract

Spatial proteomics (highly multiplexed tissue imaging) provides unprecedented insight into the types, states, and spatial organization of cells within preserved tissue environments. To enable single-cell analysis, high-plex images are typically segmented using algorithms that assign marker signals to individual cells. However, conventional segmentation is often imprecise and susceptible to signal spillover between adjacent cells, interfering with accurate cell type identification. Segmentation-based methods also fail to capture the morphological detail that histopathologists rely on for disease diagnosis and staging. Here, we present a method that combines unsupervised, pixel-level machine learning using autoencoders with traditional segmentation to generate single-cell data that captures information on protein abundance, morphology, and local neighborhood in a manner analogous to human experts while overcoming signal spillover. We demonstrate the generality of this technique by applying it to CyCIF, Lunaphore COMET, and Akoya PhenoCycler data, and show that it can learn histological features across multiple spatial scales.

Running the computational notebooks

Python code in this GitHub repository is organized as Jupyter notebooks that generate the figures shown in the paper. To view the notebooks, first clone the repository onto your computer by opening a terminal window and entering the command below. If Git is not already installed, it can be downloaded by following the official installation instructions.

git clone https://github.com/labsyspharm/vae-paper.git

Next, change directories into the top-level directory of the cloned repository, then create and activate a dedicated Conda environment containing the Python libraries needed to run the code. If conda is not already installed, it can be downloaded by following the official installation instructions.

cd <path/to/cloned/repo>

# macOS
conda env create -f environment_macOS.yml
conda activate morphaeus

# PC
conda env create -f environment_PC.yml
conda activate morphaeus
pip install git+https://github.com/labsyspharm/vae.git@v0.0.7

To browse the notebooks, change directories to the src folder and launch Jupyter Lab:

jupyter lab

Notebooks are pre-populated with output cells for ease of review. To re-run the notebooks, or to explore the multiplexed images that some notebooks display in the Napari image viewer, the input data must first be downloaded from our public Amazon S3 bucket (instructions are provided in the section below).


Downloading input data files

To re-run the Jupyter notebooks, input data must first be downloaded from our public Amazon S3 bucket into the top-level directory of the cloned repository by running the download.py script (located in the src folder) from the top level of the repository. In addition to the required data, this script also downloads a folder of precomputed output files (output_reference) for at-a-glance reference:

# from the top-level directory of the cloned vae-paper GitHub repository
python src/download.py

Note: ~335 GB of storage space is required to download the complete file set.

To re-run any of the Jupyter notebooks, double-click a notebook filename in the file browser at the left of the screen to open the corresponding notebook at the right. Next, click the double-arrow button at the top of the notebook interface to restart the kernel and run all of the code cells. Notebook output is saved to a folder called output in the top-level directory of the repository.
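
If you prefer to execute a notebook non-interactively, Jupyter's nbconvert tool can run all cells from the command line and save the results in place. This is a general-purpose alternative to the JupyterLab interface rather than a step documented by the pipeline itself; replace <notebook>.ipynb with the filename of the notebook you want to run:

# optional: headless execution of a notebook from the src folder
jupyter nbconvert --to notebook --execute --inplace <notebook>.ipynb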


MORPHÆUS source code and demo

MORPHÆUS source code is freely available for academic re-use under the MIT license on GitHub.

To run a demonstration of the MORPHÆUS pipeline, be sure that input data files have first been downloaded as described above, then change directories to the demo directory in the cloned repository and run the following command:

vae config.yml

This will execute the pipeline on 13 x 13 µm image patches from the CyCIF-1A image presented in the paper, demonstrating all major modules, from single-cell sampling and image patch cropping to VAE model training, plot visualization, and concept saliency analysis. Depending on the size of the images, the cutting and storage of image patches generated in the RUN_CELLCUTTER module can be memory-limiting; a minimum of 32 GB of RAM is required to run this demo without having to alter the cache_size_cellcutter and cells_per_chunk parameters in the MORPHÆUS configuration file (config.yml). Output is saved to demo/VAE13/.
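
If you do need to reduce the memory footprint on a smaller machine, the relevant settings can be located in the configuration file before editing. A minimal shell sketch, assuming the parameter names above appear verbatim in config.yml:

# show the current memory-related settings in the demo configuration
grep -E 'cache_size_cellcutter|cells_per_chunk' config.yml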

For convenience, lightly pre-trained encoder and decoder networks are provided so that the pipeline skips the VAE training module. For those interested in training a model from scratch, simply add a # to the beginning of the encoder.hdf5 and decoder.hdf5 filenames in demo/VAE13/6_train_vae/ before running the pipeline; do the same for the TRAIN_VAE.txt checkpoint file in demo/VAE13/checkpoints/. When training on a relatively modern CPU, each epoch takes an estimated 5 minutes to complete; training can be accelerated greatly using GPU resources.
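
For example, the following shell commands, run from the top-level directory of the cloned repository, rename those files accordingly (a sketch assuming the paths above; adjust if your checkout differs):

# disable the pre-trained weights and the training checkpoint so TRAIN_VAE runs from scratch
mv demo/VAE13/6_train_vae/encoder.hdf5 'demo/VAE13/6_train_vae/#encoder.hdf5'
mv demo/VAE13/6_train_vae/decoder.hdf5 'demo/VAE13/6_train_vae/#decoder.hdf5'
mv demo/VAE13/checkpoints/TRAIN_VAE.txt 'demo/VAE13/checkpoints/#TRAIN_VAE.txt'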


Zenodo archive

This GitHub repository will be archived on Zenodo following publication of the manuscript.


Funding

This work was supported by NCI grant U01-CA284207, the Harvard Ludwig Center (P.K.S., S.S.), an ASPIRE Award from The Mark Foundation for Cancer Research, and the David Liposarcoma Research Initiative, and was initiated as part of the computational toolbox for the Human Tumor Atlas Network (HTAN).


References

Baker GJ, Novikov E, et al. Morphology-Aware Profiling of Highly Multiplexed Tissue Images using Variational Autoencoders. bioRxiv (2025). https://doi.org/10.1101/2025.06.23.661064
