Mapping between features for MS-proteomics support #111

@lucas-diedrich

Hi, great package!


I'm a member of the scverse proteomics interest group (led by @mikelkou) and, together with @vbrennsteiner, co-developer of alphapepttools, which uses anndata as its central data structure for exploratory downstream analysis of LC/MS proteomics data.

Within the proteomics interest group we observed that many properties of MuData would be very useful for the analysis of MS proteomics data. However, MuData currently lacks a first-class way to represent and query explicit relationships between feature indices across modalities.

Why?

In LC/MS proteomics, we have a naturally arising hierarchical feature structure:

  • At the lowest level, mass spectrometry (MS) instruments detect and quantify fragments of charged peptides (precursors). The precursor-level data is relatively large (N samples x ~100,000 features).
  • Proteomics search engines identify the precursor sequences and match them to their corresponding proteins. Ultimately, search engines derive protein-level intensities (N samples x 3,000-10,000 features). Note that there is an N:M relationship between precursors and proteins: many precursors can map to one protein, and a precursor can sometimes be derived from several (homologous) proteins.
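
As a rough illustration of the N:M relationship described above, the mapping can be thought of as a list of (precursor, protein) pairs. The sketch below uses only the Python standard library. All identifiers in it (precursor names, protein names, variable names) are made up for illustration and do not come from any search engine output or from the mudata API.

```python
from collections import defaultdict

# Hypothetical (precursor, protein) pairs; a real table would come from
# a search engine report.
edges = [
    ("PEPTIDEK_2", "PROT_A"),
    ("PEPTIDEK_3", "PROT_A"),
    ("SHAREDSEQR_2", "PROT_A"),
    ("SHAREDSEQR_2", "PROT_B"),  # shared precursor: maps to two proteins
]

# Build both directions of the many-to-many mapping.
precursor_to_proteins = defaultdict(set)
protein_to_precursors = defaultdict(set)
for precursor, protein in edges:
    precursor_to_proteins[precursor].add(protein)
    protein_to_precursors[protein].add(precursor)

# Precursors that cannot be assigned to a single protein.
shared = sorted(p for p, prots in precursor_to_proteins.items() if len(prots) > 1)

print(shared)
print(sorted(protein_to_precursors["PROT_A"]))
```

Keeping both directions makes "many precursors per protein" and "several proteins per precursor" equally cheap to look up, at the cost of storing every link twice.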

While the protein intensities are most commonly used for the biological interpretation of the data, many analyses depend on the knowledge of the precursor or peptide-level intensities. This includes, for example,

  • software packages for differential expression testing (e.g. alphaquant from our lab)
  • peptide-level assays
  • basic quality control and data inspection tasks.

Currently, we can read any of the desired feature levels (precursors < peptides < proteins < genes) into anndata objects, and we can also combine them into mudata objects. However, mudata does not currently support efficient mapping or querying of features between these levels of evidence to answer questions like:

  • "Which protein does the peptide with the sequence "PETPIDE" belong to?"
  • "Plot a boxplot of the intensity distributions of precursors that map to the protein MAPK."
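
To make the two example questions concrete, here is a minimal sketch of what answering them by hand could look like today. It assumes plain Python lists and dictionaries as stand-ins for the precursor modality and for the links between levels. The function names, feature names, and intensity values are hypothetical and are not part of mudata or anndata.

```python
# Hypothetical stand-in for a precursor modality: feature names plus a
# samples x features intensity matrix.
precursor_names = ["PETPIDE_2", "PETPIDE_3", "OTHERSEQ_2"]
precursor_X = [
    [10.0, 12.0, 3.0],  # sample 1
    [11.0, 13.0, 4.0],  # sample 2
]

# Hypothetical links: precursor -> peptide and peptide -> proteins.
precursor_to_peptide = {
    "PETPIDE_2": "PETPIDE",
    "PETPIDE_3": "PETPIDE",
    "OTHERSEQ_2": "OTHERSEQ",
}
peptide_to_proteins = {"PETPIDE": ["MAPK"], "OTHERSEQ": ["PROT_X"]}


def proteins_for_peptide(sequence):
    """Return the proteins linked to a peptide sequence (empty list if unknown)."""
    return peptide_to_proteins.get(sequence, [])


def precursor_intensities_for_protein(protein):
    """Collect the intensity column of every precursor that maps to `protein`."""
    result = {}
    for col, name in enumerate(precursor_names):
        peptide = precursor_to_peptide[name]
        if protein in peptide_to_proteins.get(peptide, []):
            result[name] = [row[col] for row in precursor_X]
    return result


print(proteins_for_peptide("PETPIDE"))
print(precursor_intensities_for_protein("MAPK"))
```

The returned columns could then be handed to any plotting routine to draw the boxplot from the second question. The point of the sketch is that every tool currently has to carry such lookup tables alongside the data containers itself.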

Suggested feature

What would simplify the development of proteomics tools in the scverse ecosystem is an explicit and queryable way to link features across modalities within a MuData object — for example, a directed acyclic graph (DAG) or similar abstraction that formalizes relationships between variable indices across containers.
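
One possible shape for such an abstraction is sketched below, purely as an illustration. The `FeatureGraph` class, its methods, and the feature names are hypothetical; they are not an existing or proposed mudata API. Nodes are `(modality, feature)` tuples, and edges point from lower to higher levels of evidence.

```python
from collections import defaultdict, deque


class FeatureGraph:
    """Hypothetical directed graph over (modality, feature) nodes."""

    def __init__(self):
        self._children = defaultdict(set)

    def add_edge(self, source, target):
        """Link a lower-level feature (source) to a higher-level feature (target)."""
        self._children[source].add(target)

    def reachable(self, start, modality):
        """Return names of all features in `modality` reachable from `start`."""
        seen = {start}
        queue = deque([start])
        hits = set()
        while queue:  # breadth-first traversal along directed edges
            node = queue.popleft()
            for nxt in self._children.get(node, ()):
                if nxt in seen:
                    continue
                seen.add(nxt)
                queue.append(nxt)
                if nxt[0] == modality:
                    hits.add(nxt[1])
        return hits


graph = FeatureGraph()
graph.add_edge(("precursor", "PETPIDE_2"), ("peptide", "PETPIDE"))
graph.add_edge(("precursor", "PETPIDE_3"), ("peptide", "PETPIDE"))
graph.add_edge(("peptide", "PETPIDE"), ("protein", "MAPK"))
graph.add_edge(("protein", "MAPK"), ("gene", "GENE_1"))

print(graph.reachable(("precursor", "PETPIDE_2"), "protein"))
print(graph.reachable(("precursor", "PETPIDE_2"), "gene"))
```

This sketch only answers "upward" queries. Going from a protein back to its precursors would need a reverse adjacency map or a traversal over reversed edges, and a real implementation might prefer sparse matrices between pairs of modalities over an explicit graph object.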

Current state

There are already ongoing efforts and suggestions in this direction.

We are opening this issue to:

  1. Ask whether such functionality would be considered in scope for mudata itself (or whether it is better suited for an extension).
  2. If so, discuss possible implementation approaches (e.g. those suggested by @mffrank or @ilan-gold).
  3. Avoid redundant or fragmented efforts across projects.

While we focus here on LC/MS proteomics, similar hierarchical or many-to-many feature relationships also arise in other multi-modal settings (e.g. ATAC-RNA, Protein-RNA, ...) so we think that this would generally be a super useful feature.

We would be really happy to help enable this feature in any way we can. Please let us know how best to contribute.


@mikelkou @josenimo @LucaMarconato @melonora Theodoros Visvikis @mffrank @ilan-gold @soroorh

Labels: enhancement
