Mapping between features for MS-proteomics support #111

Hi, great package!

---  
I’m a member of the `scverse` proteomics interest group (led by @mikelkou) and, together with @vbrennsteiner, co-developer of [alphapepttools](https://github.com/MannLabs/alphapepttools.git), which uses anndata as its central data structure for exploratory downstream analysis of LC/MS-proteomics data.

Within the proteomics interest group we observed that many properties of MuData would be very useful for the analysis of MS proteomics data. However, **MuData currently lacks a first-class way to represent and query explicit relationships between feature indices across modalities**.


## Why? 
In LC/MS-proteomics, we have a naturally arising hierarchical feature structure:

- At the lowest level, the mass spectrometry (MS) instrument detects and quantifies fragments of charged peptides (**precursors**). The precursor-level data is relatively large (`N samples x ~100 000 features`).
- **Proteomics search engines** identify the precursor sequences and match them to their corresponding proteins. Ultimately, search engines derive **protein-specific intensities** (`N samples x 3000-10000 features`). Note that there is an N:M relationship between precursors and proteins: many precursors can map to one protein, and a single precursor can sometimes be derived from several (homologous) proteins.

While protein intensities are most commonly used for the biological interpretation of the data, many analyses depend on the precursor- or peptide-level intensities. This includes, for example:
- software packages for differential expression testing, e.g. [alphaquant](https://github.com/MannLabs/alphaquant.git) from our lab
- [peptide-level assays](https://mannlabs.github.io/alphapepttools/notebooks/04_differential_expression.html)
- basic quality control and data inspection tasks

Currently, we can read any of the desired feature levels (precursors < peptides < proteins < genes) [into anndata objects](https://mannlabs.github.io/alphapepttools/api.html#io), and we can also bind them to [mudata objects](https://alphabase.readthedocs.io/en/latest/tutorials/tutorial_scverse_compatibility.html#Bind-precursor-and-protein-data-in-a-single-container). However, mudata does not yet support efficient mapping or querying of features between these layers of evidence to handle requests like:

- "Which protein does the peptide with the sequence `PEPTIDE` belong to?"
- "Plot a boxplot of the intensity distributions of the precursors that map to the protein MAPK."
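
Until such support exists, requests like these can be handled with a mapping table kept next to the modalities. The following is only a minimal sketch using plain pandas, not an existing mudata API: the table layout, column names, helper functions, feature names, and all intensity values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy precursor-level intensities (samples x precursors); all values are invented.
precursor_intensities = pd.DataFrame(
    np.array([[10.0, 12.0, 7.0, 3.0], [11.0, 13.0, 8.0, 4.0]]),
    index=["sample_1", "sample_2"],
    columns=["PEPTIDEK_2", "PEPTIDEK_3", "ELVISK_2", "SHAREDK_2"],
)

# Long-format edge table: one row per (precursor, protein) link.
# "SHAREDK_2" is linked to two proteins, so the relationship is N:M.
precursor_to_protein = pd.DataFrame(
    {
        "precursor": ["PEPTIDEK_2", "PEPTIDEK_3", "ELVISK_2", "SHAREDK_2", "SHAREDK_2"],
        "protein": ["PROT_A", "PROT_A", "PROT_A", "PROT_A", "PROT_B"],
    }
)


def proteins_for_precursor(precursor: str) -> list[str]:
    """Return all proteins linked to one precursor (hypothetical helper)."""
    mask = precursor_to_protein["precursor"] == precursor
    return sorted(set(precursor_to_protein.loc[mask, "protein"].tolist()))


def precursors_for_protein(protein: str) -> list[str]:
    """Return all precursors linked to one protein (hypothetical helper)."""
    mask = precursor_to_protein["protein"] == protein
    return sorted(set(precursor_to_protein.loc[mask, "precursor"].tolist()))


def intensities_for_protein(protein: str) -> pd.DataFrame:
    """Subset the precursor-level matrix to the precursors of one protein."""
    return precursor_intensities.loc[:, precursors_for_protein(protein)]


print(proteins_for_precursor("SHAREDK_2"))
print(intensities_for_protein("PROT_B"))
```

With matplotlib installed, `intensities_for_protein("PROT_A").boxplot()` would give the per-precursor boxplot from the second request. The drawback is that the table lives outside the container, so subsetting or renaming features in one modality silently leaves it out of sync, which is the gap this issue is about.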

## Suggested feature
What would simplify the development of proteomics tools in the scverse ecosystem is an explicit and queryable way to link features across modalities within a MuData object — for example, a directed acyclic graph (DAG) or similar abstraction that formalizes relationships between variable indices across containers.
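
To make the idea more concrete, here is a rough, standard-library-only sketch of what such an abstraction could look like. `FeatureGraph`, `add_edge`, `map_up`, and `map_down` are hypothetical names chosen for illustration and are not part of mudata or of any existing proposal. The sketch also does not check that the graph is acyclic, and the feature names are toy examples.

```python
from collections import defaultdict, deque

# A node is (modality name, feature name), e.g. ("precursor", "PEPTIDEK_2").
Node = tuple[str, str]


class FeatureGraph:
    """Hypothetical helper: directed links between features of different modalities."""

    def __init__(self) -> None:
        self._parents: dict[Node, set[Node]] = defaultdict(set)
        self._children: dict[Node, set[Node]] = defaultdict(set)

    def add_edge(self, child: Node, parent: Node) -> None:
        """Link a lower-level feature (child) to a higher-level feature (parent)."""
        self._parents[child].add(parent)
        self._children[parent].add(child)

    @staticmethod
    def _walk(start: Node, neighbours: dict[Node, set[Node]], modality: str) -> list[str]:
        """Breadth-first traversal collecting reachable features of one modality."""
        seen, queue, hits = {start}, deque([start]), set()
        while queue:
            node = queue.popleft()
            for nxt in neighbours.get(node, ()):
                if nxt in seen:
                    continue
                seen.add(nxt)
                queue.append(nxt)
                if nxt[0] == modality:
                    hits.add(nxt[1])
        return sorted(hits)

    def map_up(self, node: Node, modality: str) -> list[str]:
        """Features of `modality` reachable towards higher levels (e.g. proteins)."""
        return self._walk(node, self._parents, modality)

    def map_down(self, node: Node, modality: str) -> list[str]:
        """Features of `modality` reachable towards lower levels (e.g. precursors)."""
        return self._walk(node, self._children, modality)


graph = FeatureGraph()
graph.add_edge(("precursor", "PEPTIDEK_2"), ("peptide", "PEPTIDEK"))
graph.add_edge(("precursor", "PEPTIDEK_3"), ("peptide", "PEPTIDEK"))
graph.add_edge(("peptide", "PEPTIDEK"), ("protein", "PROT_A"))
graph.add_edge(("peptide", "PEPTIDEK"), ("protein", "PROT_B"))

print(graph.map_up(("precursor", "PEPTIDEK_2"), "protein"))
print(graph.map_down(("protein", "PROT_A"), "precursor"))
```

Keeping edges only between adjacent levels (precursor to peptide, peptide to protein) means that queries spanning several levels fall out of the traversal, without storing a separate precursor-to-protein table.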

## Current state
There are already ongoing efforts and suggestions in this direction:
- @mffrank implemented a proof-of-principle DAG mapping for mudata.
- @ilan-gold suggested using [pandas extension index arrays](https://pandas.pydata.org/docs/reference/api/pandas.api.extensions.ExtensionArray.html) to achieve this behavior.

We are opening this issue to:
1. ask whether such functionality would be considered in scope for `mudata` itself (or whether it is better suited for an extension),
2. if so, discuss possible implementation approaches (e.g. those suggested by @mffrank or @ilan-gold), and
3. avoid redundant or fragmented efforts across projects.

While we focus here on LC/MS proteomics, similar hierarchical or many-to-many feature relationships also arise in other multi-modal settings (e.g. ATAC-RNA, Protein-RNA, ...), so we think this would be a super useful feature in general.

We would be really happy to help enable this feature in any way. Please let us know the best way to contribute.

---
@mikelkou @josenimo @LucaMarconato @melonora Theodoros Visvikis @mffrank @ilan-gold @soroorh
