Infrastructure for facilitating variant effect prediction using machine learning. Data class to combine assay, sequence, MSA and structure data to a coherent data package. Model cards for supervised and zeroshot ML models.

ProteinGym/proteingym-base

ProteinGym Base

ProteinGym has become a widely used resource for comparing how well models predict the effects of protein mutations. ProteinGym Base aims to bring more structure to the datasets, so that they become easier to work with and to distribute, and to the models in the benchmark, so that they become easier to install and run. That means fewer undocumented CSV files and more formal schemas, and fewer scripts with hardcoded paths and more Dockerfiles and precise requirements.

Project Structure

├── docs/                      <-- Folder with documentation
│   ├── decisions/             <-- Architecture decision records
│   └── *.md                   <-- Other documentation files
├── example_data/              <-- Example data folder
│   ├── NEIME_2019/            <-- NEIME 2019 dataset
│   └── neime_2019.toml        <-- NEIME 2019 manifest file
├── notebooks/                 <-- Jupyter notebooks
│   └── *.ipynb                <-- Demonstration notebooks
├── src/                       <-- Source code folder
│   └── proteingym/            <-- Namespace package
│       └── base/              <-- Package
├── tests/                     <-- Test folder
│   └── test_*.py              <-- Test files
├── .adr-dir                   <-- File pointing to the architecture decision records folder
├── .gitignore                 <-- Git ignore file
├── .pre-commit-config.yaml    <-- Pre-commit configuration file
├── .python-version            <-- Python version file
├── CONTRIBUTING.md            <-- Contribution guide
├── pyproject.toml             <-- Project configuration file
├── README.md                  <-- This README file
└── uv.lock                    <-- Dependency lock file

Dataset Manifest

The dataset manifest is a configuration file that describes the dataset metadata and assets:

  • Assays
  • Structures
  • Sequences
  • MSAs (Multiple Sequence Alignments)

The full schema of the manifest is described in the schema documentation. The example code below uses the NEIME 2019 dataset.

Installation

To install the package, you can use pip:

$ pip install git+https://github.com/ProteinGym/proteingym-base.git

Quickstart example

Below is a quickstart example of how to use this package.

Load data

You can load the data using a manifest file. In the example code below we load the NEIME 2019 dataset manifest:

>>> from proteingym.base import Dataset, Manifest
>>> manifest = Manifest.from_path("example_data/neime_2019.toml")
>>> manifest.name
'NEIME_2019'
>>> dataset = Dataset.from_manifest(manifest)
>>> len(dataset.assays) > 0 and len(dataset.structures) > 0
True

This will gather data from the locations specified in the manifest into a single Dataset object. You can then use its data for model training or prediction.

Archive data

You can persist data in a ProteinGym archive for easy sharing and reloading.

>>> archive_path = dataset.dump(path="example_data/")
>>> archive_path.is_file() and archive_path.stat().st_size > 0  # The archive contains the dataset
True

Load archived data

You can quickly load the archived data:

>>> persisted_dataset = Dataset.from_path(archive_path)
>>> persisted_dataset.name
'NEIME_2019'
>>> archive_path.unlink()  # (FOR TESTING PURPOSES ONLY: remove the archive file for cleanup)

Access ProteinGym data

The Dataset object provides access to the ProteinGym data:

  • Assays
  • Sequences
  • Structures
  • MSAs (Multiple Sequence Alignments)

Multiple Sequence Alignment (MSA)

To load MSA data, add the following section to the TOML manifest:

[[ msas ]]
path = "example_data/v2/A0A1I9GEU1_NEIME_Kennouche_2019/msa.fasta"
format = "fasta"

To get the MSA as a Biopython MultipleSeqAlignment object:

>>> from proteingym.base import Manifest, Dataset
>>> from Bio.Align import MultipleSeqAlignment
>>> mf = Manifest.from_path("example_data/neime_2019.toml")
>>> dataset = Dataset.from_manifest(mf)
>>> isinstance(dataset.msas[0].value, MultipleSeqAlignment)  # The first MSA in the dataset
True
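For readers unfamiliar with the underlying file, the sketch below shows what an aligned FASTA file contains and one simple per-column statistic. It is a plain-Python illustration that uses neither `proteingym.base` nor Biopython; the helper `read_aligned_fasta` and the toy sequences are made up for this example and are not part of the package or the dataset.

```python
from io import StringIO


def read_aligned_fasta(handle):
    """Return {record_id: aligned_sequence} from an aligned FASTA stream.

    Hypothetical helper for illustration only.
    """
    records = {}
    current_id = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record id is the first whitespace-delimited token of the header.
            current_id = line[1:].split()[0]
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines may be wrapped, so concatenate them.
            records[current_id] += line
    return records


# Made-up toy alignment; "-" marks a gap.
TOY_MSA = """>query
MKT-LV
>homolog_1
MRTALV
>homolog_2
MK--LV
"""

alignment = read_aligned_fasta(StringIO(TOY_MSA))

# In an alignment every row has the same number of columns.
row_lengths = {len(seq) for seq in alignment.values()}
assert len(row_lengths) == 1
n_columns = row_lengths.pop()

# Fraction of gap characters in each alignment column.
gap_fraction = [
    sum(seq[i] == "-" for seq in alignment.values()) / len(alignment)
    for i in range(n_columns)
]
```

With a MultipleSeqAlignment object, the same kind of row and column access is available through Biopython itself, so a hand-written parser like this is only useful for understanding the data layout.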

Example Data

The NEIME Kennouche 2019 dataset (UniProt ID: A0A1I9GEU1) is used as an example. It is stored in example_data/NEIME_2019 and contains the files shown in the tree below.

Note

AssayMeta and DatasetMeta are only examples of metadata tags one might want to record. The values currently in these files are not related to the actual dataset and were not obtained from official sources.

Note

Assay.csv also contains a split column and an engineering round column. The engineering round is randomly assigned to 1, 2 or 3 for illustrative purposes; the original assay belongs to a single engineering round. The split column converts fold_random_5 from a k-fold split to a train/val/test split, with folds 0, 1 and 2 in train, fold 3 in val and fold 4 in test.

.
├── example_data
│   └── NEIME_2019
│       ├── A0A1I9GEU1.fasta        # Parent sequence
│       ├── AssayMeta.json          # Example of possible AssayMeta
│       ├── Assays                  
│       │   └── Assay.csv           # Tabular format of assay
│       ├── DataSetMeta.json        # Example of possible DatasetMeta
│       ├── MSA
│       │   ├── msa_weights.npy     # weights file for MSA as obtained from PG1.
│       │   ├── msa.a2m             # MSA file in .a2m format
│       │   ├── msa.a3m             # MSA file in .a3m format
│       │   └── msa.psi             # MSA file in .psi format
│       └── Structures              # 5 types of example structures with different
│           │                       # file types and sources for examples:
│           ├── experimental.cif
│           ├── experimental.bcif
│           ├── experimental.pdb
│           ├── computational.cif
│           └── computational.pdb
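The fold-to-split conversion described in the note on Assay.csv can be sketched in a few lines. This is an illustration of the stated mapping only, not the script that produced the example data; the function name `fold_to_split` and the sample fold values are hypothetical.

```python
# Mapping stated in the note: folds 0, 1, 2 -> train, 3 -> val, 4 -> test.
FOLD_TO_SPLIT = {0: "train", 1: "train", 2: "train", 3: "val", 4: "test"}


def fold_to_split(fold: int) -> str:
    """Map a fold_random_5 index (0 to 4) to a train/val/test label."""
    try:
        return FOLD_TO_SPLIT[fold]
    except KeyError:
        raise ValueError(f"expected a fold index in 0..4, got {fold!r}") from None


# Made-up fold assignments for six variants.
folds = [0, 1, 2, 3, 4, 2]
splits = [fold_to_split(f) for f in folds]
print(splits)  # ['train', 'train', 'train', 'val', 'test', 'train']
```

Because three of the five folds go to train and one each to val and test, this mapping gives roughly a 60/20/20 split when the folds are of similar size.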

For a full overview of available data see the following table:

| Dataset name | Link to website | Relative path to manifest |
|--------------|-----------------|---------------------------|
| 1. NEIME2019 | www.proteingym.org | example_data/neime_2019.toml |
