ProteinGym Base

ProteinGym has become a widely used resource for comparing the ability of models to predict the effects of protein mutations. ProteinGym Base is an effort to bring more structure to the datasets, so that they become easier to work with and to distribute, and to the models in the benchmark, so that they become easier to install and run. That means fewer undocumented CSVs and more formal schemas, and fewer scripts with hardcoded paths and more Dockerfiles and precise requirements.

Project Structure

├── docs                       <-- Folder with documentation
│   ├── decisions/             <-- Architecture decision records
│   └── *.md                   <-- Other documentation files
├── example_data/              <-- Example data folder
│   ├── NEIME_2019/            <-- NEIME 2019 dataset
│   └── neime_2019.toml        <-- NEIME 2019 manifest file
├── notebooks/                 <-- Jupyter notebooks
│   └── *.ipynb                <-- Demonstration notebooks
├── src/                       <-- Source code folder
│   └── proteingym/            <-- Namespace package
│       └── base/              <-- Package
├── tests/                     <-- Test folder
│   └── test_*.py              <-- Test files
├── .adr-dir                   <-- Architecture decision records folder
├── .gitignore                 <-- Git ignore file
├── .pre-commit-config.yaml    <-- Pre-commit configuration file
├── .python-version            <-- Python version file
├── CONTRIBUTING.md            <-- Contribution guide
├── pyproject.toml             <-- Project configuration file
├── README.md                  <-- This README file
└── uv.lock                    <-- Dependency lock file

Dataset Manifest

The dataset manifest is a configuration file that describes the dataset metadata and assets:

Assays
Structures
Sequences
MSAs (Multiple Sequence Alignments)

The full format of the manifest is described in the schema. The example code below uses the NEIME 2019 dataset.

Installation

To install the package, you can use pip:

$ pip install git+https://github.com/ProteinGym/proteingym-base.git

Quickstart example

Below is a quickstart example of how to use this package.

Load data

You can load the data using a manifest file. In the example code below we load the NEIME 2019 dataset manifest:

>>> from proteingym.base import Dataset, Manifest
>>> manifest = Manifest.from_path("example_data/neime_2019.toml")
>>> manifest.name
'NEIME_2019'
>>> dataset = Dataset.from_manifest(manifest)
>>> len(dataset.assays) > 0 and len(dataset.structures) > 0
True

This will gather data from the locations specified in the manifest into a single Dataset object, which you can then use for model training or prediction.

Archive data

You can persist data in a ProteinGym archive for easy sharing and reloading.

>>> archive_path = dataset.dump(path="example_data/")
>>> archive_path.is_file() and archive_path.stat().st_size > 0  # The archive contains the dataset
True

Load archived data

You can quickly load the archived data:

>>> persisted_dataset = Dataset.from_path(archive_path)
>>> persisted_dataset.name
'NEIME_2019'
>>> archive_path.unlink()  # (FOR TESTING PURPOSES ONLY: remove the archive file for cleanup)

Access ProteinGym data

The Dataset object provides access to the ProteinGym data:

Assays
Sequences
Structures
MSAs (Multiple Sequence Alignments)

Multiple Sequence Alignment (MSA)

To load MSA data, add the following section to the manifest TOML file:

[[ msas ]]
path = "example_data/v2/A0A1I9GEU1_NEIME_Kennouche_2019/msa.fasta"
format = "fasta"

To get the MSA as a MultipleSeqAlignment object:

>>> from proteingym.base import Manifest, Dataset
>>> from Bio.Align import MultipleSeqAlignment
>>> mf = Manifest.from_path("example_data/neime_2019.toml")
>>> dataset = Dataset.from_manifest(mf)
>>> isinstance(dataset.msas[0].value, MultipleSeqAlignment)  # The first MSA in the dataset
True

Example Data

The NEIME Kennouche 2019 (UniProt id: A0A1I9GEU1) dataset is used as an example. This dataset is stored in example_data/NEIME_2019 and contains the following:

Note

AssayMeta and DatasetMeta are just examples of possible metadata tags. The information currently in them is not associated with the dataset and was not obtained from official sources.

Note

Assay.csv also contains a split column and an engineering round column. The engineering round is randomly assigned to 1, 2 or 3 for illustrative purposes; the original assay belongs to a single engineering round. The split column converts fold_random_5 from a k-fold split to a train/val/test split, with folds 0, 1 and 2 in train, fold 3 in val, and fold 4 in test.

.
├── example_data
│   └── NEIME_2019
│       ├── A0A1I9GEU1.fasta        # Parent sequence
│       ├── AssayMeta.json          # Example of possible AssayMeta
│       ├── Assays                  
│       │   └── Assay.csv           # Tabular format of assay
│       ├── DataSetMeta.json        # Example of possible DatasetMeta
│       ├── MSA
│       │   ├── msa_weights.npy     # weights file for MSA as obtained from PG1.
│       │   ├── msa.a2m             # MSA file in .a2m format
│       │   ├── msa.a3m             # MSA file in .a3m format
│       │   └── msa.psi             # MSA file in .psi format
│       └── Structures              # 5 types of example structures with different
│           │                       # file types and sources for examples:
│           ├── experimental.cif
│           ├── experimental.bcif
│           ├── experimental.pdb
│           ├── computational.cif
│           └── computational.pdb
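
The train/val/test conversion described in the note on Assay.csv can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the script that produced the example file: the names `FOLD_TO_SPLIT` and `fold_to_split` are hypothetical, and the input list stands in for the `fold_random_5` values of the assay rows.

```python
# Fold-to-split mapping as described in the note:
# folds 0, 1, 2 -> train, fold 3 -> val, fold 4 -> test.
FOLD_TO_SPLIT = {0: "train", 1: "train", 2: "train", 3: "val", 4: "test"}


def fold_to_split(fold: int) -> str:
    """Map a fold_random_5 index (0-4) to a train/val/test label."""
    try:
        return FOLD_TO_SPLIT[fold]
    except KeyError:
        raise ValueError(f"expected a fold index in 0..4, got {fold!r}") from None


# Hypothetical fold_random_5 values, one per assay row.
folds = [0, 1, 2, 3, 4]
splits = [fold_to_split(fold) for fold in folds]
print(splits)
```

With five equally sized folds, this mapping gives roughly a 60/20/20 train/val/test split.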

For a full overview of available data see the following table:

	Dataset name	Link to website	Relative path to manifest
1.	NEIME2019	www.proteingym.org	example_data/neime_2019.toml
