ProteinGym has become a widely used resource for comparing how well models predict the effects of protein mutations. ProteinGym Base is an effort to bring more structure to the datasets, so that they become easier to work with and to distribute, and to the models in the benchmark, so that they become easier to install and run. Fewer undocumented CSVs and more formal schemas. Fewer scripts with hardcoded paths and more Dockerfiles and precise requirements.
├── docs/                       <-- Folder with documentation
│   ├── decisions/              <-- Architecture decision records
│   └── *.md                    <-- Other documentation files
├── example_data/               <-- Example data folder
│   ├── NEIME_2019/             <-- NEIME 2019 dataset
│   └── neime_2019.toml         <-- NEIME 2019 manifest file
├── notebooks/                  <-- Jupyter notebooks
│   └── *.ipynb                 <-- Demonstration notebooks
├── src/                        <-- Source code folder
│   └── proteingym/             <-- Namespace package
│       └── base/               <-- Package
├── tests/                      <-- Test folder
│   └── test_*.py               <-- Test files
├── .adr-dir                    <-- Architecture decision records folder
├── .gitignore                  <-- Git ignore file
├── .pre-commit-config.yaml     <-- Pre-commit configuration file
├── .python-version             <-- Python version file
├── CONTRIBUTING.md             <-- Contribution guide
├── pyproject.toml              <-- Project configuration file
├── README.md                   <-- This README file
└── uv.lock                     <-- Dependency lock file
The dataset manifest is a configuration file that describes the dataset metadata and assets:
- Assays
- Structures
- Sequences
- MSAs (Multiple Sequence Alignments)
The full structure of the manifest is defined in the schema. The example code below uses the NEIME 2019 dataset.
To install the package, you can use pip:
$ pip install git+https://github.com/ProteinGym/proteingym-base.git
Below is a quickstart example of how to use this package.
You can load the data using a manifest file. In the example code below we load the NEIME 2019 dataset manifest:
>>> from proteingym.base import Dataset, Manifest
>>> manifest = Manifest.from_path("example_data/neime_2019.toml")
>>> manifest.name
'NEIME_2019'
>>> dataset = Dataset.from_manifest(manifest)
>>> len(dataset.assays) > 0 and len(dataset.structures) > 0
True
This will gather data from the locations specified in the manifest into a single
Dataset object, which you can then use for model training or prediction.
You can persist the data in a ProteinGym archive for easy sharing and reloading:
>>> archive_path = dataset.dump(path="example_data/")
>>> archive_path.is_file() and archive_path.stat().st_size > 0 # The archive contains the dataset
True
You can quickly load the archived data:
>>> persisted_dataset = Dataset.from_path(archive_path)
>>> persisted_dataset.name
'NEIME_2019'
>>> archive_path.unlink() # (FOR TESTING PURPOSES ONLY: remove the archive file for cleanup)
The Dataset object provides access to the ProteinGym data:
- Assays
- Sequences
- Structures
- MSAs (Multiple Sequence Alignments)
To load MSA data, add the following section to the TOML manifest:
[[ msas ]]
path = "example_data/v2/A0A1I9GEU1_NEIME_Kennouche_2019/msa.fasta"
format = "fasta"
To get the MSA as a MultipleSeqAlignment object:
>>> from proteingym.base import Manifest, Dataset
>>> from Bio.Align import MultipleSeqAlignment
>>> mf = Manifest.from_path("example_data/neime_2019.toml")
>>> dataset = Dataset.from_manifest(mf)
>>> isinstance(dataset.msas[0].value, MultipleSeqAlignment) # The first MSA in the dataset
True
The NEIME Kennouche 2019 (UniProt id: A0A1I9GEU1) dataset is used as an example.
This dataset is stored in example_data/NEIME_2019 and contains the following:
Note
AssayMeta and DatasetMeta are only examples of possible metadata. Their current contents are not associated with the dataset and were not obtained from official sources.
Note
Assay.csv also contains a split column and an engineering round column. The engineering round is randomly assigned to 1, 2 or 3 for illustrative purposes; the original assay belongs to a single engineering round. The split column converts fold_random_5 from a k-fold split into a train/val/test split, with folds 0, 1 and 2 in train, 3 in val and 4 in test.
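The fold-to-split conversion described in the note above can be sketched as follows. This is a minimal illustration, not the code used to produce Assay.csv: the function name `fold_to_split` and the row layout are assumptions, while the `fold_random_5` column and the fold assignment (0, 1 and 2 to train, 3 to val, 4 to test) come from the note.

```python
def fold_to_split(fold: int) -> str:
    """Map a fold index from fold_random_5 to a train/val/test label."""
    if fold in (0, 1, 2):
        return "train"
    if fold == 3:
        return "val"
    if fold == 4:
        return "test"
    raise ValueError(f"unexpected fold index: {fold}")


# Hypothetical rows standing in for Assay.csv records.
rows = [{"fold_random_5": fold} for fold in range(5)]
for row in rows:
    row["split"] = fold_to_split(row["fold_random_5"])

print([row["split"] for row in rows])
```

Raising on an unknown fold index, rather than silently defaulting to one split, keeps malformed rows from leaking into the training set unnoticed.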
.
├── example_data
│   └── NEIME_2019
│       ├── A0A1I9GEU1.fasta          # Parent sequence
│       ├── AssayMeta.json            # Example of possible AssayMeta
│       ├── Assays
│       │   └── Assay.csv             # Tabular format of assay
│       ├── DataSetMeta.json          # Example of possible DatasetMeta
│       ├── MSA
│       │   ├── msa_weights.npy       # Weights file for the MSA as obtained from PG1
│       │   ├── msa.a2m               # MSA file in .a2m format
│       │   ├── msa.a3m               # MSA file in .a3m format
│       │   └── msa.psi               # MSA file in .psi format
│       └── Structures                # 5 types of example structures with different
│           │                         # file types and sources for examples:
│           ├── experimental.cif
│           ├── experimental.bcif
│           ├── experimental.pdb
│           ├── computational.cif
│           └── computational.pdb
For a full overview of available data see the following table:
| No. | Dataset name | Link to website | Relative path to manifest |
|---|---|---|---|
| 1. | NEIME2019 | www.proteingym.org | example_data/neime_2019.toml |