ProteinGym2 Benchmark

A collection of variant effect prediction models, a webapp to list them, and a DVC pipeline for running benchmarking experiments against proteingym-base datasets.

Models

The models are included in the models folder, where each model occupies a subfolder that serves as its repo.

A model repo contains a README.md that serves as its model card, which comes in two parts:

  • Metadata: a YAML section at the top, i.e., front matter.
  • Description: the Markdown body, including a summary and details of the model.

For more information, refer to Hugging Face's model cards.
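
As a rough sketch, a model card might look like the following; the front-matter fields here are illustrative assumptions, not a required schema:

---
name: pls
license: mit
tags:
  - variant-effect-prediction
---

# PLS

A short summary of the model, followed by a longer description of its
inputs, training data, and usage.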

Datasets

The datasets are included in the datasets folder, where each dataset is an archive file with the pgdata suffix.

proteingym-base is used to build the archive file for each dataset.

You can refer to this guide to build the archived dataset.

Benchmark

The benchmark is defined in the benchmark folder, which contains two games: supervised and zero-shot. Each game has its selected list of models and datasets defined in dvc.yaml.

  • The supervised game is defined in this dvc.yaml.
  • The zero-shot game is defined in this dvc.yaml.

The models and datasets are defined in vars at the top, and DVC translates the vars into a matrix, i.e., a nested loop equivalent to the following pseudo-code:

# Stage 1: run every model on every dataset
for dataset in datasets:
    for model in models:
        predict()

# Stage 2: score the predictions for every dataset-model pair
for dataset in datasets:
    for model in models:
        calculate_metric()
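
For reference, here is a minimal sketch of how vars and matrix fit together in a dvc.yaml; the stage name, command, and paths are illustrative assumptions, not the repository's actual configuration:

vars:
  - datasets: [dataset_a, dataset_b]
  - models: [pls]

stages:
  predict:
    # DVC expands the matrix into one stage per (dataset, model) pair
    matrix:
      dataset: ${datasets}
      model: ${models}
    cmd: docker run --rm ${item.model} predict ${item.dataset}
    deps:
      - datasets/${item.dataset}.pgdata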

Prerequisites

Benchmarking a selected list of models and datasets requires the following:

  1. Generate your own datasets.json.
  2. Have Docker model images locally.
  3. Create your own models.json.

Step 1: Generate datasets.json

To generate datasets.json, you need to use the proteingym-base command:

  • proteingym-base list-datasets datasets lists all datasets under the datasets folder.
  • jq is used to filter the datasets.

proteingym-base list-datasets datasets | jq ... > benchmark/supervised/local/datasets.json
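
The exact schema of datasets.json is determined by proteingym-base and your jq filter; as a purely hypothetical sketch, each entry might pair a dataset name with the path to its pgdata archive:

{
  "datasets": [
    {
      "name": "my_dataset",
      "path": "datasets/my_dataset.pgdata"
    }
  ]
}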

For more information, you can check out CONTRIBUTING.md.

Step 2: Build a Docker model image

The DVC pipelines run against the local Docker images of the models. The local images can be either built from a Dockerfile or pulled from a remote Docker registry.

To build an image from a Dockerfile:

docker build \
  -f models/pls/Dockerfile \
  -t pls:latest \
  models/pls

To pull an image from a remote Docker registry:

docker pull <repo>/pls:latest
docker tag <repo>/pls:latest pls:latest

Step 3: Create models.json

An example models.json looks like the following, with each model defined by its name and its local image name:

{
  "models": [
    {
      "name": "pls",
      "image": "pls:latest"
    }
  ]
}

Getting started

With datasets.json and models.json present in each game's folder (supervised and zero_shot), and Docker running with the local model images, you can start benchmarking.

Supervised

You can benchmark a group of supervised models:

dvc repro benchmark/supervised/dvc.yaml -s

Zero-shot

You can benchmark a group of zero-shot models:

dvc repro benchmark/zero_shot/dvc.yaml -s

Note

By default, all pipelines configured by a dvc.yaml are recursively checked when executing dvc repro. As a result, if either datasets.json or models.json is missing in any pipeline, an error is thrown. The --single-item (-s) option restricts what gets checked by turning off the recursive search for changed dependencies across all pipelines.

For example, if you run dvc repro ... -s in the supervised folder, only the datasets.json and models.json in the supervised folder are checked as dependencies of its dvc.yaml; the zero_shot folder is excluded.

Tip

To run specific parts of the pipeline with DVC, you can run dvc repro --downstream <stage_name>. For example, dvc repro --downstream calculate_metric.

Tip

To ignore cache and run anew, you can run dvc repro --force.

Tip

By default, DVC stops execution when any stage fails. If one dataset-model pair's metric calculation fails (e.g., due to a missing prediction file, a script error, or invalid data), DVC halts the entire pipeline run. To prevent this blocking behavior, you can use dvc repro --keep-going, which tells DVC to continue executing other stages even if some fail.

CML pipeline

The CML (Continuous Machine Learning) pipeline is configured in cml.yaml and is triggered every time a PR is submitted.
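
As a rough sketch, and assuming the pipeline runs on GitHub Actions, such a workflow might look like the following; the job and step names are illustrative assumptions, and the actual steps live in cml.yaml:

name: CML
on: [pull_request]
jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: iterative/setup-cml@v2
      - uses: iterative/setup-dvc@v1
      # Run the benchmark, then post the resulting metrics on the PR
      - run: dvc repro benchmark/supervised/dvc.yaml -s
      - env:
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: cml comment create metrics.csv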

Important

If you add a new dataset in datasets or a new model in models, please also update datasets.json or models.json, respectively, in the supervised folder and/or the zero_shot folder.

  • For datasets, keep the folder path /home/runner/work/proteingym-benchmark/proteingym-benchmark/datasets/ (this is where the datasets are located on the runner) and change only the file name.
  • For models, the image name has the format <model_folder_name>:latest, where <model_folder_name> is the root folder name of the model in models.
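
For example, hypothetical new entries might look like the following; the model entry follows the models.json format shown earlier, while the dataset field names are assumptions:

{
  "name": "my_new_dataset",
  "path": "/home/runner/work/proteingym-benchmark/proteingym-benchmark/datasets/my_new_dataset.pgdata"
}

{
  "name": "my_new_model",
  "image": "my_new_model:latest"
}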

You can find the latest metrics in metrics.csv, the single source of truth: the latest CML pipeline commits the metrics back to the main branch once the PR is merged.
