Models live in the `models` folder, where each model occupies its own subfolder as its repo.
A model repo contains a `README.md` as its model card, which has two parts:
- Metadata: a YAML section at the top, i.e., front matter.
- Text description: the Markdown body, including a summary and descriptions of the model.
For more information, you can refer to Hugging Face's model cards.
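For illustration, a minimal model card might look like the following sketch (the front-matter fields and their values are illustrative, not a schema required by this repo):

```markdown
---
license: mit
tags:
  - protein-language-model
---

# pls

A one-paragraph summary of the model, followed by longer
descriptions of its architecture, training data, and usage.
```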
Datasets live in the `datasets` folder, where each dataset is an archived file with the suffix `.pgdata`.
The archived file for each dataset is built with `proteingym-base`.
You can follow this guide to build the archived dataset.
The benchmark is defined in the `benchmark` folder, which contains two games: `supervised` and `zero_shot`. Each game has its selected list of models and datasets defined in its `dvc.yaml`.
The models and datasets are declared in `vars` at the top, and DVC expands `vars` into a `matrix`, which is effectively the nested loop shown in the following pseudo-code:
```
for dataset in datasets:
    for model in models:
        predict()

for dataset in datasets:
    for model in models:
        calculate_metric()
```
To benchmark a selected list of models and datasets, you need to:
- Generate your own `datasets.json`.
- Have the Docker model images locally.
- Create your own `models.json`.
To generate `datasets.json`, use the `proteingym-base` command: `proteingym-base list-datasets datasets` lists all datasets under the `datasets` folder, and `jq` is used to filter them:

```shell
proteingym-base list-datasets datasets | jq ... > benchmark/supervised/local/datasets.json
```

For more information, you can check out CONTRIBUTING.md.
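For instance, assuming `list-datasets` emits a JSON array of dataset entries with a `name` field (an assumption about the output format; the actual filter depends on your selection), a concrete filter might look like:

```shell
# Hypothetical filter: keep only datasets whose name contains "DMS",
# assuming the listing is a JSON array of objects with a "name" field
proteingym-base list-datasets datasets \
  | jq '[.[] | select(.name | test("DMS"))]' \
  > benchmark/supervised/local/datasets.json
```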
The DVC pipelines run against the local Docker images of the models. A local image can either be built from a Dockerfile or pulled from a remote Docker registry.
To build an image from a Dockerfile:

```shell
docker build \
  -f models/pls/Dockerfile \
  -t pls:latest \
  models/pls
```

To pull an image from a remote Docker registry:
```shell
docker pull <repo>/pls:latest
docker tag <repo>/pls:latest pls:latest
```
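Either way, you can confirm the image is available locally before running the pipelines:

```shell
docker image ls pls:latest
```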
An example `models.json` looks like the following, with each model defined by its name and local image name:

```json
{
  "models": [
    {
      "name": "pls",
      "image": "pls:latest"
    }
  ]
}
```

With `datasets.json` and `models.json` present in each game's folder (namely `supervised` and `zero_shot`), and Docker running with the local model images, you can start benchmarking.
You can benchmark a group of supervised models:

```shell
dvc repro benchmark/supervised/dvc.yaml -s
```

You can benchmark a group of zero-shot models:

```shell
dvc repro benchmark/zero_shot/dvc.yaml -s
```

**Note**
By default, all pipelines configured by a `dvc.yaml` are recursively checked when executing `dvc repro`. As a result, if either `datasets.json` or `models.json` is missing in any pipeline, an error is thrown. The option `--single-item` (`-s`) restricts what gets checked by turning off the recursive search for changed dependencies across all pipelines.
For example, if you run `dvc repro ... -s` in the `supervised` folder, only the `datasets.json` and `models.json` in the `supervised` folder are checked as dependencies of its `dvc.yaml`, excluding the `zero_shot` folder.
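A minimal way to exercise this, assuming you run from inside a game's folder (DVC picks up the `dvc.yaml` in the current directory):

```shell
cd benchmark/supervised
dvc repro -s
```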
**Tip**
To run specific parts of the pipeline with DVC, you can run `dvc repro --downstream <stage_name>`. For example, `dvc repro --downstream calculate_metric`.
**Tip**
To ignore the cache and run anew, you can run `dvc repro --force`.
**Tip**
By default, DVC stops execution when any stage fails. If one dataset-model pair's metric calculation fails (e.g., due to a missing prediction file, a script error, or invalid data), DVC halts the entire pipeline run. To prevent this blocking behavior, use `dvc repro --keep-going`. This flag tells DVC to continue executing other stages even if some fail.
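For example, to run a full supervised sweep that keeps going past individual failures, the flags shown above can be combined:

```shell
dvc repro benchmark/supervised/dvc.yaml -s --keep-going
```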
The CML (Continuous Machine Learning) pipeline is configured in `cml.yaml` and is triggered every time a PR is submitted.
**Important**
If you add a new dataset in `datasets` or a new model in `models`, please also update `datasets.json` and `models.json` respectively, in either the `supervised` folder or the `zero_shot` folder.
- For datasets, keep the folder prefix `/home/runner/work/proteingym-benchmark/proteingym-benchmark/datasets/` (as this is the path where the dataset is located on the runner) and only change your file name.
- For models, the image is in the format `<model_folder_name>:latest`, where `model_folder_name` is the root folder name of each model in `models`.
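As a sketch only (the actual schema of `datasets.json` is defined by `proteingym-base`; the `"datasets"` key and entry shape here are assumptions), a new entry might look like:

```json
{
  "datasets": [
    "/home/runner/work/proteingym-benchmark/proteingym-benchmark/datasets/my_new_dataset.pgdata"
  ]
}
```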
You can find the latest metrics in `metrics.csv`, the single source of truth: once a PR is merged, the CML pipeline commits the metrics back to the main branch.