Commit 6ffa085

update doc
1 parent 11c0080 commit 6ffa085

File tree

3 files changed: +21 / -4 lines changed

ADVANCED_USAGE.md

Lines changed: 4 additions & 1 deletion
@@ -56,6 +56,8 @@ Below are all the arguments for `bigcodebench.evaluate` for the remote evaluation
 - `--resume`: Whether to resume the evaluation, default to `True`, set to `False` to re-run the evaluation
 - `--id_range`: The range of the tasks to evaluate, default to `None`, e.g. `--id_range 10-20` will evaluate the tasks from 10 to 20
 - `--backend`: The backend to use, default to `vllm`
+- `--execution`: The execution backend to use, default to `gradio`. You can choose from `e2b`, `gradio`, `local`.
+- `--reasoning_effort`: The reasoning effort to use, default to `medium`. You can choose from `easy`, `medium`, `hard` for `o1`, `o3` and `deepseek-reasoner` (soon) models.
 - `--base_url`: The base URL of the backend for OpenAI-compatible APIs, default to `None`
 - `--instruction_prefix`: The instruction prefix for the Anthropic backend, default to `None`
 - `--response_prefix`: The response prefix for the Anthropic backend, default to `None`

@@ -67,7 +69,7 @@ Below are all the arguments for `bigcodebench.evaluate` for the remote evaluation
 - `--samples`: The path to the generated samples file, default to `None`
 - `--no_execute`: Whether to not execute the samples, default to `False`
 - `--local_execute`: Whether to execute the samples locally, default to `False`
-- `--remote_execute_api`: The API endpoint for remote execution, default to `https://bigcode-bigcodebench-evaluator.hf.space/`, you can also use your own Gradio API endpoint by cloning the [bigcodebench-evaluator](https://huggingface.co/spaces/bigcode/bigcodebench-evaluator) repo and check `Use via API` at the bottom of the HF space page.
+- `--remote_execute_api`: The API endpoint for remote execution, default to `https://bigcode-bigcodebench-evaluator.hf.space/`, you can also use your own Gradio API endpoint by cloning the [bigcodebench-evaluator](https://huggingface.co/spaces/bigcode/bigcodebench-evaluator) repo and check `Use via API` at the bottom of the HF space page
 - `--pass_k`: The `k` in `Pass@k`, default to `[1, 5, 10]`, e.g. `--pass_k 1,5,10` will evaluate `Pass@1`, `Pass@5` and `Pass@10`
 - `--calibrated`: Whether to use the calibrated samples, default to `True`
 - `--save_pass_rate`: Whether to save the pass rate to a file, default to `True`

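
The `--remote_execute_api` entry above mentions pointing the evaluator at your own Gradio endpoint; a minimal sketch of what that might look like is shown below (the space URL and the model are illustrative placeholders, not values from this commit):

```bash
# Hypothetical run against a self-hosted copy of the bigcodebench-evaluator space.
# Replace <your-username> with your own Hugging Face namespace.
bigcodebench.evaluate \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --split complete \
  --subset full \
  --backend vllm \
  --remote_execute_api https://<your-username>-bigcodebench-evaluator.hf.space/
```
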
@@ -76,6 +78,7 @@ Below are all the arguments for `bigcodebench.evaluate` for the remote evaluation
 - `--max_as_limit`: The maximum address space limit for the execution, default to `30*1024` (30 GB), e.g. `--max_as_limit 20*1024` will evaluate the samples with at most 20 GB
 - `--max_data_limit`: The maximum data segment limit for the execution, default to `30*1024` (30 GB), e.g. `--max_data_limit 20*1024` will evaluate the samples with at most 20 GB
 - `--max_stack_limit`: The maximum stack limit for the execution, default to `10` (10 MB), e.g. `--max_stack_limit 20` will evaluate the samples with at most 20 MB
+- `--selective_evaluate`: The subset of the dataset to evaluate, default to `""`. You can pass the indices of the tasks to evaluate, e.g. `--selective_evaluate 1,2,3` will evaluate BigCodeBench/1, BigCodeBench/2 and BigCodeBench/3
 - `--check_gt_only`: Whether to only check the ground truths, default to `False`
 - `--no_gt`: Whether to not check the ground truths, default to `False`
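
As a rough illustration of the flags added in this file, the new options might be combined as follows (a sketch only; the model, backend, split and subset are assumptions, not part of the commit):

```bash
# Hypothetical invocation combining the newly documented options:
# use the e2b execution backend, request medium reasoning effort,
# and only evaluate BigCodeBench/1, BigCodeBench/2 and BigCodeBench/3.
bigcodebench.evaluate \
  --model o1 \
  --split complete \
  --subset hard \
  --backend openai \
  --execution e2b \
  --reasoning_effort medium \
  --selective_evaluate 1,2,3
```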

README.md

Lines changed: 16 additions & 2 deletions
@@ -12,7 +12,6 @@
 <a href="https://pepy.tech/project/bigcodebench"><img src="https://static.pepy.tech/badge/bigcodebench"></a>
 <a href="https://github.com/bigcodebench/bigcodebench/blob/master/LICENSE"><img src="https://img.shields.io/pypi/l/bigcodebench"></a>
 <a href="https://hub.docker.com/r/bigcodebench/bigcodebench-evaluate" title="Docker-Eval"><img src="https://img.shields.io/docker/image-size/bigcodebench/bigcodebench-evaluate"></a>
-<a href="https://hub.docker.com/r/bigcodebench/bigcodebench-generate" title="Docker-Gen"><img src="https://img.shields.io/docker/image-size/bigcodebench/bigcodebench-generate"></a>
 </p>

 <p align="center">

@@ -40,6 +39,7 @@ BigCodeBench has been trusted by many LLM teams including:
 - Allen Institute for Artificial Intelligence (AI2)

 ## 📰 News
+- **[2025-01-22]** We are releasing `bigcodebench==v0.2.2.dev2`, with 163 models evaluated!
 - **[2024-10-06]** We are releasing `bigcodebench==v0.2.0`!
 - **[2024-10-05]** We create a public code execution API on the [Hugging Face space](https://huggingface.co/spaces/bigcode/bigcodebench-evaluator).
 - **[2024-10-01]** We have evaluated 139 models on BigCodeBench-Hard so far. Take a look at the [leaderboard](https://huggingface.co/spaces/bigcode/bigcodebench-leaderboard)!

@@ -111,11 +111,13 @@ We use the greedy decoding as an example to show how to evaluate the generated c

 > [!Note]
 >
-> Remotely executing on `BigCodeBench-Full` typically takes 6-7 minutes, and on `BigCodeBench-Hard` typically takes 4-5 minutes.
+> The `gradio` backend on `BigCodeBench-Full` typically takes 6-7 minutes, and on `BigCodeBench-Hard` typically takes 4-5 minutes.
+> The `e2b` backend with the default machine on `BigCodeBench-Full` typically takes 25-30 minutes, and on `BigCodeBench-Hard` typically takes 15-20 minutes.

 ```bash
 bigcodebench.evaluate \
   --model meta-llama/Meta-Llama-3.1-8B-Instruct \
+  --execution [e2b|gradio|local] \
   --split [complete|instruct] \
   --subset [full|hard] \
   --backend [vllm|openai|anthropic|google|mistral|hf]

@@ -126,6 +128,12 @@ bigcodebench.evaluate \
 - The evaluation results will be stored in a file named `[model_name]--bigcodebench-[instruct|complete]--[backend]-[temp]-[n_samples]-sanitized_calibrated_eval_results.json`.
 - The pass@k results will be stored in a file named `[model_name]--bigcodebench-[instruct|complete]--[backend]-[temp]-[n_samples]-sanitized_calibrated_pass_at_k.json`.

+> [!Note]
+>
+> The `gradio` backend is hosted on the [Hugging Face space](https://huggingface.co/spaces/bigcode/bigcodebench-evaluator) by default.
+> The default space can sometimes be slow, so we recommend using the `e2b` backend for faster evaluation.
+> Otherwise, you can also run the `e2b` sandbox for evaluation, which is also fairly slow on the default machine.
+
 > [!Note]
 >
 > BigCodeBench uses different prompts for base and chat models.

@@ -136,6 +144,12 @@ bigcodebench.evaluate \
 > please add `--direct_completion` to avoid being evaluated
 > in a chat mode.

+To use E2B, you need to set up an account and get an API key from [E2B](https://e2b.dev/).
+
+```bash
+export E2B_API_KEY=<your_e2b_api_key>
+```
+
 Access OpenAI APIs from [OpenAI Console](https://platform.openai.com/)
 ```bash
 export OPENAI_API_KEY=<your_openai_api_key>
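
Putting the README additions together, an end-to-end remote run on the `e2b` execution backend might look like the sketch below (the key value is the placeholder already used above, and the model is simply the README's example):

```bash
# Hypothetical end-to-end run using the e2b execution backend described above.
export E2B_API_KEY=<your_e2b_api_key>

bigcodebench.evaluate \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --execution e2b \
  --split complete \
  --subset hard \
  --backend vllm
```

Per the timing note above, expect this to run noticeably longer than the hosted `gradio` backend on the default e2b machine.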

bigcodebench/evaluate.py

Lines changed: 1 addition & 1 deletion
@@ -119,7 +119,7 @@ def evaluate(
     subset: str,
     samples: Optional[str] = None,
     no_execute: bool = False,
-    execution: str = "e2b", # "e2b", "gradio", "local"
+    execution: str = "gradio", # "e2b", "gradio", "local"
     selective_evaluate: str = "",
     e2b_endpoint: str = "bigcodebench_evaluator",
     gradio_endpoint: str = "https://bigcode-bigcodebench-evaluator.hf.space/",
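
Since this commit flips the default `execution` backend in `evaluate()` from `e2b` to `gradio`, a plain invocation now goes through the hosted Gradio evaluator unless a backend is chosen explicitly; a small sketch (model, split and subset are placeholders):

```bash
# After this commit, omitting --execution uses the hosted gradio backend by default.
bigcodebench.evaluate \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --split complete \
  --subset full \
  --backend vllm

# To keep the previous default behavior, select the e2b sandbox explicitly.
bigcodebench.evaluate \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --split complete \
  --subset full \
  --backend vllm \
  --execution e2b
```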
