ADVANCED_USAGE.md
@@ -56,6 +56,8 @@ Below are all the arguments for `bigcodebench.evaluate` for the remote evaluation
- `--resume`: Whether to resume the evaluation, defaults to `True`; set to `False` to re-run the evaluation
- `--id_range`: The range of tasks to evaluate, defaults to `None`, e.g. `--id_range 10-20` will evaluate tasks 10 to 20
- `--backend`: The backend to use, defaults to `vllm`
- `--execution`: The execution backend to use, defaults to `gradio`. You can choose from `e2b`, `gradio`, and `local`.
- `--reasoning_effort`: The reasoning effort to use, defaults to `medium`. You can choose from `easy`, `medium`, and `hard` for the `o1`, `o3`, and `deepseek-reasoner` (soon) models.
- `--base_url`: The base URL of the backend for OpenAI-compatible APIs, defaults to `None`
- `--instruction_prefix`: The instruction prefix for the Anthropic backend, defaults to `None`
- `--response_prefix`: The response prefix for the Anthropic backend, defaults to `None`
@@ -67,7 +69,7 @@ Below are all the arguments for `bigcodebench.evaluate` for the remote evaluation
- `--samples`: The path to the generated samples file, defaults to `None`
- `--no_execute`: Whether to skip executing the samples, defaults to `False`
- `--local_execute`: Whether to execute the samples locally, defaults to `False`
- `--remote_execute_api`: The API endpoint for remote execution, defaults to `https://bigcode-bigcodebench-evaluator.hf.space/`. You can also use your own Gradio API endpoint by cloning the [bigcodebench-evaluator](https://huggingface.co/spaces/bigcode/bigcodebench-evaluator) repo and checking `Use via API` at the bottom of the HF Space page
- `--pass_k`: The `k` in `Pass@k`, defaults to `[1, 5, 10]`, e.g. `--pass_k 1,5,10` will evaluate `Pass@1`, `Pass@5`, and `Pass@10`
- `--calibrated`: Whether to use the calibrated samples, defaults to `True`
- `--save_pass_rate`: Whether to save the pass rate to a file, defaults to `True`
@@ -76,6 +78,7 @@ Below are all the arguments for `bigcodebench.evaluate` for the remote evaluation
- `--max_as_limit`: The maximum address space limit for execution, defaults to `30*1024` (30 GB), e.g. `--max_as_limit 20*1024` will evaluate the samples with at most 20 GB
- `--max_data_limit`: The maximum data segment limit for execution, defaults to `30*1024` (30 GB), e.g. `--max_data_limit 20*1024` will evaluate the samples with at most 20 GB
- `--max_stack_limit`: The maximum stack limit for execution, defaults to `10` (10 MB), e.g. `--max_stack_limit 20` will evaluate the samples with at most 20 MB
- `--selective_evaluate`: The subset of the dataset to evaluate, defaults to `""`. You can pass the indices of the tasks to evaluate, e.g. `--selective_evaluate 1,2,3` will evaluate BigCodeBench/1, BigCodeBench/2, and BigCodeBench/3
- `--check_gt_only`: Whether to only check the ground truths, defaults to `False`
- `--no_gt`: Whether to skip checking the ground truths, defaults to `False`
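
Putting a few of these options together, a rough sketch of a remote evaluation run might look like the following; the samples path is a placeholder, and any other required arguments (e.g. model or split selection) are omitted here:

```bash
# Sketch of a remote evaluation run; the samples path is a placeholder.
bigcodebench.evaluate \
  --execution gradio \
  --samples <path_to_samples.jsonl> \
  --pass_k 1,5,10 \
  --selective_evaluate 1,2,3   # optional: only evaluate BigCodeBench/1, /2, /3
```
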
@@ -40,6 +39,7 @@ BigCodeBench has been trusted by many LLM teams including:
- Allen Institute for Artificial Intelligence (AI2)
## 📰 News
- **[2025-01-22]** We are releasing `bigcodebench==v0.2.2.dev2`, with 163 models evaluated!
- **[2024-10-06]** We are releasing `bigcodebench==v0.2.0`!
- **[2024-10-05]** We created a public code execution API on the [Hugging Face Space](https://huggingface.co/spaces/bigcode/bigcodebench-evaluator).
- **[2024-10-01]** We have evaluated 139 models on BigCodeBench-Hard so far. Take a look at the [leaderboard](https://huggingface.co/spaces/bigcode/bigcodebench-leaderboard)!
@@ -111,11 +111,13 @@ We use greedy decoding as an example to show how to evaluate the generated code
> [!Note]
>
> The `gradio` backend on `BigCodeBench-Full` typically takes 6-7 minutes, and on `BigCodeBench-Hard` typically takes 4-5 minutes.
> The `e2b` backend with the default machine on `BigCodeBench-Full` typically takes 25-30 minutes, and on `BigCodeBench-Hard` typically takes 15-20 minutes.

- The evaluation results will be stored in a file named `[model_name]--bigcodebench-[instruct|complete]--[backend]-[temp]-[n_samples]-sanitized_calibrated_eval_results.json`.
- The pass@k results will be stored in a file named `[model_name]--bigcodebench-[instruct|complete]--[backend]-[temp]-[n_samples]-sanitized_calibrated_pass_at_k.json`.

> [!Note]
>
> The `gradio` backend is hosted on the [Hugging Face Space](https://huggingface.co/spaces/bigcode/bigcodebench-evaluator) by default.
> The default Space can sometimes be slow, so we recommend using the `e2b` backend instead; note that the `e2b` sandbox is also fairly slow on its default machine.
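
Alternatively, as noted in the argument list above, you can clone the evaluator Space and point `--remote_execute_api` at your own Gradio endpoint. A minimal sketch, where the endpoint URL is a placeholder for your own clone and other required arguments are omitted:

```bash
# Placeholder endpoint: replace with the Gradio API URL of your own evaluator Space clone.
bigcodebench.evaluate \
  --execution gradio \
  --remote_execute_api https://<your-username>-bigcodebench-evaluator.hf.space/ \
  --samples <path_to_samples.jsonl>
```
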
> [!Note]
>
> BigCodeBench uses different prompts for base and chat models.
@@ -136,6 +144,12 @@ bigcodebench.evaluate \
> please add `--direct_completion` to avoid being evaluated in a chat mode.
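
As a sketch of the note above (the samples path is a placeholder and other required arguments are omitted), a base-model evaluation would simply add the flag:

```bash
# Add --direct_completion so a base (non-chat) model is not evaluated in chat mode.
bigcodebench.evaluate \
  --direct_completion \
  --samples <path_to_samples.jsonl>
```
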
To use E2B, you need to set up an account and get an API key from [E2B](https://e2b.dev/).
```bash
export E2B_API_KEY=<your_e2b_api_key>
```
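
With the key exported, switching to the E2B execution backend is a matter of passing `--execution e2b`; a minimal sketch, with the samples path as a placeholder and other required arguments omitted:

```bash
# Run the evaluation in an E2B sandbox instead of the Gradio Space.
bigcodebench.evaluate \
  --execution e2b \
  --samples <path_to_samples.jsonl>
```
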
Access OpenAI APIs from [OpenAI Console](https://platform.openai.com/)