Commit 6ffa085

update doc
1 parent 11c0080 commit 6ffa085

File tree

3 files changed: +21 / -4 lines changed

ADVANCED_USAGE.md

Lines changed: 4 additions & 1 deletion
@@ -56,6 +56,8 @@ Below are all the arguments for `bigcodebench.evaluate` for the remote evaluation
 - `--resume`: Whether to resume the evaluation, default to `True`, set to `False` to re-run the evaluation
 - `--id_range`: The range of the tasks to evaluate, default to `None`, e.g. `--id_range 10-20` will evaluate the tasks from 10 to 20
 - `--backend`: The backend to use, default to `vllm`
+- `--execution`: The execution backend to use, default to `gradio`. You can choose from `e2b`, `gradio`, `local`.
+- `--reasoning_effort`: The reasoning effort to use, default to `medium`. You can choose from `easy`, `medium`, `hard` for `o1`, `o3` and `deepseek-reasoner` (soon) models.
 - `--base_url`: The base URL of the backend for OpenAI-compatible APIs, default to `None`
 - `--instruction_prefix`: The instruction prefix for the Anthropic backend, default to `None`
 - `--response_prefix`: The response prefix for the Anthropic backend, default to `None`

@@ -67,7 +69,7 @@ Below are all the arguments for `bigcodebench.evaluate` for the remote evaluation
 - `--samples`: The path to the generated samples file, default to `None`
 - `--no_execute`: Whether to not execute the samples, default to `False`
 - `--local_execute`: Whether to execute the samples locally, default to `False`
-- `--remote_execute_api`: The API endpoint for remote execution, default to `https://bigcode-bigcodebench-evaluator.hf.space/`, you can also use your own Gradio API endpoint by cloning the [bigcodebench-evaluator](https://huggingface.co/spaces/bigcode/bigcodebench-evaluator) repo and check `Use via API` at the bottom of the HF space page.
+- `--remote_execute_api`: The API endpoint for remote execution, default to `https://bigcode-bigcodebench-evaluator.hf.space/`, you can also use your own Gradio API endpoint by cloning the [bigcodebench-evaluator](https://huggingface.co/spaces/bigcode/bigcodebench-evaluator) repo and check `Use via API` at the bottom of the HF space page
 - `--pass_k`: The `k` in `Pass@k`, default to `[1, 5, 10]`, e.g. `--pass_k 1,5,10` will evaluate `Pass@1`, `Pass@5` and `Pass@10`
 - `--calibrated`: Whether to use the calibrated samples, default to `True`
 - `--save_pass_rate`: Whether to save the pass rate to a file, default to `True`

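
The `--remote_execute_api` entry above mentions pointing the evaluator at your own Gradio endpoint; a minimal sketch of what that might look like is shown below (the space URL and the model are illustrative placeholders, not values from this commit):

```bash
# Hypothetical run against a self-hosted copy of the bigcodebench-evaluator space.
# Replace <your-username> with your own Hugging Face namespace.
bigcodebench.evaluate \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --split complete \
  --subset full \
  --backend vllm \
  --remote_execute_api https://<your-username>-bigcodebench-evaluator.hf.space/
```
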
@@ -76,6 +78,7 @@ Below are all the arguments for `bigcodebench.evaluate` for the remote evaluation
 - `--max_as_limit`: The maximum address space limit for the execution, default to `30*1024` (30 GB), e.g. `--max_as_limit 20*1024` will evaluate the samples with at most 20 GB
 - `--max_data_limit`: The maximum data segment limit for the execution, default to `30*1024` (30 GB), e.g. `--max_data_limit 20*1024` will evaluate the samples with at most 20 GB
 - `--max_stack_limit`: The maximum stack limit for the execution, default to `10` (10 MB), e.g. `--max_stack_limit 20` will evaluate the samples with at most 20 MB
+- `--selective_evaluate`: The subset of the dataset to evaluate, default to `""`. You can pass the indices of the tasks to evaluate, e.g. `--selective_evaluate 1,2,3` will evaluate BigCodeBench/1, BigCodeBench/2 and BigCodeBench/3
 - `--check_gt_only`: Whether to only check the ground truths, default to `False`
 - `--no_gt`: Whether to not check the ground truths, default to `False`
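
As a rough illustration of the flags added in this file, the new options might be combined as follows (a sketch only; the model, backend, split and subset are assumptions, not part of the commit):

```bash
# Hypothetical invocation combining the newly documented options:
# use the e2b execution backend, request medium reasoning effort,
# and only evaluate BigCodeBench/1, BigCodeBench/2 and BigCodeBench/3.
bigcodebench.evaluate \
  --model o1 \
  --split complete \
  --subset hard \
  --backend openai \
  --execution e2b \
  --reasoning_effort medium \
  --selective_evaluate 1,2,3
```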

README.md

Lines changed: 16 additions & 2 deletions
@@ -12,7 +12,6 @@
 <a href="https://pepy.tech/project/bigcodebench"><img src="https://static.pepy.tech/badge/bigcodebench"></a>
 <a href="https://github.com/bigcodebench/bigcodebench/blob/master/LICENSE"><img src="https://img.shields.io/pypi/l/bigcodebench"></a>
 <a href="https://hub.docker.com/r/bigcodebench/bigcodebench-evaluate" title="Docker-Eval"><img src="https://img.shields.io/docker/image-size/bigcodebench/bigcodebench-evaluate"></a>
-<a href="https://hub.docker.com/r/bigcodebench/bigcodebench-generate" title="Docker-Gen"><img src="https://img.shields.io/docker/image-size/bigcodebench/bigcodebench-generate"></a>
 </p>

 <p align="center">

@@ -40,6 +39,7 @@ BigCodeBench has been trusted by many LLM teams including:
 - Allen Institute for Artificial Intelligence (AI2)

 ## 📰 News
+- **[2025-01-22]** We are releasing `bigcodebench==v0.2.2.dev2`, with 163 models evaluated!
 - **[2024-10-06]** We are releasing `bigcodebench==v0.2.0`!
 - **[2024-10-05]** We create a public code execution API on the [Hugging Face space](https://huggingface.co/spaces/bigcode/bigcodebench-evaluator).
 - **[2024-10-01]** We have evaluated 139 models on BigCodeBench-Hard so far. Take a look at the [leaderboard](https://huggingface.co/spaces/bigcode/bigcodebench-leaderboard)!

@@ -111,11 +111,13 @@ We use the greedy decoding as an example to show how to evaluate the generated c

 > [!Note]
 >
-> Remotely executing on `BigCodeBench-Full` typically takes 6-7 minutes, and on `BigCodeBench-Hard` typically takes 4-5 minutes.
+> The `gradio` backend on `BigCodeBench-Full` typically takes 6-7 minutes, and on `BigCodeBench-Hard` typically takes 4-5 minutes.
+> The `e2b` backend with the default machine on `BigCodeBench-Full` typically takes 25-30 minutes, and on `BigCodeBench-Hard` typically takes 15-20 minutes.

 ```bash
 bigcodebench.evaluate \
   --model meta-llama/Meta-Llama-3.1-8B-Instruct \
+  --execution [e2b|gradio|local] \
   --split [complete|instruct] \
   --subset [full|hard] \
   --backend [vllm|openai|anthropic|google|mistral|hf]

@@ -126,6 +128,12 @@ bigcodebench.evaluate \
 - The evaluation results will be stored in a file named `[model_name]--bigcodebench-[instruct|complete]--[backend]-[temp]-[n_samples]-sanitized_calibrated_eval_results.json`.
 - The pass@k results will be stored in a file named `[model_name]--bigcodebench-[instruct|complete]--[backend]-[temp]-[n_samples]-sanitized_calibrated_pass_at_k.json`.

+> [!Note]
+>
+> The `gradio` backend is hosted on the [Hugging Face space](https://huggingface.co/spaces/bigcode/bigcodebench-evaluator) by default.
+> The default space can sometimes be slow, so we recommend using the `e2b` backend for faster evaluation.
+> Otherwise, you can also run the `e2b` sandbox for evaluation, which is also fairly slow on the default machine.
+
 > [!Note]
 >
 > BigCodeBench uses different prompts for base and chat models.

@@ -136,6 +144,12 @@ bigcodebench.evaluate \
 > please add `--direct_completion` to avoid being evaluated
 > in a chat mode.

+To use E2B, you need to set up an account and get an API key from [E2B](https://e2b.dev/).
+
+```bash
+export E2B_API_KEY=<your_e2b_api_key>
+```
+
 Access OpenAI APIs from [OpenAI Console](https://platform.openai.com/)
 ```bash
 export OPENAI_API_KEY=<your_openai_api_key>
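
Putting the README additions together, an end-to-end remote run on the `e2b` execution backend might look like the sketch below (the key value is the placeholder already used above, and the model is simply the README's example):

```bash
# Hypothetical end-to-end run using the e2b execution backend described above.
export E2B_API_KEY=<your_e2b_api_key>

bigcodebench.evaluate \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --execution e2b \
  --split complete \
  --subset hard \
  --backend vllm
```

Per the timing note above, expect this to run noticeably longer than the hosted `gradio` backend on the default e2b machine.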

bigcodebench/evaluate.py

Lines changed: 1 addition & 1 deletion
@@ -119,7 +119,7 @@ def evaluate(
     subset: str,
     samples: Optional[str] = None,
     no_execute: bool = False,
-    execution: str = "e2b", # "e2b", "gradio", "local"
+    execution: str = "gradio", # "e2b", "gradio", "local"
     selective_evaluate: str = "",
     e2b_endpoint: str = "bigcodebench_evaluator",
     gradio_endpoint: str = "https://bigcode-bigcodebench-evaluator.hf.space/",
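
Since this commit flips the default `execution` backend in `evaluate()` from `e2b` to `gradio`, a plain invocation now goes through the hosted Gradio evaluator unless a backend is chosen explicitly; a small sketch (model, split and subset are placeholders):

```bash
# After this commit, omitting --execution uses the hosted gradio backend by default.
bigcodebench.evaluate \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --split complete \
  --subset full \
  --backend vllm

# To keep the previous default behavior, select the e2b sandbox explicitly.
bigcodebench.evaluate \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --split complete \
  --subset full \
  --backend vllm \
  --execution e2b
```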
