Below are all the arguments for `bigcodebench.evaluate` for the remote evaluation; an example invocation is shown after the list:
- `--tokenizer_legacy`: Whether to use the legacy tokenizer, defaults to `False`
- `--samples`: The path to the generated samples file, defaults to `None`
- `--local_execute`: Whether to execute the samples locally, defaults to `False`
- `--remote_execute_api`: The API endpoint for remote execution, defaults to `https://bigcode-bigcodebench-evaluator.hf.space/`. You can also use your own Gradio API endpoint by cloning the [bigcodebench-evaluator](https://huggingface.co/spaces/bigcode/bigcodebench-evaluator) repo and checking `Use via API` at the bottom of the HF Space page.
- `--pass_k`: The `k` in `Pass@k`, defaults to `[1, 5, 10]`, e.g. `--pass_k 1,5,10` will evaluate `Pass@1`, `Pass@5`, and `Pass@10`
- `--save_pass_rate`: Whether to save the pass rate to a file, defaults to `True`
- `--parallel`: The number of parallel processes, defaults to `-1`, e.g. `--parallel 10` will evaluate 10 samples in parallel
- `--min_time_limit`: The minimum time limit (in seconds) for execution, defaults to `1`, e.g. `--min_time_limit 10` gives each sample at least 10 seconds to run
- `--max_as_limit`: The maximum address space limit (in MB) for execution, defaults to `30*1024` (30 GB), e.g. `--max_as_limit 20*1024` caps the address space at 20 GB
- `--max_data_limit`: The maximum data segment limit (in MB) for execution, defaults to `30*1024` (30 GB), e.g. `--max_data_limit 20*1024` caps the data segment at 20 GB
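
For example, using only the flags documented above, a remote or local evaluation run might look like the following sketch (the samples filename is purely illustrative):

```bash
# Remote execution (default): the samples are sent to the public evaluator API.
bigcodebench.evaluate \
  --samples my_model--bigcodebench-complete--vllm-0-1-sanitized_calibrated.jsonl \
  --pass_k 1,5,10 \
  --parallel 10 \
  --min_time_limit 10

# Local execution: run the samples on your own machine instead of the remote API.
bigcodebench.evaluate \
  --samples my_model--bigcodebench-complete--vllm-0-1-sanitized_calibrated.jsonl \
  --local_execute
```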

The generated code samples will be stored in a file named `[model_name]--bigcodebench-[instruct|complete]--[backend]-[temp]-[n_samples]-sanitized_calibrated.jsonl`. Alternatively, you can utilize our pre-built docker images for generating code samples.

- **[2024-10-06]** We are releasing `bigcodebench==v0.2.0`!
- **[2024-10-05]** We have created a public code execution API on the [Hugging Face Space](https://huggingface.co/spaces/bigcode/bigcodebench-evaluator).
- **[2024-10-01]** We have evaluated 139 models on BigCodeBench-Hard so far. Take a look at the [leaderboard](https://huggingface.co/spaces/bigcode/bigcodebench-leaderboard)!
- **[2024-08-19]** To make the evaluation fully reproducible, we have added a real-time code execution session to the leaderboard. It can be viewed [here](https://huggingface.co/spaces/bigcode/bigcodebench-leaderboard).
BigCodeBench is an **_easy-to-use_** benchmark for solving **_practical_** and **_challenging_** tasks via code. It aims to evaluate the true programming capabilities of large language models (LLMs) in a more realistic setting. The benchmark is designed for HumanEval-like function-level code generation tasks, but with much more complex instructions and diverse function calls.

There are two splits in BigCodeBench:
- `Complete`: This split is designed for code completion based on the comprehensive docstrings.
- `Instruct`: This split works for instruction-tuned and chat models only, where the models are asked to generate a code snippet based on natural language instructions. The instructions contain only the necessary information and require more complex reasoning.

### Why BigCodeBench?
BigCodeBench focuses on task automation via code generation with *diverse function calls* and *complex instructions*.

To get started, please first set up the environment:

```bash
# By default, you will use the remote evaluation API to execute the output samples.
pip install bigcodebench --upgrade

# We suggest using `flash-attn` for generating code samples.
```

We use greedy decoding as an example to show how to evaluate the generated code samples via the remote API.

> [!Warning]
>
> To ease generation, we use batch inference by default. However, batch inference results can vary from *batch size to batch size* and from *version to version*, at least for the vLLM backend. If you want more deterministic results for greedy decoding, please set `--bs` to `1`.
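
For a more deterministic greedy run, a minimal sketch is shown below. Apart from `--bs`, the flag names used here (`--model`, `--split`, `--backend`, `--greedy`) are assumptions about the `bigcodebench.generate` CLI rather than options documented above, and the model name is only an example:

```bash
# Sketch only: apart from --bs, these flag names are assumed, not documented above.
bigcodebench.generate \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --split instruct \
  --backend vllm \
  --greedy \
  --bs 1
```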
We share pre-generated code samples from LLMs we have [evaluated](https://huggingface.co/spaces/bigcode/bigcodebench-leaderboard):
* See the attachment of our [v0.2.0.post3](https://github.com/bigcode-project/bigcodebench/releases/tag/v0.2.0.post3). We include `sanitized_samples_calibrated.zip` for your convenience.
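
As a rough sketch, assuming the attachment is published as a regular GitHub release asset and unpacks into per-model `.jsonl` files (the inner filename below is a placeholder), the downloaded samples could be evaluated directly:

```bash
# Assumed release-asset URL layout; replace <model_samples> with an actual file name.
wget https://github.com/bigcode-project/bigcodebench/releases/download/v0.2.0.post3/sanitized_samples_calibrated.zip
unzip sanitized_samples_calibrated.zip -d sanitized_samples_calibrated
bigcodebench.evaluate --samples sanitized_samples_calibrated/<model_samples>.jsonl
```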