
Commit 2b73125 (merge commit with 2 parents: 1126230 + 817e63b)
File tree: 6 files changed, +25 −30 lines


ADVANCED_USAGE.md

Lines changed: 8 additions & 16 deletions

@@ -3,25 +3,19 @@
 To get started, please first set up the environment:
 
 ```bash
-# Install to use bigcodebench.evaluate
-pip install bigcodebench --upgrade
-# If you want to use the evaluate locally, you need to install the requirements
+# If you want to run the evaluation locally, you need to install the requirements in an isolated environment
 pip install -I -r https://raw.githubusercontent.com/bigcode-project/bigcodebench/main/Requirements/requirements-eval.txt
 
-# Install to use bigcodebench.generate
-# You are strongly recommended to install the generate dependencies in a separate environment
-pip install bigcodebench[generate] --upgrade
+# You are strongly recommended to install the bigcodebench dependencies in another environment
+pip install bigcodebench --upgrade
 ```
 
 <details><summary>⏬ Install nightly version <i>:: click to expand ::</i></summary>
 <div>
 
 ```bash
-# Install to use bigcodebench.evaluate
+# Install to use bigcodebench
 pip install "git+https://github.com/bigcode-project/bigcodebench.git" --upgrade
-
-# Install to use bigcodebench.generate
-pip install "git+https://github.com/bigcode-project/bigcodebench.git#egg=bigcodebench[generate]" --upgrade
 ```
 
 </div>

@@ -34,10 +28,8 @@ pip install "git+https://github.com/bigcode-project/bigcodebench.git#egg=bigcode
 git clone https://github.com/bigcode-project/bigcodebench.git
 cd bigcodebench
 export PYTHONPATH=$PYTHONPATH:$(pwd)
-# Install to use bigcodebench.evaluate
+# Install to use bigcodebench
 pip install -e .
-# Install to use bigcodebench.generate
-pip install -e .[generate]
 ```
 
 </div>
@@ -71,10 +63,10 @@ Below are all the arguments for `bigcodebench.evaluate` for the remote evaluation
 - `--tokenizer_legacy`: Whether to use the legacy tokenizer, default to `False`
 - `--samples`: The path to the generated samples file, default to `None`
 - `--local_execute`: Whether to execute the samples locally, default to `False`
-- `--remote_execute_api`: The API endpoint for remote execution, default to `https://bigcode-bigcodebench-evaluator.hf.space/`, you can also use your own Gradio API endpoint by cloning the [bigcodebench-evaluator](https://github.com/bigcode-project/bigcodebench-evaluator) repo and check `Use via API` at the bottom of the HF space page.
+- `--remote_execute_api`: The API endpoint for remote execution, default to `https://bigcode-bigcodebench-evaluator.hf.space/`. You can also use your own Gradio API endpoint by cloning the [bigcodebench-evaluator](https://huggingface.co/spaces/bigcode/bigcodebench-evaluator) repo and checking `Use via API` at the bottom of the HF space page.
 - `--pass_k`: The `k` in `Pass@k`, default to `[1, 5, 10]`, e.g. `--pass_k 1,5,10` will evaluate `Pass@1`, `Pass@5` and `Pass@10`
 - `--save_pass_rate`: Whether to save the pass rate to a file, default to `True`
-- `--parallel`: The number of parallel processes, default to `None`, e.g. `--parallel 10` will evaluate 10 samples in parallel
+- `--parallel`: The number of parallel processes, default to `-1` (use half of the CPU cores), e.g. `--parallel 10` will evaluate 10 samples in parallel
 - `--min_time_limit`: The minimum time limit for the execution, default to `1`, e.g. `--min_time_limit 10` will evaluate the samples with at least 10 seconds
 - `--max_as_limit`: The maximum address space limit for the execution, default to `30*1024` (30 GB), e.g. `--max_as_limit 20*1024` will evaluate the samples with at most 20 GB
 - `--max_data_limit`: The maximum data segment limit for the execution, default to `30*1024` (30 GB), e.g. `--max_data_limit 20*1024` will evaluate the samples with at most 20 GB
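The `--pass_k` argument reports Pass@k scores. The conventional unbiased Pass@k estimator (from the Codex paper) is sketched below; whether `bigcodebench.evaluate` uses exactly this form internally is an assumption.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k: the probability that at least one of k samples
    drawn without replacement from n generations (c of them correct) passes."""
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with `n=10` generations of which `c=2` pass, `pass_at_k(10, 2, 1)` gives 0.2.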
@@ -111,7 +103,7 @@ bigcodebench.generate \
 ```
 
 >
-The generated code samples will be stored in a file named `[model_name]--bigcodebench-[instruct|complete]--[backend]-[temp]-[n_samples].jsonl`. Alternatively, you can use the following command to utilize our pre-built docker images for generating code samples:
+The generated code samples will be stored in a file named `[model_name]--bigcodebench-[instruct|complete]--[backend]-[temp]-[n_samples]-sanitized_calibrated.jsonl`. Alternatively, you can use the following command to utilize our pre-built docker images for generating code samples:
 >
 
 ```bash
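For scripting around the generated files, the documented naming scheme can be reproduced with a small helper. This is purely illustrative: `output_filename` is a hypothetical helper, not part of the bigcodebench API, and it only mirrors the pattern stated in the docs.

```python
def output_filename(model_name: str, split: str, backend: str,
                    temp: float, n_samples: int) -> str:
    # Mirrors the documented pattern:
    # [model_name]--bigcodebench-[instruct|complete]--[backend]-[temp]-[n_samples]-sanitized_calibrated.jsonl
    return (f"{model_name}--bigcodebench-{split}--"
            f"{backend}-{temp}-{n_samples}-sanitized_calibrated.jsonl")
```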

Docker/Gradio.Dockerfile

Lines changed: 1 addition & 1 deletion

@@ -7,7 +7,7 @@ RUN apt-get update && apt-get install -y git g++ python3-tk zip unzip procps r-b
 # upgrade to latest pip
 RUN pip install --upgrade pip
 
-RUN pip install APScheduler==3.10.1 black==23.11.0 click==8.1.3 huggingface-hub>=0.18.0 plotly python-dateutil==2.8.2 gradio-space-ci@git+https://huggingface.co/spaces/Wauplin/gradio-space-ci@0.2.3 isort ruff gradio[oauth]==4.31.0 gradio_leaderboard==0.0.11 schedule==1.2.2
+RUN pip install APScheduler==3.10.1 black==23.11.0 click==8.1.3 huggingface-hub>=0.18.0 plotly python-dateutil==2.8.2 gradio-space-ci@git+https://huggingface.co/spaces/Wauplin/gradio-space-ci@0.2.3 isort ruff gradio[oauth] schedule==1.2.2
 
 # Add a new user "bigcodebenchuser"
 RUN adduser --disabled-password --gecos "" bigcodebenchuser

README.md

Lines changed: 12 additions & 5 deletions

@@ -20,11 +20,11 @@
 <a href="#-quick-start">🔥 Quick Start</a> •
 <a href="#-remote-evaluation">🚀 Remote Evaluation</a> •
 <a href="#-llm-generated-code">💻 LLM-generated Code</a> •
-<a href="#-advanced-usage">📜 Advanced Usage</a> •
-<a href="#-citation">🙏 Acknowledgement</a>
+<a href="#-citation">📜 Citation</a>
 </p>
 
 ## 📰 News
+- **[2024-10-06]** We are releasing `bigcodebench==v0.2.0`!
 - **[2024-10-05]** We create a public code execution API on the [Hugging Face space](https://huggingface.co/spaces/bigcode/bigcodebench-evaluator).
 - **[2024-10-01]** We have evaluated 139 models on BigCodeBench-Hard so far. Take a look at the [leaderboard](https://huggingface.co/spaces/bigcode/bigcodebench-leaderboard)!
 - **[2024-08-19]** To make the evaluation fully reproducible, we add a real-time code execution session to the leaderboard. It can be viewed [here](https://huggingface.co/spaces/bigcode/bigcodebench-leaderboard).

@@ -48,6 +48,10 @@
 
 BigCodeBench is an **_easy-to-use_** benchmark for solving **_practical_** and **_challenging_** tasks via code. It aims to evaluate the true programming capabilities of large language models (LLMs) in a more realistic setting. The benchmark is designed for HumanEval-like function-level code generation tasks, but with much more complex instructions and diverse function calls.
 
+There are two splits in BigCodeBench:
+- `Complete`: This split is designed for code completion based on the comprehensive docstrings.
+- `Instruct`: This split works for the instruction-tuned and chat models only, where the models are asked to generate a code snippet based on natural language instructions. The instructions contain only the necessary information and require more complex reasoning.
+
 ### Why BigCodeBench?
 
 BigCodeBench focuses on task automation via code generation with *diverse function calls* and *complex instructions*, with:

@@ -61,7 +65,7 @@ To get started, please first set up the environment:
 
 ```bash
 # By default, you will use the remote evaluation API to execute the output samples.
-pip install bigcodebench[generate] --upgrade
+pip install bigcodebench --upgrade
 
 # You are suggested to use `flash-attn` for generating code samples.
 pip install packaging ninja

@@ -75,7 +79,7 @@ pip install flash-attn --no-build-isolation
 
 ```bash
 # Install to use bigcodebench.generate
-pip install "git+https://github.com/bigcode-project/bigcodebench.git#egg=bigcodebench[generate]" --upgrade
+pip install "git+https://github.com/bigcode-project/bigcodebench.git" --upgrade
 ```
 
 </div>

@@ -85,6 +89,9 @@ pip install "git+https://github.com/bigcode-project/bigcodebench.git#egg=bigcode
 ## 🚀 Remote Evaluation
 
 We use the greedy decoding as an example to show how to evaluate the generated code samples via remote API.
+> [!Warning]
+>
+> To ease the generation, we use batch inference by default. However, the batch inference results could vary across *batch sizes* and *versions*, at least for the vLLM backend. If you want more deterministic results for greedy decoding, please set `--bs` to `1`.
 
 > [!Note]
 >

@@ -136,7 +143,7 @@ export GOOGLE_API_KEY=<your_google_api_key>
 ## 💻 LLM-generated Code
 
 We share pre-generated code samples from LLMs we have [evaluated](https://huggingface.co/spaces/bigcode/bigcodebench-leaderboard):
-* See the attachment of our [v0.1.5](https://github.com/bigcode-project/bigcodebench/releases/tag/v0.1.5). We include both `sanitized_samples.zip` and `sanitized_samples_calibrated.zip` for your convenience.
+* See the attachment of our [v0.2.0.post3](https://github.com/bigcode-project/bigcodebench/releases/tag/v0.2.0.post3). We include `sanitized_samples_calibrated.zip` for your convenience.
 
 ## Advanced Usage
 

bigcodebench/evaluate.py

Lines changed: 2 additions & 2 deletions

@@ -119,7 +119,7 @@ def evaluate(
     remote_execute_api: str = "https://bigcode-bigcodebench-evaluator.hf.space/",
     pass_k: str = "1,5,10",
     save_pass_rate: bool = True,
-    parallel: int = None,
+    parallel: int = -1,
     min_time_limit: float = 1,
     max_as_limit: int = 30*1024,
     max_data_limit: int = 30*1024,

@@ -167,7 +167,7 @@ def evaluate(
 
     pass_k = [int(k) for k in pass_k.split(",")]
 
-    if parallel is None:
+    if parallel < 1:
         n_workers = max(1, multiprocessing.cpu_count() // 2)
     else:
         n_workers = parallel
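The switch from `None` to `-1` keeps `parallel` a plain `int` throughout, with any value below 1 acting as an "auto" sentinel. A minimal standalone sketch of the fallback logic shown in the diff:

```python
import multiprocessing

def resolve_workers(parallel: int = -1) -> int:
    # Values below 1 mean "auto": use half of the available CPU cores,
    # but never fewer than one worker.
    if parallel < 1:
        return max(1, multiprocessing.cpu_count() // 2)
    return parallel
```

So `resolve_workers(10)` forces 10 workers, while the default `-1` (or any non-positive value) falls back to `cpu_count() // 2`.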

run.sh

Lines changed: 1 addition & 2 deletions

@@ -9,5 +9,4 @@ bigcodebench.evaluate \
   --model $MODEL \
   --split $SPLIT \
   --subset $SUBSET \
-  --backend $BACKEND \
-  --tp $NUM_GPU
+  --backend $BACKEND

setup.cfg

Lines changed: 1 addition & 4 deletions

@@ -29,9 +29,6 @@ install_requires =
     wget>=3.2
     datasets
    gradio-client
-
-[options.extras_require]
-generate =
     vllm
     numpy
     rich

@@ -48,4 +45,4 @@ console_scripts =
     bigcodebench.syncheck = bigcodebench.syncheck:main
     bigcodebench.legacy_sanitize = bigcodebench.legacy_sanitize:main
     bigcodebench.generate = bigcodebench.generate:main
-    bigcodebench.inspect = bigcodebench.inspect:main
+    bigcodebench.inspect = bigcodebench.inspect:main
