Below are all the arguments for `bigcodebench.evaluate` for the remote evaluation; an example invocation is shown after the list:
- `--tokenizer_legacy`: Whether to use the legacy tokenizer, defaults to `False`
- `--samples`: The path to the generated samples file, defaults to `None`
- `--local_execute`: Whether to execute the samples locally, defaults to `False`
- `--remote_execute_api`: The API endpoint for remote execution, defaults to `https://bigcode-bigcodebench-evaluator.hf.space/`. You can also use your own Gradio API endpoint by cloning the [bigcodebench-evaluator](https://huggingface.co/spaces/bigcode/bigcodebench-evaluator) repo and checking `Use via API` at the bottom of the HF Space page.
- `--pass_k`: The `k` in `Pass@k`, defaults to `[1, 5, 10]`, e.g. `--pass_k 1,5,10` will evaluate `Pass@1`, `Pass@5`, and `Pass@10`
- `--save_pass_rate`: Whether to save the pass rate to a file, defaults to `True`
- `--parallel`: The number of parallel processes, defaults to `-1`, e.g. `--parallel 10` will evaluate 10 samples in parallel
- `--min_time_limit`: The minimum time limit (in seconds) for execution, defaults to `1`, e.g. `--min_time_limit 10` gives each sample at least 10 seconds to run
- `--max_as_limit`: The maximum address space limit (in MB) for execution, defaults to `30*1024` (30 GB), e.g. `--max_as_limit 20*1024` caps the address space at 20 GB
- `--max_data_limit`: The maximum data segment limit (in MB) for execution, defaults to `30*1024` (30 GB), e.g. `--max_data_limit 20*1024` caps the data segment at 20 GB
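
For example, using only the flags documented above, a remote or local evaluation run might look like the following sketch (the samples filename is purely illustrative):

```bash
# Remote execution (default): the samples are sent to the public evaluator API.
bigcodebench.evaluate \
  --samples my_model--bigcodebench-complete--vllm-0-1-sanitized_calibrated.jsonl \
  --pass_k 1,5,10 \
  --parallel 10 \
  --min_time_limit 10

# Local execution: run the samples on your own machine instead of the remote API.
bigcodebench.evaluate \
  --samples my_model--bigcodebench-complete--vllm-0-1-sanitized_calibrated.jsonl \
  --local_execute
```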

The generated code samples will be stored in a file named `[model_name]--bigcodebench-[instruct|complete]--[backend]-[temp]-[n_samples]-sanitized_calibrated.jsonl`. Alternatively, you can utilize our pre-built docker images for generating code samples.

- **[2024-10-06]** We are releasing `bigcodebench==v0.2.0`!
- **[2024-10-05]** We have created a public code execution API on the [Hugging Face Space](https://huggingface.co/spaces/bigcode/bigcodebench-evaluator).
- **[2024-10-01]** We have evaluated 139 models on BigCodeBench-Hard so far. Take a look at the [leaderboard](https://huggingface.co/spaces/bigcode/bigcodebench-leaderboard)!
- **[2024-08-19]** To make the evaluation fully reproducible, we have added a real-time code execution session to the leaderboard. It can be viewed [here](https://huggingface.co/spaces/bigcode/bigcodebench-leaderboard).
BigCodeBench is an **_easy-to-use_** benchmark for solving **_practical_** and **_challenging_** tasks via code. It aims to evaluate the true programming capabilities of large language models (LLMs) in a more realistic setting. The benchmark is designed for HumanEval-like function-level code generation tasks, but with much more complex instructions and diverse function calls.

There are two splits in BigCodeBench:
- `Complete`: This split is designed for code completion based on the comprehensive docstrings.
- `Instruct`: This split works for instruction-tuned and chat models only, where the models are asked to generate a code snippet based on natural language instructions. The instructions contain only the necessary information and require more complex reasoning.

### Why BigCodeBench?
BigCodeBench focuses on task automation via code generation with *diverse function calls* and *complex instructions*.

To get started, please first set up the environment:

```bash
# By default, you will use the remote evaluation API to execute the output samples.
pip install bigcodebench --upgrade

# We suggest using `flash-attn` for generating code samples.
```

We use greedy decoding as an example to show how to evaluate the generated code samples via the remote API.

> [!Warning]
>
> To ease generation, we use batch inference by default. However, batch inference results can vary from *batch size to batch size* and from *version to version*, at least for the vLLM backend. If you want more deterministic results for greedy decoding, please set `--bs` to `1`.
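
For a more deterministic greedy run, a minimal sketch is shown below. Apart from `--bs`, the flag names used here (`--model`, `--split`, `--backend`, `--greedy`) are assumptions about the `bigcodebench.generate` CLI rather than options documented above, and the model name is only an example:

```bash
# Sketch only: apart from --bs, these flag names are assumed, not documented above.
bigcodebench.generate \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --split instruct \
  --backend vllm \
  --greedy \
  --bs 1
```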
We share pre-generated code samples from LLMs we have [evaluated](https://huggingface.co/spaces/bigcode/bigcodebench-leaderboard):
* See the attachment of our [v0.2.0.post3](https://github.com/bigcode-project/bigcodebench/releases/tag/v0.2.0.post3). We include `sanitized_samples_calibrated.zip` for your convenience.
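
As a rough sketch, assuming the attachment is published as a regular GitHub release asset and unpacks into per-model `.jsonl` files (the inner filename below is a placeholder), the downloaded samples could be evaluated directly:

```bash
# Assumed release-asset URL layout; replace <model_samples> with an actual file name.
wget https://github.com/bigcode-project/bigcodebench/releases/download/v0.2.0.post3/sanitized_samples_calibrated.zip
unzip sanitized_samples_calibrated.zip -d sanitized_samples_calibrated
bigcodebench.evaluate --samples sanitized_samples_calibrated/<model_samples>.jsonl
```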