**ADVANCED_USAGE.md** (52 additions, 16 deletions)
## 🚀 Remote Evaluation
Below are all the arguments for `bigcodebench.evaluate` for the remote evaluation (an example invocation follows the lists):

#### Required Arguments:

- `--model`: The model to evaluate
- `--split`: The split of the dataset to evaluate
- `--subset`: The subset of the dataset to evaluate

#### Optional Arguments:

- `--root`: The root directory for storing the results, defaults to `bcb_results`
- `--bs`: The batch size, defaults to `1`
- `--n_samples`: The number of samples, defaults to `1`
- `--temperature`: The temperature, defaults to `0.0`
- `--max_new_tokens`: The maximum number of new tokens, defaults to `1280`
- `--greedy`: Whether to use greedy decoding, defaults to `False`
- `--strip_newlines`: Whether to strip newlines, defaults to `False`; set to `True` to strip newlines for some model series such as StarCoder2
- `--direct_completion`: Whether to use direct completion, defaults to `False`
- `--resume`: Whether to resume the evaluation, defaults to `True`; set to `False` to re-run the evaluation
- `--id_range`: The range of tasks to evaluate, defaults to `None`; e.g. `--id_range 10,20` evaluates the tasks from 10 to 20
- `--backend`: The backend to use, defaults to `vllm`
- `--base_url`: The base URL of the backend for OpenAI-compatible APIs, defaults to `None`
- `--tp`: The tensor parallel size for the vLLM backend, defaults to `1`
- `--trust_remote_code`: Whether to trust remote code, defaults to `False`
- `--tokenizer_name`: The name of a customized tokenizer, defaults to `None`
- `--tokenizer_legacy`: Whether to use the legacy tokenizer, defaults to `False`
- `--samples`: The path to the generated samples file, defaults to `None`
- `--local_execute`: Whether to execute the samples locally, defaults to `False`
- `--remote_execute_api`: The API endpoint for remote execution, defaults to `https://bigcode-bigcodebench-evaluator.hf.space/`; you can also use your own Gradio API endpoint by cloning the [bigcodebench-evaluator](https://github.com/bigcode-project/bigcodebench-evaluator) repo and checking `Use via API` at the bottom of the HF Space page
- `--pass_k`: The `k` in `Pass@k`, defaults to `[1, 5, 10]`; e.g. `--pass_k 1,5,10` evaluates `Pass@1`, `Pass@5`, and `Pass@10`
- `--save_pass_rate`: Whether to save the pass rate to a file, defaults to `True`
- `--parallel`: The number of parallel processes, defaults to `None`; e.g. `--parallel 10` evaluates 10 samples in parallel
- `--min_time_limit`: The minimum time limit for execution in seconds, defaults to `1`; e.g. `--min_time_limit 10` gives each sample at least 10 seconds
- `--max_as_limit`: The maximum address space limit for execution in MB, defaults to `30*1024` (30 GB); e.g. `--max_as_limit 20*1024` caps the address space at 20 GB
- `--max_data_limit`: The maximum data segment limit for execution in MB, defaults to `30*1024` (30 GB); e.g. `--max_data_limit 20*1024` caps the data segment at 20 GB
- `--max_stack_limit`: The maximum stack limit for execution in MB, defaults to `10`; e.g. `--max_stack_limit 20` caps the stack at 20 MB
- `--check_gt_only`: Whether to only check the ground truths, defaults to `False`
- `--no_gt`: Whether to skip checking the ground truths, defaults to `False`
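For illustration, here is one possible remote evaluation invocation that combines the required arguments with a few of the optional flags above. This is a sketch, not a definitive recipe: the model name is a hypothetical placeholder, and all flags and values are taken from the list above or from commands shown elsewhere in this document.

```bash
# Sketch of a remote evaluation run; the model name below is a placeholder.
# With --greedy there is no need to set --temperature or --n_samples.
bigcodebench.evaluate \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --split complete \
    --subset hard \
    --backend vllm \
    --greedy \
    --pass_k 1,5,10
```
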
## 🚀 Full Script
We provide an example script to run the full pipeline for the remote evaluation:
```bash
bash run.sh
```
## 🚀 Local Generation
```bash
# when greedy, there is no need for temperature and n_samples
bigcodebench.generate \
  ...
[--base_url [base_url]] \
[--tokenizer_name [tokenizer_name]]
```
The generated code samples will be stored in a file named `[model_name]--bigcodebench-[instruct|complete]--[backend]-[temp]-[n_samples].jsonl`. Alternatively, you can use the following command to generate code samples with our pre-built Docker images:
```bash
# If you are using GPUs
docker run --gpus '"device=$CUDA_VISIBLE_DEVICES"' -v $(pwd):/app -t bigcodebench/bigcodebench-generate:latest \
  ...
```

```bash
bigcodebench.inspect --eval_results sample-sanitized-calibrated_eval_results.json --split complete --subset hard --in_place
```
## 📊 Result Analysis

We provide a script to replicate analyses such as Elo Rating and Task Solve Rate, which helps you further understand the performance of the models.

```bash
cd analysis
python get_results.py
```
## 💻 LLM-generated Code
We share pre-generated code samples from LLMs we have [evaluated](https://huggingface.co/spaces/bigcode/bigcodebench-leaderboard):
* See the attachments of our [v0.1.5 release](https://github.com/bigcode-project/bigcodebench/releases/tag/v0.1.5). We include both `sanitized_samples.zip` and `sanitized_samples_calibrated.zip` for your convenience.
## 🐞 Resolved Issues
- [x] Due to [the Hugging Face tokenizer update](https://github.com/huggingface/transformers/pull/31305), some tokenizers may be broken and will degrade evaluation performance. We therefore initialize them with `legacy=False`. If you notice unexpected behavior, please try `--tokenizer_legacy` during generation.
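If a tokenizer regression affects your model, below is a minimal sketch of re-running generation with the legacy tokenizer behavior. It assumes `bigcodebench.generate` accepts the same `--model`, `--split`, and `--subset` flags documented above for `bigcodebench.evaluate`; the model name is a placeholder.

```bash
# Sketch only: re-run generation with the legacy tokenizer behavior.
# The --model/--split/--subset flags are assumed to mirror bigcodebench.evaluate.
bigcodebench.generate \
    --model bigcode/starcoder2-15b \
    --split complete \
    --subset hard \
    --tokenizer_legacy
```
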
**README.md** (2 additions, 0 deletions)

```bash
bigcodebench.evaluate \
  ...
--tp [TENSOR_PARALLEL_SIZE] \
--greedy
```
- All the resulting files will be stored in a folder named `bcb_results`.
- The generated code samples will be stored in a file named `[model_name]--bigcodebench-[instruct|complete]--[backend]-[temp]-[n_samples]-sanitized_calibrated.jsonl`.
- The evaluation results will be stored in a file named `[model_name]--bigcodebench-[instruct|complete]--[backend]-[temp]-[n_samples]-sanitized_calibrated_eval_results.json`.
- The pass@k results will be stored in a file named `[model_name]--bigcodebench-[instruct|complete]--[backend]-[temp]-[n_samples]-sanitized_calibrated_pass_at_k.json`.
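
For illustration, here is the kind of layout you might find under `bcb_results` after a greedy run with the vLLM backend. The model name is a hypothetical placeholder; only the naming patterns above are taken from the documentation.

```bash
# Hypothetical contents of bcb_results for a greedy instruct-split run (temperature 0.0, 1 sample).
ls bcb_results
# meta-llama--Llama-3.1-8B-Instruct--bigcodebench-instruct--vllm-0.0-1-sanitized_calibrated.jsonl
# meta-llama--Llama-3.1-8B-Instruct--bigcodebench-instruct--vllm-0.0-1-sanitized_calibrated_eval_results.json
# meta-llama--Llama-3.1-8B-Instruct--bigcodebench-instruct--vllm-0.0-1-sanitized_calibrated_pass_at_k.json
```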