
Commit ed9886c

doc: add args details
1 parent d367f5b commit ed9886c

File tree

2 files changed: +54 -16 lines changed

ADVANCED_USAGE.md

Lines changed: 52 additions & 16 deletions
@@ -43,7 +43,54 @@ pip install -e .[generate]
 </div>
 </details>
 
-### 🚀 Local Generation
+## 🚀 Remote Evaluation
+
+Below are all the arguments for `bigcodebench.evaluate` when running remote evaluation:
+
+#### Required Arguments:
+- `--model`: The model to evaluate
+- `--split`: The split of the dataset to evaluate
+- `--subset`: The subset of the dataset to evaluate
+
+#### Optional Arguments:
+- `--root`: The root directory to store the results, defaults to `bcb_results`
+- `--bs`: The batch size, defaults to `1`
+- `--n_samples`: The number of samples, defaults to `1`
+- `--temperature`: The sampling temperature, defaults to `0.0`
+- `--max_new_tokens`: The maximum number of new tokens to generate, defaults to `1280`
+- `--greedy`: Whether to use greedy decoding, defaults to `False`
+- `--strip_newlines`: Whether to strip newlines, defaults to `False`; set to `True` for some model series such as StarCoder2
+- `--direct_completion`: Whether to use direct completion, defaults to `False`
+- `--resume`: Whether to resume the evaluation, defaults to `True`; set to `False` to re-run the evaluation
+- `--id_range`: The range of task IDs to evaluate, defaults to `None`; e.g. `--id_range 10,20` evaluates tasks 10 through 20
+- `--backend`: The backend to use, defaults to `vllm`
+- `--base_url`: The base URL of the backend for OpenAI-compatible APIs, defaults to `None`
+- `--tp`: The tensor parallel size for the vLLM backend, defaults to `1`
+- `--trust_remote_code`: Whether to trust remote code, defaults to `False`
+- `--tokenizer_name`: The name of a custom tokenizer, defaults to `None`
+- `--tokenizer_legacy`: Whether to use the legacy tokenizer, defaults to `False`
+- `--samples`: The path to the generated samples file, defaults to `None`
+- `--local_execute`: Whether to execute the samples locally, defaults to `False`
+- `--remote_execute_api`: The API endpoint for remote execution, defaults to `https://bigcode-bigcodebench-evaluator.hf.space/`; you can also use your own Gradio API endpoint by cloning the [bigcodebench-evaluator](https://github.com/bigcode-project/bigcodebench-evaluator) repo and checking `Use via API` at the bottom of the HF space page
+- `--pass_k`: The `k` values for `Pass@k`, defaults to `[1, 5, 10]`; e.g. `--pass_k 1,5,10` evaluates `Pass@1`, `Pass@5` and `Pass@10`
+- `--save_pass_rate`: Whether to save the pass rate to a file, defaults to `True`
+- `--parallel`: The number of parallel processes, defaults to `None`; e.g. `--parallel 10` evaluates 10 samples in parallel
+- `--min_time_limit`: The minimum time limit (in seconds) for execution, defaults to `1`; e.g. `--min_time_limit 10` gives each sample at least 10 seconds to run
+- `--max_as_limit`: The maximum address space limit (in MB) for execution, defaults to `30*1024` (30 GB); e.g. `--max_as_limit 20*1024` caps it at 20 GB
+- `--max_data_limit`: The maximum data segment limit (in MB) for execution, defaults to `30*1024` (30 GB); e.g. `--max_data_limit 20*1024` caps it at 20 GB
+- `--max_stack_limit`: The maximum stack limit (in MB) for execution, defaults to `10`; e.g. `--max_stack_limit 20` caps it at 20 MB
+- `--check_gt_only`: Whether to only check the ground truths, defaults to `False`
+- `--no_gt`: Whether to skip checking the ground truths, defaults to `False`
+
+## 🚀 Full Script
+
+We provide an example script to run the full pipeline for remote evaluation:
+
+```bash
+bash run.sh
+```
+
+## 🚀 Local Generation
 
 ```bash
 # when greedy, there is no need for temperature and n_samples
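For orientation, the arguments documented in the hunk above can be combined as in the sketch below. The values are illustrative choices rather than defaults taken from the commit, and `[model_name]` is a placeholder in the same style as the rest of the docs.

```bash
# Illustrative remote-evaluation run: greedy decoding on the "complete" split of the
# "hard" subset with the vLLM backend; execution goes through the default remote
# endpoint and all results land in the default bcb_results/ folder.
bigcodebench.evaluate \
  --model [model_name] \
  --split complete \
  --subset hard \
  --backend vllm \
  --tp 1 \
  --greedy \
  --parallel 10
```

Only `--model`, `--split`, and `--subset` are required; every other flag falls back to the defaults listed above.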
@@ -62,9 +109,11 @@ bigcodebench.generate \
   [--base_url [base_url]] \
   [--tokenizer_name [tokenizer_name]]
 ```
+
 >
 The generated code samples will be stored in a file named `[model_name]--bigcodebench-[instruct|complete]--[backend]-[temp]-[n_samples].jsonl`. Alternatively, you can use the following command to utilize our pre-built docker images for generating code samples:
 >
+
 ```bash
 # If you are using GPUs
 docker run --gpus '"device=$CUDA_VISIBLE_DEVICES"' -v $(pwd):/app -t bigcodebench/bigcodebench-generate:latest \
@@ -148,7 +197,7 @@ docker run -it --entrypoint bigcodebench.syncheck -v $(pwd):/app bigcodebench/bi
 </details>
 
 
-### Local Evaluation
+## 🚀 Local Evaluation
 
 You are strongly recommended to use a sandbox such as [docker](https://docs.docker.com/get-docker/):
 
@@ -250,14 +299,6 @@ bigcodebench.inspect --eval_results sample-sanitized-calibrated_eval_results.jso
 bigcodebench.inspect --eval_results sample-sanitized-calibrated_eval_results.json --split complete --subset hard --in_place
 ```
 
-## 🚀 Full Script
-
-We provide a sample script to run the full pipeline:
-
-```bash
-bash run.sh
-```
-
 ## 📊 Result Analysis
 
 We provide a script to replicate the analysis like Elo Rating and Task Solve Rate, which helps you understand the performance of the models further.
@@ -270,12 +311,7 @@ cd analysis
 python get_results.py
 ```
 
-## 💻 LLM-generated Code
-
-We share pre-generated code samples from LLMs we have [evaluated](https://huggingface.co/spaces/bigcode/bigcodebench-leaderboard):
-* See the attachment of our [v0.1.5](https://github.com/bigcode-project/bigcodebench/releases/tag/v0.1.5). We include both `sanitized_samples.zip` and `sanitized_samples_calibrated.zip` for your convenience.
-
-## 🐞 Known Issues
+## 🐞 Resolved Issues
 
 - [x] Due to [the Hugging Face tokenizer update](https://github.com/huggingface/transformers/pull/31305), some tokenizers may be broken and will degrade the performance of the evaluation. Therefore, we set up with `legacy=False` for the initialization. If you notice unexpected behaviors, please try `--tokenizer_legacy` during the generation.
 
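Since `--tokenizer_legacy` is a generation-time flag, a minimal sketch of the suggested workaround (using the docs' bracketed placeholders) would be:

```bash
# Re-run generation with the legacy tokenizer behavior if the default
# legacy=False initialization misbehaves for your model's tokenizer.
bigcodebench.generate \
  --model [model_name] \
  --split [split] \
  --subset [subset] \
  --tokenizer_legacy
```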

README.md

Lines changed: 2 additions & 0 deletions
@@ -100,6 +100,8 @@ bigcodebench.evaluate \
   --tp [TENSOR_PARALLEL_SIZE] \
   --greedy
 ```
+
+- All the resulting files will be stored in a folder named `bcb_results`.
 - The generated code samples will be stored in a file named `[model_name]--bigcodebench-[instruct|complete]--[backend]-[temp]-[n_samples]-sanitized_calibrated.jsonl`.
 - The evaluation results will be stored in a file named `[model_name]--bigcodebench-[instruct|complete]--[backend]-[temp]-[n_samples]-sanitized_calibrated_eval_results.json`.
 - The pass@k results will be stored in a file named `[model_name]--bigcodebench-[instruct|complete]--[backend]-[temp]-[n_samples]-sanitized_calibrated_pass_at_k.json`.
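To make the naming pattern concrete: assuming a hypothetical model called `my-model`, the `complete` split, greedy decoding (so temperature `0.0` and one sample), and the `vllm` backend, the results folder would look roughly like this; the actual names always follow the arguments you pass.

```bash
# Hypothetical contents of the results folder for the run described above.
ls bcb_results/
# my-model--bigcodebench-complete--vllm-0.0-1-sanitized_calibrated.jsonl
# my-model--bigcodebench-complete--vllm-0.0-1-sanitized_calibrated_eval_results.json
# my-model--bigcodebench-complete--vllm-0.0-1-sanitized_calibrated_pass_at_k.json
```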
