
Commit 8a4b402

fix: update v0.1.8
Update on v0.1.8
2 parents 5b67995 + a47680d commit 8a4b402

File tree: 6 files changed (+34 / -39 lines)


README.md

Lines changed: 15 additions & 10 deletions
````diff
@@ -104,18 +104,21 @@ pip install -U flash-attn
 To generate code samples from a model, you can use the following command:
 >
 ```bash
+# when greedy, there is no need for temperature and n_samples
 bigcodebench.generate \
     --model [model_name] \
-    --subset [complete|instruct] \
-    --greedy \
+    --split [complete|instruct] \
+    --subset [full|hard] \
+    [--greedy] \
     --bs [bs] \
     --temperature [temp] \
     --n_samples [n_samples] \
     --resume \
     --backend [vllm|hf|openai|mistral|anthropic|google] \
     --tp [gpu_number] \
     [--trust_remote_code] \
-    [--base_url [base_url]]
+    [--base_url [base_url]] \
+    [--tokenizer_name [tokenizer_name]]
 ```
 >
 The generated code samples will be stored in a file named `[model_name]--bigcodebench-[instruct|complete]--[backend]-[temp]-[n_samples].jsonl`. Alternatively, you can use the following command to utilize our pre-built docker images for generating code samples:
@@ -124,7 +127,8 @@ The generated code samples will be stored in a file named `[model_name]--bigcode
 # If you are using GPUs
 docker run --gpus '"device=$CUDA_VISIBLE_DEVICES"' -v $(pwd):/app -t bigcodebench/bigcodebench-generate:latest \
     --model [model_name] \
-    --subset [complete|instruct] \
+    --split [complete|instruct] \
+    --subset [full|hard] \
     [--greedy] \
     --bs [bs] \
     --temperature [temp] \
@@ -136,7 +140,8 @@ docker run --gpus '"device=$CUDA_VISIBLE_DEVICES"' -v $(pwd):/app -t bigcodebenc
 # ...Or if you are using CPUs
 docker run -v $(pwd):/app -t bigcodebench/bigcodebench-generate:latest \
     --model [model_name] \
-    --subset [complete|instruct] \
+    --split [complete|instruct] \
+    --subset [full|hard] \
     [--greedy] \
     --bs [bs] \
     --temperature [temp] \
````
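A concrete invocation under the new flag layout might look like the following; the model name is only an illustrative placeholder, and every flag shown is taken from the command template above.

```bash
# Illustrative example only: the model name is a placeholder.
bigcodebench.generate \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --split instruct \
    --subset hard \
    --greedy \
    --resume \
    --backend vllm \
    --tp 1
# With --greedy, temperature, bs, and n_samples are forced to 0, 1, and 1,
# so they can be omitted (see the comment at the top of the template).
```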
````diff
@@ -233,10 +238,10 @@ You are strongly recommended to use a sandbox such as [docker](https://docs.dock
 # If you want to change the RAM address space limit (in MB, 128 GB by default): `--max-as-limit XXX`
 # If you want to change the RAM data segment limit (in MB, 4 GB by default): `--max-data-limit`
 # If you want to change the RAM stack limit (in MB, 4 MB by default): `--max-stack-limit`
-docker run -v $(pwd):/app bigcodebench/bigcodebench-evaluate:latest --subset [complete|instruct] --samples samples-sanitized-calibrated.jsonl
+docker run -v $(pwd):/app bigcodebench/bigcodebench-evaluate:latest --split [complete|instruct] --subset [full|hard] --samples samples-sanitized-calibrated.jsonl
 
 # If you only want to check the ground truths
-docker run -v $(pwd):/app bigcodebench/bigcodebench-evaluate:latest --subset [complete|instruct] --samples samples-sanitized-calibrated.jsonl --check-gt-only
+docker run -v $(pwd):/app bigcodebench/bigcodebench-evaluate:latest --split [complete|instruct] --subset [full|hard] --samples samples-sanitized-calibrated.jsonl --check-gt-only
 ```
 
 ...Or if you want to try it locally regardless of the risks ⚠️:
@@ -251,12 +256,12 @@ Then, run the evaluation:
 
 ```bash
 # ...Or locally ⚠️
-bigcodebench.evaluate --subset [complete|instruct] --samples samples-sanitized-calibrated.jsonl
+bigcodebench.evaluate --split [complete|instruct] --subset [full|hard] --samples samples-sanitized-calibrated.jsonl
 # ...If you really don't want to check the ground truths
-bigcodebench.evaluate --subset [complete|instruct] --samples samples-sanitized-calibrated.jsonl --no-gt
+bigcodebench.evaluate --split [complete|instruct] --subset [full|hard] --samples samples-sanitized-calibrated.jsonl --no-gt
 
 # You are strongly recommended to use the following command to clean up the environment after evaluation:
-pids=$(ps -u $(id -u) -o pid,comm | grep '^ *[0-9]\+ bigcodebench' | awk '{print $1}'); if [ -n "$pids" ]; then echo $pids | xargs -r kill; fi;
+pids=$(ps -u $(id -u) -o pid,comm | grep 'bigcodebench' | awk '{print $1}'); if [ -n "$pids" ]; then echo $pids | xargs -r kill; fi;
 rm -rf /tmp/*
 ```
````
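An illustrative local run of the updated evaluation flow, assuming the calibrated sample file is named `samples-sanitized-calibrated.jsonl` as in the commands above:

```bash
# Evaluate the hard subset of the complete split locally (illustrative values).
bigcodebench.evaluate --split complete --subset hard --samples samples-sanitized-calibrated.jsonl

# Then clean up stray bigcodebench processes and temporary files, as recommended above.
pids=$(ps -u $(id -u) -o pid,comm | grep 'bigcodebench' | awk '{print $1}'); if [ -n "$pids" ]; then echo $pids | xargs -r kill; fi;
rm -rf /tmp/*
```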

analysis/utils.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -640,7 +640,7 @@
     },
     "deepseek-coder": {
         "name": "DeepSeek-Coder-V2-Instruct",
-        "link": "https://www.deepseek.com/",
+        "link": "https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Instruct",
         "prompted": True,
         "moe": True,
         "size": 236,
```

bigcodebench/generate.py

Lines changed: 1 addition & 2 deletions
```diff
@@ -123,8 +123,7 @@ def main():
 
     args = parser.parse_args()
 
-    if args.greedy and (args.temperature != 0 or args.bs != 1 or args.n_samples != 1)\
-        or (args.temperature == 0 and args.n_samples == 1):
+    if args.greedy or (args.temperature == 0 and args.n_samples == 1):
         args.temperature = 0
         args.bs = 1
         args.n_samples = 1
```
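With the simplified condition, `--greedy` now forces the greedy settings unconditionally, and temperature 0 with a single sample still implies them. As a hedged illustration with a placeholder model name, these two invocations should resolve to the same configuration:

```bash
# Both commands end up with temperature=0, bs=1, n_samples=1 under the new logic.
bigcodebench.generate --model [model_name] --split complete --subset hard --backend vllm --greedy
bigcodebench.generate --model [model_name] --split complete --subset hard --backend vllm --temperature 0 --n_samples 1
```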

bigcodebench/model.py

Lines changed: 5 additions & 16 deletions
```diff
@@ -26,7 +26,6 @@
     warn("GoogleGenAI decoder will not work. Fix by `pip install google-generativeai`")
 
 import torch
-from stop_sequencer import StopSequencer
 from transformers import AutoModelForCausalLM, AutoTokenizer
 
 try:
@@ -137,7 +136,8 @@ def __init__(self, name: str, dataset: str, tp: int, **kwargs) -> None:
         self.tokenizer = AutoTokenizer.from_pretrained(self.tokenizer_name, **kwargs)
         if self.tokenizer.chat_template is None:
             self.eos += extra_eos_for_direct_completion(dataset)
-        self.llm = LLM(model=name, max_model_len=2048, tokenizer=self.tokenizer_name, **kwargs)
+        self.llm = LLM(model=name, max_model_len=2048, **kwargs)
+        self.llm.set_tokenizer(tokenizer=self.tokenizer)
 
     def is_direct_completion(self) -> bool:
         return self.tokenizer.chat_template is None
@@ -190,11 +190,11 @@ def __init__(self, name: str, dataset: str, **kwargs):
         self.skip_special_tokens = True
 
         print(f"{kwargs = }", self.tokenizer_name)
-
         if self.tokenizer_name is None:
             self.tokenizer_name = self.name
-
+
         self.tokenizer = AutoTokenizer.from_pretrained(self.tokenizer_name, **kwargs)
+
         if self.tokenizer.chat_template is None:
             self.eos += extra_eos_for_direct_completion(dataset)
 
@@ -220,18 +220,7 @@ def codegen(
             kwargs["top_p"] = 0.95
             kwargs["temperature"] = self.temperature
 
-        stop_sequencer = StopSequencer(
-            self.model,
-            model_type="causal",  # or seq2seq
-            tokenizer=self.tokenizer,
-        )
-
-        model = stop_sequencer.register_stop_texts(
-            stop_texts=self.eos,
-            input_length=input_tokens.size(-1),
-        )
-
-        outputs = model.generate(
+        outputs = self.model.generate(
             input_tokens,
             max_new_tokens=self.max_new_tokens,
             do_sample=do_sample,
```

run.sh

Lines changed: 12 additions & 9 deletions
```diff
@@ -5,7 +5,8 @@ BACKEND=openai
 TEMP=0
 N_SAMPLES=1
 NUM_GPU=1
-SUBSET=instruct
+SPLIT=complete
+SUBSET=hard
 if [[ $MODEL == *"/"* ]]; then
     ORG=$(echo $MODEL | cut -d'/' -f1)--
     BASE_MODEL=$(echo $MODEL | cut -d'/' -f2)
@@ -14,24 +15,26 @@ else
     BASE_MODEL=$MODEL
 fi
 
-FILE_HEADER=$ORG$BASE_MODEL--$DATASET-$SUBSET--$BACKEND-$TEMP-$N_SAMPLES
+if [ "$SUBSET" = "full" ]; then
+    FILE_HEADER="${ORG}${BASE_MODEL}--${DATASET}-${SPLIT}--${BACKEND}-${TEMP}-${N_SAMPLES}"
+else
+    FILE_HEADER="${ORG}${BASE_MODEL}--${DATASET}-${SUBSET}-${SPLIT}--${BACKEND}-${TEMP}-${N_SAMPLES}"
+fi
 
 echo $FILE_HEADER
 bigcodebench.generate \
-    --id_range 0 1 \
     --tp $NUM_GPU \
     --model $MODEL \
-    --bs $BS \
-    --temperature $TEMP \
-    --n_samples $N_SAMPLES \
     --resume \
+    --split $SPLIT \
     --subset $SUBSET \
-    --backend $BACKEND
+    --backend $BACKEND \
+    --greedy
 
 bigcodebench.sanitize --samples $FILE_HEADER.jsonl --calibrate
 
 # Check if the ground truth works on your machine
-bigcodebench.evaluate --subset $SUBSET --samples $FILE_HEADER-sanitized-calibrated.jsonl
+bigcodebench.evaluate --split $SPLIT --subset $SUBSET --samples $FILE_HEADER-sanitized-calibrated.jsonl
 
 # If the execution is slow:
-bigcodebench.evaluate --subset $SUBSET --samples $FILE_HEADER-sanitized-calibrated.jsonl --parallel 32
+bigcodebench.evaluate --split $SPLIT --subset $SUBSET --samples $FILE_HEADER-sanitized-calibrated.jsonl --parallel 32
```

setup.cfg

Lines changed: 0 additions & 1 deletion
```diff
@@ -37,7 +37,6 @@ generate =
     anthropic>=0.26.1
     google-generativeai>=0.5.4
     mistralai>=0.2.0
-    stop-sequencer>=1.2.3
     openai>=1.11.1
 
 [options.entry_points]
```
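With `stop-sequencer` removed from the `generate` extra, reinstalling that extra no longer pulls it in. A hedged example, assuming the distribution is published as `bigcodebench`:

```bash
# Reinstall the generate extra without the dropped stop-sequencer dependency.
pip install -U "bigcodebench[generate]"
```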
