
Commit d1cb2fe

Merge branch 'main' of https://github.com/bigcode-project/bigcodebench into marianna
2 parents: 96aafc0 + 19ca466

6 files changed: +106 −807 lines

Docker/Evaluate.Dockerfile

Lines changed: 3 additions & 1 deletion
@@ -31,4 +31,6 @@ RUN chmod -R 777 /app
 
 USER bigcodebenchuser
 
-ENTRYPOINT ["python3", "-m", "bigcodebench.evaluate"]
+ENTRYPOINT ["python3", "-m", "bigcodebench.evaluate"]
+
+CMD ["sh", "-c", "pids=$(ps -u $(id -u) -o pid,comm | grep 'bigcodebench' | awk '{print $1}'); if [ -n \"$pids\" ]; then echo $pids | xargs -r kill; fi; rm -rf /tmp/*"]

README.md

Lines changed: 22 additions & 13 deletions
@@ -4,25 +4,30 @@
 </center>
 
 <p align="center">
+<a href="https://huggingface.co/spaces/bigcode/bigcodebench-leaderboard"><img src="https://img.shields.io/badge/🤗&nbsp&nbsp%F0%9F%8F%86-leaderboard-%23ff8811"></a>
+<a href="https://huggingface.co/collections/bigcode/bigcodebench-666ed21a5039c618e608ab06"><img src="https://img.shields.io/badge/🤗-collection-pink"></a>
+<a href="https://bigcode-bench.github.io/"><img src="https://img.shields.io/badge/%F0%9F%8F%86-website-8A2BE2"></a>
+<a href="https://arxiv.org/abs/2406.15877"><img src="https://img.shields.io/badge/arXiv-2406.15877-b31b1b.svg"></a>
 <a href="https://pypi.org/project/bigcodebench/"><img src="https://img.shields.io/pypi/v/bigcodebench?color=g"></a>
+<a href="https://pepy.tech/project/bigcodebench"><img src="https://static.pepy.tech/badge/bigcodebench"></a>
+<a href="https://github.com/bigcodebench/bigcodebench/blob/master/LICENSE"><img src="https://img.shields.io/pypi/l/bigcodebench"></a>
 <a href="https://hub.docker.com/r/bigcodebench/bigcodebench-evaluate" title="Docker-Eval"><img src="https://img.shields.io/docker/image-size/bigcodebench/bigcodebench-evaluate"></a>
 <a href="https://hub.docker.com/r/bigcodebench/bigcodebench-generate" title="Docker-Gen"><img src="https://img.shields.io/docker/image-size/bigcodebench/bigcodebench-generate"></a>
-<a href="https://github.com/bigcodebench/bigcodebench/blob/master/LICENSE"><img src="https://img.shields.io/pypi/l/bigcodebench"></a>
 </p>
 
 <p align="center">
 <a href="#-about">🌸About</a> •
 <a href="#-quick-start">🔥Quick Start</a> •
-<a href="#-llm-generated-code">💻LLM code</a> •
-<a href="#-failure-inspection">🔍Failure inspection</a> •
+<a href="#-failure-inspection">🔍Failure Inspection</a> •
 <a href="#-full-script">🚀Full Script</a> •
 <a href="#-result-analysis">📊Result Analysis</a> •
-<a href="#-known-issues">🐞Known issues</a> •
+<a href="#-llm-generated-code">💻LLM-generated Code</a> •
+<a href="#-known-issues">🐞Known Issues</a> •
 <a href="#-citation">📜Citation</a> •
 <a href="#-acknowledgement">🙏Acknowledgement</a>
 </p>
 
-## About
+## 🌸 About
 
 ### BigCodeBench
 
@@ -249,6 +254,10 @@ Then, run the evaluation:
 bigcodebench.evaluate --subset [complete|instruct] --samples samples-sanitized-calibrated.jsonl
 # ...If you really don't want to check the ground truths
 bigcodebench.evaluate --subset [complete|instruct] --samples samples-sanitized-calibrated.jsonl --no-gt
+
+# You are strongly recommended to use the following command to clean up the environment after evaluation:
+pids=$(ps -u $(id -u) -o pid,comm | grep '^ *[0-9]\+ bigcodebench' | awk '{print $1}'); if [ -n "$pids" ]; then echo $pids | xargs -r kill; fi;
+rm -rf /tmp/*
 ```
 
 > [!Tip]
@@ -298,23 +307,23 @@ Here are some tips to speed up the evaluation:
 </div>
 </details>
 
-## Failure Inspection
+## 🔍 Failure Inspection
 
 You can inspect the failed samples by using the following command:
 
 ```bash
 bigcodebench.inspect --eval-results sample-sanitized-calibrated_eval_results.json --in-place
 ```
 
-## Full Script
+## 🚀 Full Script
 
 We provide a sample script to run the full pipeline:
 
 ```bash
 bash run.sh
 ```
 
-## Result Analysis
+## 📊 Result Analysis
 
 We provide a script to replicate the analysis like Elo Rating and Task Solve Rate, which helps you understand the performance of the models further.
 
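As a rough illustration of the Elo Rating mentioned in the Result Analysis section (not the implementation in `analysis/get_results.py`; the model names and outcomes below are made up), a pairwise Elo update over per-task wins could look like:

```python
from collections import defaultdict

def update_elo(rating_a, rating_b, score_a, k=32):
    """Standard Elo update: score_a is 1.0 if model A wins the task, 0.0 if it loses."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Made-up per-task pass/fail outcomes for two models (1 = pass, 0 = fail).
results = {
    "model_a": {"BigCodeBench/0": 1, "BigCodeBench/1": 0},
    "model_b": {"BigCodeBench/0": 0, "BigCodeBench/1": 0},
}

ratings = defaultdict(lambda: 1000.0)
for task_id in results["model_a"]:
    a, b = results["model_a"][task_id], results["model_b"][task_id]
    if a == b:  # both pass or both fail: no information for this pair, skip
        continue
    ratings["model_a"], ratings["model_b"] = update_elo(
        ratings["model_a"], ratings["model_b"], score_a=1.0 if a > b else 0.0
    )

print(dict(ratings))  # e.g. {'model_a': 1016.0, 'model_b': 984.0}
```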
@@ -331,7 +340,7 @@ python get_results.py
 We share pre-generated code samples from LLMs we have [evaluated](https://huggingface.co/spaces/bigcode/bigcodebench-leaderboard):
 * See the attachment of our [v0.1.5](https://github.com/bigcode-project/bigcodebench/releases/tag/v0.1.5). We include both `sanitized_samples.zip` and `sanitized_samples_calibrated.zip` for your convenience.
 
-## Known Issues
+## 🐞 Known Issues
 
 - [ ] Due to the flakes in the evaluation, the execution results may vary slightly (~0.2%) between runs. We are working on improving the evaluation stability.
 
@@ -343,10 +352,10 @@ We share pre-generated code samples from LLMs we have [evaluated](https://huggin
 
 ```bibtex
 @article{zhuo2024bigcodebench,
-  title={BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions},
-  author={Terry Yue Zhuo and Minh Chien Vu and Jenny Chim and Han Hu and Wenhao Yu and Ratnadira Widyasari and Imam Nur Bani Yusuf and Haolan Zhan and Junda He and Indraneil Paul and Simon Brunner and Chen Gong and Thong Hoang and Armel Randy Zebaze and Xiaoheng Hong and Wen-Ding Li and Jean Kaddour and Ming Xu and Zhihan Zhang and Prateek Yadav and Naman Jain and Alex Gu and Zhoujun Cheng and Jiawei Liu and Qian Liu and Zijian Wang and David Lo and Binyuan Hui and Niklas Muennighoff and Daniel Fried and Xiaoning Du and Harm de Vries and Leandro Von Werra},
-  journal={arXiv preprint arXiv:2406.15877},
-  year={2024}
+  title={BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions},
+  author={Zhuo, Terry Yue and Vu, Minh Chien and Chim, Jenny and Hu, Han and Yu, Wenhao and Widyasari, Ratnadira and Yusuf, Imam Nur Bani and Zhan, Haolan and He, Junda and Paul, Indraneil and others},
+  journal={arXiv preprint arXiv:2406.15877},
+  year={2024}
 }
 ```

analysis/get_results.py

Lines changed: 13 additions & 6 deletions
@@ -144,6 +144,7 @@ def split_gen():
 
 def read_task_perf(task="complete"):
     model_results = dict()
+    result_files = []
     for model, info in model_info.items():
         if task == "instruct" and (not info["prompted"] or info["name"] in ["Granite-Code-3B-Instruct", "Granite-Code-8B-Instruct"]):
             continue
@@ -164,13 +165,14 @@ def read_task_perf(task="complete"):
         except:
             continue
 
+        result_files.append(file)
         with open(file, "r") as f:
             data = json.load(f)
         for task_id, perfs in data["eval"].items():
             status = 1 if perfs[0]["status"] == "pass" else 0
             task_perf[task_id] = status
         model_results[info["name"]] = task_perf
-    return model_results
+    return model_results, result_files
 
 
 def get_winner_df(data_dict, task, task_level=True, no_tie=True):
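For context on what `read_task_perf` expects: each per-model result file is a JSON document whose `"eval"` mapping goes from task ID to a list of attempts, and a task counts as solved when the first attempt's `status` is `"pass"`. A tiny made-up example of that shape (the contents are hypothetical):

```python
# Hypothetical contents of one per-model eval-results JSON file, in the shape
# read_task_perf consumes: "eval" maps task IDs to a list of attempts.
example = {
    "eval": {
        "BigCodeBench/0": [{"status": "pass"}],
        "BigCodeBench/1": [{"status": "fail"}],
    }
}

task_perf = {}
for task_id, perfs in example["eval"].items():
    # A task counts as solved when the first attempt passes.
    task_perf[task_id] = 1 if perfs[0]["status"] == "pass" else 0

print(task_perf)  # {'BigCodeBench/0': 1, 'BigCodeBench/1': 0}
```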
@@ -267,9 +269,6 @@ def get_solve_rate(data_dict, task="complete"):
         for task_id in range(1140):
             task_solve_count[f"BigCodeBench/{task_id}"].append(task_perf[f"BigCodeBench/{task_id}"])
     solve_rate = {task_id: round(np.mean(perfs) * 100, 1) for task_id, perfs in task_solve_count.items()}
-    with open(f"{task}_solve_rate.txt", "w") as f:
-        f.write(f"Number of unsolved tasks: {sum([1 for task_id, solve_rate in solve_rate.items() if solve_rate == 0])}\n")
-        f.write(f"Number of fully solved tasks: {sum([1 for task_id, solve_rate in solve_rate.items() if solve_rate == 100])}\n")
     return Dataset.from_dict({"task_id": list(solve_rate.keys()), "solve_rate": list(solve_rate.values())})
 
 
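The retained solve-rate line computes, per task, the percentage of evaluated models that pass it. A quick worked example with made-up outcomes:

```python
import numpy as np

# Hypothetical pass/fail outcomes for one task across four models (1 = pass, 0 = fail).
perfs = [1, 1, 1, 0]
print(round(np.mean(perfs) * 100, 1))  # 3 of 4 models pass -> 75.0
```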

@@ -313,8 +312,16 @@ def push_ds(ds, path, local=False):
 
 model_info = update_model_info(model_info)
 results = get_results()
-complete_data = read_task_perf("complete")
-instruct_data = read_task_perf("instruct")
+files = []
+complete_data, complete_files = read_task_perf("complete")
+instruct_data, instruct_files = read_task_perf("instruct")
+files.extend(complete_files)
+files.extend(instruct_files)
+shutil.rmtree("eval_results", ignore_errors=True)
+os.makedirs("eval_results", exist_ok=True)
+for file in files:
+    shutil.copy(file, "eval_results")
+
 complete_solve_rate = get_solve_rate(complete_data, task="complete")
 instruct_solve_rate = get_solve_rate(instruct_data, task="instruct")
 solve_rate_ds = DatasetDict({"complete": complete_solve_rate, "instruct": instruct_solve_rate})
