We provide a sample script to run the full pipeline:
```bash
bash run.sh
```
-## Result Analysis
+## 📊 Result Analysis
We provide a script to replicate analyses such as Elo Rating and Task Solve Rate, which can help you better understand the models' performance.
@@ -331,7 +340,7 @@ python get_results.py
We share pre-generated code samples from LLMs we have [evaluated](https://huggingface.co/spaces/bigcode/bigcodebench-leaderboard):
* See the attachment of our [v0.1.5](https://github.com/bigcode-project/bigcodebench/releases/tag/v0.1.5). We include both `sanitized_samples.zip` and `sanitized_samples_calibrated.zip` for your convenience.
-## Known Issues
+## 🐞 Known Issues
- [ ] Because the evaluation is flaky, execution results may vary slightly (~0.2%) between runs. We are working on improving the evaluation stability.
@@ -343,10 +352,10 @@ We share pre-generated code samples from LLMs we have [evaluated](https://huggin
```bibtex
@article{zhuo2024bigcodebench,
-title={BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions},
-author={Terry Yue Zhuo and Minh Chien Vu and Jenny Chimand Han Hu and Wenhao Yu and Ratnadira Widyasariand Imam Nur Bani Yusuf and Haolan Zhanand Junda He and Indraneil Paul and Simon Brunner and Chen Gong and Thong Hoang and Armel Randy Zebaze and Xiaoheng Hong and Wen-Ding Li and Jean Kaddour and Ming Xu and Zhihan Zhang and Prateek Yadav and Naman Jain and Alex Gu and Zhoujun Cheng and Jiawei Liu and Qian Liu and Zijian Wang and David Lo and Binyuan Hui and Niklas Muennighoff and Daniel Fried and Xiaoning Du and Harm de Vries and Leandro Von Werra},
-journal={arXiv preprint arXiv:2406.15877},
-year={2024}
+title={BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions},
+author={Zhuo, Terry Yue and Vu, Minh Chien and Chim, Jenny and Hu, Han and Yu, Wenhao and Widyasari, Ratnadira and Yusuf, Imam Nur Bani and Zhan, Haolan and He, Junda and Paul, Indraneil and others},