@@ -225,11 +225,10 @@ You are strongly recommended to use a sandbox such as [docker](https://docs.dock
 
 ```bash
 # Mount the current directory to the container
+# If you want to change the RAM address space limit (in MB, 128 GB by default): `--max-as-limit XXX`
+# If you want to change the RAM data segment limit (in MB, 4 GB by default): `--max-data-limit`
+# If you want to change the RAM stack limit (in MB, 4 MB by default): `--max-stack-limit`
 docker run -v $(pwd):/app bigcodebench/bigcodebench-evaluate:latest --subset [complete|instruct] --samples samples-sanitized-calibrated
-# ...Or locally ⚠️
-bigcodebench.evaluate --subset [complete|instruct] --samples samples-sanitized-calibrated
-# ...If the ground truth is working locally (due to some flaky tests)
-bigcodebench.evaluate --subset [complete|instruct] --samples samples-sanitized-calibrated --no-gt
 ```
 
 ...Or if you want to try it locally regardless of the risks ⚠️:
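
Note on the three comment lines added in this hunk: all limits are given in MB, with defaults of 128 GB (address space), 4 GB (data segment), and 4 MB (stack). The diff does not show how the evaluator enforces them. The sketch below assumes they map onto POSIX rlimits via Python's `resource` module (Unix only); `limits_in_bytes` and `apply_limits` are hypothetical helper names, not functions from the repository.

```python
import resource

MB = 1024 * 1024  # the flags are documented in MB; rlimits are expressed in bytes


def limits_in_bytes(max_as_mb=128 * 1024, max_data_mb=4 * 1024, max_stack_mb=4):
    """Convert MB values (defaults taken from the README comments) to bytes.

    The mapping of each flag to an rlimit constant is an assumption.
    """
    return {
        resource.RLIMIT_AS: max_as_mb * MB,        # --max-as-limit
        resource.RLIMIT_DATA: max_data_mb * MB,    # --max-data-limit
        resource.RLIMIT_STACK: max_stack_mb * MB,  # --max-stack-limit
    }


def apply_limits(limits):
    """Set each soft limit for the current process, keeping the hard limit."""
    for res, soft in limits.items():
        _, hard = resource.getrlimit(res)
        # A soft limit may not exceed a finite hard limit.
        if hard != resource.RLIM_INFINITY:
            soft = min(soft, hard)
        resource.setrlimit(res, (soft, hard))
```

Calling `apply_limits(limits_in_bytes(max_as_mb=65536))` in a child process before running untrusted code would cap its address space at 64 GB under these assumptions.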
@@ -245,7 +244,7 @@ Then, run the evaluation:
 ```bash
 # ...Or locally ⚠️
 bigcodebench.evaluate --subset [complete|instruct] --samples samples-sanitized-calibrated.jsonl
-# ...If the ground truth is not working locally
+# ...If you really don't want to check the ground truths
 bigcodebench.evaluate --subset [complete|instruct] --samples samples-sanitized-calibrated --no-gt
 ```
 
@@ -276,8 +275,9 @@ Reading samples...
 1140it [00:00, 1901.64it/s]
 Evaluating samples...
 100%|██████████████████████████████████████████| 1140/1140 [19:53<00:00, 6.75it/s]
-bigcodebench
-{'pass@1': 0.568}
+BigCodeBench-instruct-calibrated
+Groundtruth pass rate: 1.000
+pass@1: 0.568
 ```
 
 - The "k" includes `[1, 5, 10]` where k values `<=` the sample size will be used
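
Note on the last context line: only the k values in `[1, 5, 10]` that do not exceed the number of samples per task are reported. The sketch below illustrates that filter together with the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples for a task and c the number that pass. That the evaluator uses exactly this estimator is an assumption here, and `usable_ks` and `pass_at_k` are hypothetical names.

```python
import math


def usable_ks(n_samples, ks=(1, 5, 10)):
    """Keep only the k values that do not exceed the per-task sample count."""
    return [k for k in ks if k <= n_samples]


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one task: n samples, c of them correct."""
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

With a single sample per task, `usable_ks(1)` leaves only `k = 1`, which is consistent with the log above reporting `pass@1` alone.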