WIP: Refactor Course Lab Benchmark #27
Draft: tareknaser wants to merge 9 commits into `main` from `refactor_course_lab_bench`.
Commits (9, by tareknaser):
- `1e84d46` wip: course lab benchmark rework
- `ace20cc` docs(course_lab_bench): update task instructions to include course me…
- `4acb95f` docs(pyproject.toml): update author information
- `f6c9f23` fix(docker): go PATH for login shells in Docker environment
- `de6959a` feat(executor): retry mechanism for evaluation script to handle flaky…
- `6678826` docs(courselab_bench): add a note on previous labs reference implemen…
- `7d7f696` feat(courselab_bench): modify system prompt to emphasize focus on cur…
- `b7e9b5e` feat(courselab_bench): add config option to add starter files
- `fa1fb00` feat(courselab_bench): add validation for starter and output files in…
**`.env.toml.example`** (new file):
```toml
# LLM API Keys Configuration
# Copy this file to .env.toml and fill in your API keys
# LiteLLM will automatically use these environment variables

# OpenAI
# OPENAI_API_KEY = "sk-..."
# OPENAI_BASE_URL = "https://api.openai.com/v1"  # Optional: custom endpoint

# Anthropic
# ANTHROPIC_API_KEY = "sk-ant-..."

# Azure OpenAI
# AZURE_API_KEY = "..."
# AZURE_API_BASE = "https://YOUR_RESOURCE.openai.azure.com"
# AZURE_API_VERSION = "2024-02-15-preview"
```
**`.gitignore`** (new file):
```gitignore
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
*.egg-info/
dist/
build/
.eggs/

# Virtual environments
.venv/
venv/
ENV/
env/

# Testing
.pytest_cache/
.coverage
htmlcov/
*.cover

# IDE
.vscode/
.idea/
*.swp
*.swo
*~

# Outputs (don't commit results)
outputs/
*.log

# Secrets (don't commit API keys)
configs/*secret*.yaml
.env
.env.toml

# OS
.DS_Store
Thumbs.db
data/tasks.jsonl
```
**`README.md`** (new file):
# Course Lab Benchmark

A benchmark for evaluating AI agents on systems programming labs. Agents run in Docker containers and are evaluated on their ability to complete course lab assignments.

We include a simple ReAct agent inspired by [mini-swe-agent](https://github.com/AUTOMATIC/mini-swe-agent).
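For readers new to the pattern: a ReAct agent alternates between model-generated reasoning/actions and environment observations. The sketch below is a generic illustration of that loop, not this repository's implementation; `query_model`, the `TASK_COMPLETE` signal, and the last-line-is-a-command convention are assumptions of the sketch.

```python
import subprocess

def query_model(messages: list[dict]) -> str:
    """Stand-in for an LLM call (e.g. through LiteLLM)."""
    raise NotImplementedError

def react_loop(task: str, max_steps: int = 50) -> list[dict]:
    # The message history carries the full think -> act -> observe trace.
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = query_model(messages)
        messages.append({"role": "assistant", "content": reply})
        if "TASK_COMPLETE" in reply:  # completion signal (convention of this sketch)
            break
        # Sketch convention: the model's final line is a shell command to run.
        command = reply.strip().splitlines()[-1]
        result = subprocess.run(command, shell=True, capture_output=True, text=True)
        messages.append(
            {"role": "user", "content": f"Observation:\n{result.stdout}{result.stderr}"}
        )
    return messages
```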
## Quick Start

Provide API keys for your chosen model provider: copy `.env.toml.example` to `.env.toml` and fill in your keys. We use LiteLLM for model access.

```bash
pip install -e .

# Prepare the dataset (generates data/tasks.jsonl from the tasks in data/)
python prepare_dataset.py

# Run all tasks
python run_benchmark.py
```
## Usage

```bash
python run_benchmark.py \
    --tasks data/tasks.jsonl \
    --model anthropic/claude-sonnet-4-5-20250929 \
    --max-steps 50 \
    --max-cost 20.0
```

## Output

Each run creates a directory with a single `results.json` file:

```json
{
  "config": { "model": "...", "max_steps": 50, ... },
  "summary": {
    "total": 10,
    "passed": 8,
    "success_rate": 0.8,
    "total_cost": 0.234,
    "by_course": { "mit_6_5840_2024": { "total": 10, "passed": 8, ... } }
  },
  "results": [
    {
      "instance_id": "test__simple__echo",
      "passed": true,
      "agent_status": "completed",
      "test_output": "PASS: ...",
      "test_exit_code": 0,
      "duration_seconds": 12.5,
      "model_cost": 0.0033
    }
  ]
}
```

Detailed agent trajectories are saved in `trajectories/{instance_id}.jsonl`.
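The trajectory schema is not shown in this PR; purely as an illustration, a JSONL trajectory typically stores one agent step per line, along these lines (field names hypothetical, content elided):

```json
{"step": 1, "thought": "…", "action": "ls", "observation": "…"}
```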
## Task Structure

Tasks are organized in a folder hierarchy:

```
data/
└── course_id/
    └── task_id/
        ├── config.json      # Task metadata
        ├── task.md          # Problem statement
        ├── preprocess.sh    # Setup script (runs before agent)
        ├── evaluate.sh      # Evaluation script (determines pass/fail)
        └── starter_files/   # Optional: files to copy to container
            └── ...
```

### config.json

Required fields:

- `instance_id`: Unique identifier (e.g., `"test__simple__echo"`)
- `course_id`: Course identifier (e.g., `"test_course"`)
- `docker_image`: Docker image to use (e.g., `"xuafeng/swe-go-python:latest"`)

Optional fields:

- `timeout_minutes`: Maximum execution time (default: 30)
- `tags`: List of topic tags
- `repo_url`: Git repository to clone
- `base_commit`: Git commit to check out
- `starter_files`: List of files to copy from the `starter_files/` directory into the container (`src` is relative to `starter_files/`, `dest` is an absolute path in the container)
- `output_files`: List of files to copy from the container to the output directory after the agent completes (`src` is an absolute path in the container, `dest` is relative to the output directory)

A sketch of a complete `config.json` follows this list.
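Putting the fields together, a `config.json` might look like the following. The values are illustrative (placeholder repository URL and commit), and the `{src, dest}` entry shape for `starter_files`/`output_files` is inferred from the field descriptions above:

```json
{
  "instance_id": "mit_6_5840_2024__4a_kvraft",
  "course_id": "mit_6_5840_2024",
  "docker_image": "xuafeng/swe-go-python:latest",
  "timeout_minutes": 30,
  "tags": ["raft", "kv"],
  "repo_url": "https://github.com/OWNER/REPO.git",
  "base_commit": "0123abcd",
  "starter_files": [
    { "src": "client.go", "dest": "/workspace/src/kvraft/client.go" }
  ],
  "output_files": [
    { "src": "/workspace/src/kvraft/server.go", "dest": "kvraft/server.go" }
  ]
}
```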
### task.md

Markdown file containing the problem statement given to the agent.

### preprocess.sh

Shell script that runs before the agent starts. Use this to:

- Set up the environment
- Create checksums of files that shouldn't be modified

Exit with code 0 on success, non-zero on failure. A minimal sketch follows.
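A minimal `preprocess.sh` sketch, assuming a hypothetical `/workspace` working directory and Go lab layout, that records checksums for later verification:

```bash
#!/usr/bin/env bash
# Hypothetical setup script: the working directory and paths are illustrative.
set -euo pipefail
cd /workspace

# Record checksums of test files the agent must not modify;
# evaluate.sh can later verify them with `sha256sum -c`.
sha256sum src/kvraft/*_test.go > /tmp/protected.sha256
```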
### evaluate.sh

Runs after the agent completes. Exit 0 for PASS, non-zero for FAIL.
Print verbose output for debugging (it is captured in the results).

> The evaluation script is automatically retried up to 3 times or until it succeeds. This helps handle flaky tests or non-deterministic timeouts, which are common in some systems programming labs.
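A matching `evaluate.sh` sketch (again with hypothetical paths and test command) that checks the protected files recorded by the `preprocess.sh` sketch above and then runs the lab's tests:

```bash
#!/usr/bin/env bash
# Hypothetical evaluation script: paths and test command are illustrative.
cd /workspace

# Fail if the agent modified any protected test file.
if ! sha256sum -c /tmp/protected.sha256; then
    echo "FAIL: protected files were modified"
    exit 1
fi

# Run the lab's tests verbosely; the output is captured as test_output.
if go test ./src/kvraft/... -v; then
    echo "PASS"
    exit 0
else
    echo "FAIL: tests failed"
    exit 1
fi
```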
### Example Task

See `data/test_course/test__simple__echo/` for a minimal example, or `data/mit_6_5840_2024/4a_kvraft/` for an example using `starter_files` and `output_files`.

## Adding New Tasks

1. If you are adding tasks for a new course, first add a new entry with the course metadata to [`/data/courses.json`](./data/courses.json)
2. Create a new folder `data/{course_id}/{task_id}/` (where `{course_id}` matches the entry in `courses.json`)
3. Add the four required files for the task: `config.json`, `task.md`, `preprocess.sh`, `evaluate.sh`
4. (Optional) Create a `starter_files/` directory and add the files that should be copied into the container
5. (Optional) Configure `starter_files` and `output_files` in `config.json`
6. Make the scripts executable (see the sketch after this list)
7. Run `python prepare_dataset.py` to regenerate `tasks.jsonl`
8. Run the benchmark
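Steps 6–8 in shell form, for a hypothetical task directory `data/my_course/my_task/`:

```bash
# Make the task scripts executable
chmod +x data/my_course/my_task/preprocess.sh data/my_course/my_task/evaluate.sh

python prepare_dataset.py   # regenerate data/tasks.jsonl
python run_benchmark.py     # run the benchmark, including the new task
```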
**`courselab_bench/__init__.py`** (new file):
```python
__version__ = "0.1.0"

from courselab_bench.agent import REACTAgent
from courselab_bench.environment import DockerEnvironment
from courselab_bench.model import LiteLLMModel
from courselab_bench.data import load_tasks
from courselab_bench.runner import execute_task, save_trajectory
from courselab_bench.evaluation import evaluate_task, compute_summary

__all__ = [
    "REACTAgent",
    "DockerEnvironment",
    "LiteLLMModel",
    "load_tasks",
    "execute_task",
    "save_trajectory",
    "evaluate_task",
    "compute_summary",
]
```
**`courselab_bench/agent/__init__.py`** (new file):
```python
from courselab_bench.agent.react import REACTAgent

__all__ = ["REACTAgent"]
```