1 change: 1 addition & 0 deletions .github/workflows/test.yml
@@ -19,6 +19,7 @@ jobs:
benchmark:
- example_bench
- course_exam_bench
- courselab_bench
# TODO: For now, we comment out other benchmarks as they have no tests
# - arteval_bench
# - cache_bench
8 changes: 5 additions & 3 deletions benchmarks/course_lab_bench/go-python.Dockerfile
@@ -25,11 +25,13 @@ RUN apt-get update && apt-get install -y wget tar git build-essential \

ENV PATH="/usr/local/go/bin:${PATH}"

RUN python --version && go version

SHELL ["/bin/bash", "-c"]
# This is where pipx installs things
ENV PATH="$PATH:/root/.local/bin/"
ENV PATH="$PATH:/root/.local/bin/"

# Write PATH to profile files so it's available in login shells (bash -lc)
RUN echo 'export PATH="/usr/local/go/bin:/root/.local/bin:$PATH"' >> /etc/profile && \
echo 'export PATH="/usr/local/go/bin:/root/.local/bin:$PATH"' >> /root/.bashrc

RUN python --version && go version

17 changes: 17 additions & 0 deletions benchmarks/courselab_bench/.env.toml.example
@@ -0,0 +1,17 @@
# LLM API Keys Configuration
# Copy this file to .env.toml and fill in your API keys
# LiteLLM will automatically use these environment variables


# OpenAI
# OPENAI_API_KEY = "sk-..."
# OPENAI_BASE_URL = "https://api.openai.com/v1" # Optional: custom endpoint

# Anthropic
# ANTHROPIC_API_KEY = "sk-ant-..."

# Azure OpenAI
# AZURE_API_KEY = "..."
# AZURE_API_BASE = "https://YOUR_RESOURCE.openai.azure.com"
# AZURE_API_VERSION = "2024-02-15-preview"

43 changes: 43 additions & 0 deletions benchmarks/courselab_bench/.gitignore
@@ -0,0 +1,43 @@
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
*.egg-info/
dist/
build/
.eggs/

# Virtual environments
.venv/
venv/
ENV/
env/

# Testing
.pytest_cache/
.coverage
htmlcov/
*.cover

# IDE
.vscode/
.idea/
*.swp
*.swo
*~

# Outputs (don't commit results)
outputs/
*.log

# Secrets (don't commit API keys)
configs/*secret*.yaml
.env
.env.toml

# OS
.DS_Store
Thumbs.db
data/tasks.jsonl
127 changes: 127 additions & 0 deletions benchmarks/courselab_bench/README.md
@@ -0,0 +1,127 @@
# Course Lab Benchmark

A benchmark for evaluating AI agents on systems programming labs. Each agent runs in a Docker container and is scored on its ability to complete course lab assignments.

We include a simple ReAct agent inspired by [mini-swe-agent](https://github.com/SWE-agent/mini-swe-agent).

## Quick Start

Provide API keys for your chosen model provider: copy `.env.toml.example` to `.env.toml` and fill in your keys. Model access goes through LiteLLM.

```bash
pip install -e .

# Prepare dataset (This will generate data/tasks.jsonl using the tasks in data/)
python prepare_dataset.py

# Run all tasks
python run_benchmark.py
```

## Usage

```bash
python run_benchmark.py \
--tasks data/tasks.jsonl \
--model anthropic/claude-sonnet-4-5-20250929 \
--max-steps 50 \
--max-cost 20.0
```

## Output

Each run creates an output directory containing a `results.json` file:

```json
{
"config": { "model": "...", "max_steps": 50, ... },
"summary": {
"total": 10,
"passed": 8,
"success_rate": 0.8,
"total_cost": 0.234,
"by_course": { "mit_6_5840_2024": { "total": 10, "passed": 8, ... } }
},
"results": [
{
"instance_id": "test__simple__echo",
"passed": true,
"agent_status": "completed",
"test_output": "PASS: ...",
"test_exit_code": 0,
"duration_seconds": 12.5,
"model_cost": 0.0033
}
]
}
```

Detailed agent trajectories are saved in `trajectories/{instance_id}.jsonl`.

## Task Structure

Tasks are organized in a folder hierarchy:

```
data/
└── course_id/
└── task_id/
├── config.json # Task metadata
├── task.md # Problem statement
├── preprocess.sh # Setup script (runs before agent)
├── evaluate.sh # Evaluation script (determines pass/fail)
└── starter_files/ # Optional: files to copy to container
└── ...
```

### config.json

Required fields:

- `instance_id`: Unique identifier (e.g., `"test__simple__echo"`)
- `course_id`: Course identifier (e.g., `"test_course"`)
- `docker_image`: Docker image to use (e.g., `"xuafeng/swe-go-python:latest"`)

Optional fields:

- `timeout_minutes`: Maximum execution time (default: 30)
- `tags`: List of topic tags
- `repo_url`: Git repository to clone
- `base_commit`: Git commit to checkout
- `starter_files`: List of files to copy from `starter_files/` directory to container (`src` is relative to `starter_files/`, `dest` is absolute path in container)
- `output_files`: List of files to copy from container to output directory after agent completes (`src` is absolute path in container, `dest` is relative to output directory)
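
Putting these together, an illustrative `config.json` might look like the sketch below. The `instance_id`, `course_id`, and `docker_image` values are taken from the examples above; the `src`/`dest` object shape for `starter_files` and `output_files` is inferred from the field descriptions, and the concrete paths and tags are placeholders.

```json
{
  "instance_id": "test__simple__echo",
  "course_id": "test_course",
  "docker_image": "xuafeng/swe-go-python:latest",
  "timeout_minutes": 30,
  "tags": ["shell"],
  "starter_files": [
    { "src": "echo.sh", "dest": "/workspace/echo.sh" }
  ],
  "output_files": [
    { "src": "/workspace/echo.sh", "dest": "echo.sh" }
  ]
}
```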

### task.md

Markdown file containing the problem statement given to the agent.

### preprocess.sh

Shell script that runs before the agent starts. Use this to:

- Set up the environment
- Create checksums of files that shouldn't be modified

Exit with code 0 on success, non-zero on failure.
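
A minimal sketch of a `preprocess.sh`, assuming the lab is checked out at a hypothetical `/workspace` directory and that files under `tests/` must not be modified:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical lab root inside the container.
cd /workspace

# Record checksums of files the agent must not modify,
# so evaluate.sh can check them later.
find tests -type f -exec sha256sum {} + > /tmp/protected.sha256

exit 0
```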

### evaluate.sh

Runs after the agent completes. Exit 0 for PASS, non-zero for FAIL.
Print verbose output for debugging (captured in results).

> The evaluation script is automatically retried up to 3 times, stopping at the first successful evaluation. This helps handle the flaky tests and non-deterministic timeouts common in some systems programming labs.
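
A minimal `evaluate.sh` sketch, assuming a Go-based lab rooted at the same hypothetical `/workspace` and the checksum file written by the `preprocess.sh` sketch above:

```bash
#!/usr/bin/env bash
set -uo pipefail

# Hypothetical lab root inside the container.
cd /workspace

# Fail if any protected file was modified (see the preprocess.sh sketch).
if ! sha256sum --check --quiet /tmp/protected.sha256; then
    echo "FAIL: protected files were modified"
    exit 1
fi

# Run the lab's tests verbosely; this output is captured in the results.
if go test -v ./...; then
    echo "PASS: all tests passed"
    exit 0
else
    echo "FAIL: tests failed"
    exit 1
fi
```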

### Example Task

See `data/test_course/test__simple__echo/` for a minimal example, or `data/mit_6_5840_2024/4a_kvraft/` for an example using `starter_files` and `output_files`.

## Adding New Tasks

1. If you are adding tasks for a new course, first add a new entry to [`/data/courses.json`](./data/courses.json) with the course metadata
2. Create a new folder: `data/{course_id}/{task_id}/` (where `{course_id}` matches the entry in `courses.json`)
3. Add the four required files for each task: `config.json`, `task.md`, `preprocess.sh`, and `evaluate.sh`
4. (Optional) Create a `starter_files/` directory and add files that should be copied to the container
5. (Optional) Configure `starter_files` and `output_files` in `config.json`
6. Make the scripts executable (see the sketch after this list)
7. Run `python prepare_dataset.py` to regenerate `tasks.jsonl`
8. Run the benchmark
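
For steps 6–8, a minimal sketch (the `my_course`/`my_task` path is a placeholder for your course and task IDs):

```bash
# 6. Make the task scripts executable
chmod +x data/my_course/my_task/preprocess.sh data/my_course/my_task/evaluate.sh

# 7. Regenerate tasks.jsonl from the task folders
python prepare_dataset.py

# 8. Run the benchmark against the regenerated task list
python run_benchmark.py --tasks data/tasks.jsonl
```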
19 changes: 19 additions & 0 deletions benchmarks/courselab_bench/courselab_bench/__init__.py
@@ -0,0 +1,19 @@
__version__ = "0.1.0"

from courselab_bench.agent import REACTAgent
from courselab_bench.environment import DockerEnvironment
from courselab_bench.model import LiteLLMModel
from courselab_bench.data import load_tasks
from courselab_bench.runner import execute_task, save_trajectory
from courselab_bench.evaluation import evaluate_task, compute_summary

__all__ = [
"REACTAgent",
"DockerEnvironment",
"LiteLLMModel",
"load_tasks",
"execute_task",
"save_trajectory",
"evaluate_task",
"compute_summary",
]
3 changes: 3 additions & 0 deletions benchmarks/courselab_bench/courselab_bench/agent/__init__.py
@@ -0,0 +1,3 @@
from courselab_bench.agent.react import REACTAgent

__all__ = ["REACTAgent"]