WIP: Refactor Course Lab Benchmark #27
Draft: tareknaser wants to merge 9 commits into `main` from `refactor_course_lab_bench`.
Commits (9, by tareknaser):
- `1e84d46` wip: course lab benchmark rework
- `ace20cc` docs(course_lab_bench): update task instructions to include course me…
- `4acb95f` docs(pyproject.toml): update author information
- `f6c9f23` fix(docker): go PATH for login shells in Docker environment
- `de6959a` feat(executor): retry mechanism for evaluation script to handle flaky…
- `6678826` docs(courselab_bench): add a note on previous labs reference implemen…
- `7d7f696` feat(courselab_bench): modify system prompt to emphasize focus on cur…
- `b7e9b5e` feat(courselab_bench): add config option to add starter files
- `fa1fb00` feat(courselab_bench): add validation for starter and output files in…
**`.env.toml.example`** (new file):
```toml
# LLM API Keys Configuration
# Copy this file to .env.toml and fill in your API keys
# LiteLLM will automatically use these environment variables

# OpenAI
# OPENAI_API_KEY = "sk-..."
# OPENAI_BASE_URL = "https://api.openai.com/v1"  # Optional: custom endpoint

# Anthropic
# ANTHROPIC_API_KEY = "sk-ant-..."

# Azure OpenAI
# AZURE_API_KEY = "..."
# AZURE_API_BASE = "https://YOUR_RESOURCE.openai.azure.com"
# AZURE_API_VERSION = "2024-02-15-preview"
```
**`.gitignore`** (new file):
```gitignore
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
*.egg-info/
dist/
build/
.eggs/

# Virtual environments
.venv/
venv/
ENV/
env/

# Testing
.pytest_cache/
.coverage
htmlcov/
*.cover

# IDE
.vscode/
.idea/
*.swp
*.swo
*~

# Outputs (don't commit results)
outputs/
*.log

# Secrets (don't commit API keys)
configs/*secret*.yaml
.env
.env.toml

# OS
.DS_Store
Thumbs.db
data/tasks.jsonl
```
**`README.md`** (new file):
# Course Lab Benchmark

A benchmark for evaluating AI agents on systems programming labs. Agents run in Docker containers and are evaluated on their ability to complete course lab assignments.

We include a simple ReAct agent inspired by [mini-swe-agent](https://github.com/AUTOMATIC/mini-swe-agent).
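For readers new to the pattern: a ReAct agent alternates between model-generated reasoning/actions and environment observations. The sketch below is a generic illustration of that loop, not this repository's implementation; `query_model`, the `TASK_COMPLETE` signal, and the last-line-is-a-command convention are assumptions of the sketch.

```python
import subprocess

def query_model(messages: list[dict]) -> str:
    """Stand-in for an LLM call (e.g. through LiteLLM)."""
    raise NotImplementedError

def react_loop(task: str, max_steps: int = 50) -> list[dict]:
    # The message history carries the full think -> act -> observe trace.
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = query_model(messages)
        messages.append({"role": "assistant", "content": reply})
        if "TASK_COMPLETE" in reply:  # completion signal (convention of this sketch)
            break
        # Sketch convention: the model's final line is a shell command to run.
        command = reply.strip().splitlines()[-1]
        result = subprocess.run(command, shell=True, capture_output=True, text=True)
        messages.append(
            {"role": "user", "content": f"Observation:\n{result.stdout}{result.stderr}"}
        )
    return messages
```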
## Quick Start

Provide API keys for your chosen model provider: copy `.env.toml.example` to `.env.toml` and fill in your keys. We use LiteLLM for model access.

```bash
pip install -e .

# Prepare the dataset (generates data/tasks.jsonl from the tasks in data/)
python prepare_dataset.py

# Run all tasks
python run_benchmark.py
```
## Usage

```bash
python run_benchmark.py \
    --tasks data/tasks.jsonl \
    --model anthropic/claude-sonnet-4-5-20250929 \
    --max-steps 50 \
    --max-cost 20.0
```

## Output

Each run creates a directory with a single `results.json` file:

```json
{
  "config": { "model": "...", "max_steps": 50, ... },
  "summary": {
    "total": 10,
    "passed": 8,
    "success_rate": 0.8,
    "total_cost": 0.234,
    "by_course": { "mit_6_5840_2024": { "total": 10, "passed": 8, ... } }
  },
  "results": [
    {
      "instance_id": "test__simple__echo",
      "passed": true,
      "agent_status": "completed",
      "test_output": "PASS: ...",
      "test_exit_code": 0,
      "duration_seconds": 12.5,
      "model_cost": 0.0033
    }
  ]
}
```

Detailed agent trajectories are saved in `trajectories/{instance_id}.jsonl`.
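The trajectory schema is not shown in this PR; purely as an illustration, a JSONL trajectory typically stores one agent step per line, along these lines (field names hypothetical, content elided):

```json
{"step": 1, "thought": "…", "action": "ls", "observation": "…"}
```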
## Task Structure

Tasks are organized in a folder hierarchy:

```
data/
└── course_id/
    └── task_id/
        ├── config.json      # Task metadata
        ├── task.md          # Problem statement
        ├── preprocess.sh    # Setup script (runs before agent)
        ├── evaluate.sh      # Evaluation script (determines pass/fail)
        └── starter_files/   # Optional: files to copy to container
            └── ...
```

### config.json

Required fields:

- `instance_id`: Unique identifier (e.g., `"test__simple__echo"`)
- `course_id`: Course identifier (e.g., `"test_course"`)
- `docker_image`: Docker image to use (e.g., `"xuafeng/swe-go-python:latest"`)

Optional fields:

- `timeout_minutes`: Maximum execution time (default: 30)
- `tags`: List of topic tags
- `repo_url`: Git repository to clone
- `base_commit`: Git commit to check out
- `starter_files`: List of files to copy from the `starter_files/` directory into the container (`src` is relative to `starter_files/`, `dest` is an absolute path in the container)
- `output_files`: List of files to copy from the container to the output directory after the agent completes (`src` is an absolute path in the container, `dest` is relative to the output directory)

A sketch of a complete `config.json` follows this list.
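Putting the fields together, a `config.json` might look like the following. The values are illustrative (placeholder repository URL and commit), and the `{src, dest}` entry shape for `starter_files`/`output_files` is inferred from the field descriptions above:

```json
{
  "instance_id": "mit_6_5840_2024__4a_kvraft",
  "course_id": "mit_6_5840_2024",
  "docker_image": "xuafeng/swe-go-python:latest",
  "timeout_minutes": 30,
  "tags": ["raft", "kv"],
  "repo_url": "https://github.com/OWNER/REPO.git",
  "base_commit": "0123abcd",
  "starter_files": [
    { "src": "client.go", "dest": "/workspace/src/kvraft/client.go" }
  ],
  "output_files": [
    { "src": "/workspace/src/kvraft/server.go", "dest": "kvraft/server.go" }
  ]
}
```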
### task.md

Markdown file containing the problem statement given to the agent.

### preprocess.sh

Shell script that runs before the agent starts. Use this to:

- Set up the environment
- Create checksums of files that shouldn't be modified

Exit with code 0 on success, non-zero on failure. A minimal sketch follows.
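A minimal `preprocess.sh` sketch, assuming a hypothetical `/workspace` working directory and Go lab layout, that records checksums for later verification:

```bash
#!/usr/bin/env bash
# Hypothetical setup script: the working directory and paths are illustrative.
set -euo pipefail
cd /workspace

# Record checksums of test files the agent must not modify;
# evaluate.sh can later verify them with `sha256sum -c`.
sha256sum src/kvraft/*_test.go > /tmp/protected.sha256
```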
### evaluate.sh

Runs after the agent completes. Exit 0 for PASS, non-zero for FAIL.
Print verbose output for debugging (it is captured in the results).

> The evaluation script is automatically retried up to 3 times or until it succeeds. This helps handle flaky tests or non-deterministic timeouts, which are common in some systems programming labs.
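A matching `evaluate.sh` sketch (again with hypothetical paths and test command) that checks the protected files recorded by the `preprocess.sh` sketch above and then runs the lab's tests:

```bash
#!/usr/bin/env bash
# Hypothetical evaluation script: paths and test command are illustrative.
cd /workspace

# Fail if the agent modified any protected test file.
if ! sha256sum -c /tmp/protected.sha256; then
    echo "FAIL: protected files were modified"
    exit 1
fi

# Run the lab's tests verbosely; the output is captured as test_output.
if go test ./src/kvraft/... -v; then
    echo "PASS"
    exit 0
else
    echo "FAIL: tests failed"
    exit 1
fi
```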
### Example Task

See `data/test_course/test__simple__echo/` for a minimal example, or `data/mit_6_5840_2024/4a_kvraft/` for an example using `starter_files` and `output_files`.

## Adding New Tasks

1. If you are adding tasks for a new course, first add a new entry with the course metadata to [`/data/courses.json`](./data/courses.json)
2. Create a new folder `data/{course_id}/{task_id}/` (where `{course_id}` matches the entry in `courses.json`)
3. Add the four required files for the task: `config.json`, `task.md`, `preprocess.sh`, `evaluate.sh`
4. (Optional) Create a `starter_files/` directory and add the files that should be copied into the container
5. (Optional) Configure `starter_files` and `output_files` in `config.json`
6. Make the scripts executable (see the sketch after this list)
7. Run `python prepare_dataset.py` to regenerate `tasks.jsonl`
8. Run the benchmark
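Steps 6–8 in shell form, for a hypothetical task directory `data/my_course/my_task/`:

```bash
# Make the task scripts executable
chmod +x data/my_course/my_task/preprocess.sh data/my_course/my_task/evaluate.sh

python prepare_dataset.py   # regenerate data/tasks.jsonl
python run_benchmark.py     # run the benchmark, including the new task
```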
**`courselab_bench/__init__.py`** (new file):
```python
__version__ = "0.1.0"

from courselab_bench.agent import REACTAgent
from courselab_bench.environment import DockerEnvironment
from courselab_bench.model import LiteLLMModel
from courselab_bench.data import load_tasks
from courselab_bench.runner import execute_task, save_trajectory
from courselab_bench.evaluation import evaluate_task, compute_summary

__all__ = [
    "REACTAgent",
    "DockerEnvironment",
    "LiteLLMModel",
    "load_tasks",
    "execute_task",
    "save_trajectory",
    "evaluate_task",
    "compute_summary",
]
```
**`courselab_bench/agent/__init__.py`** (new file):
```python
from courselab_bench.agent.react import REACTAgent

__all__ = ["REACTAgent"]
```