250 changes: 151 additions & 99 deletions README.md
@@ -1,149 +1,201 @@
# System Intelligence Benchmark: A Benchmark Suite for Evaluating LLM's System Capabilities
# How to Port an Existing Benchmark

System Intelligence Benchmark is a comprehensive benchmark suite for evaluating the performance of Large Language Models (LLMs) and AI systems across critical system capabilities. It features tutorials and example benchmarks, and offers both CLI tools and an SDK for further development.
A guide for integrating mature, independently-developed benchmarks using `SysMoBench` as an example.

## Benchmark Overview
A benchmark is a standard or point of reference against which things may be compared or assessed. In the context of AI and LLMs, benchmarks are essential for evaluating model capabilities, guiding research directions, and measuring progress.

### Benchmark Framework
## Step 1: Choose Git Integration Method

To advance benchmark development, we propose the System Intelligence Benchmark, a modular and extensible framework designed to support diverse research domains and problem types. As shown in the figure below, the framework comprises four abstractions: task set, environment, executor, and evaluator. Each task is associated with a specific environment, wherein the executor generates a solution that is subsequently assessed by the evaluator, which returns the evaluation metrics. This design enables the flexible integration of heterogeneous agents and their systematic evaluation. Additionally, the framework includes built-in executors (agents), evaluators (methodologies and grading rubrics), and tutorials. In an ideal case, users need only supply tasks that represent specific capabilities, select an evaluator, and quickly create and run a new benchmark. See [benchmark_abstract.md](doc/benchmark_abstract.md) for details.
**Git Subtree vs Submodule:**

<img src="doc/benchmark.png" alt="Benchmark framework overview" width="600"/>
When porting an existing benchmark, you need to decide how to integrate the upstream benchmark code into the framework repository (i.e., `system-intelligence-benchmark`). While both Git Subtree and Git Submodule can work, we recommend **Git Subtree** for most benchmark porting scenarios.

The benchmark framework is **still under development**. If you have any questions, feel free to open an issue or contact us directly.
**Why Subtree over Submodule for Benchmarks:**
- **Atomic commits and consistency**: Subtree keeps all code in the main repository's Git object database, avoiding state synchronization issues between the parent repo and submodule HEAD. You can modify framework code and benchmark code in a single atomic commit, ensuring consistency across the entire codebase.
- **Bidirectional sync flexibility**: `git subtree pull --squash` cleanly integrates upstream updates while maintaining repository history, and `git subtree push` enables contributing patches back to upstream.
- **Fewer gotchas**: Submodules have many edge cases that can confuse contributors (see [this detailed analysis](https://blog.timhutt.co.uk/against-submodules/)).

### Benchmarks
**When to use Subtree:**
- Benchmark is relatively stable (not updated daily)
- Repository size is acceptable (most benchmarks are <100MB)
- You want contributors to have a smooth onboarding experience

System Intelligence Benchmark currently includes the following example benchmarks. Each benchmark assesses specific capabilities across multiple levels within a given research direction. Some benchmarks are still under development — we're actively updating them. Stay tuned!
**When Submodule might be acceptable:**
- Upstream updates extremely frequently
- Benchmark codebase is very large (>500MB)
- You need strict separation between upstream and integration code
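
If you do choose Submodule for one of these reasons, the integration looks roughly like the sketch below (the upstream URL and target path are placeholders):

```bash
# Add the upstream benchmark as a submodule (placeholder URL and path)
git submodule add https://github.com/upstream/repo.git benchmarks/your_benchmark/benchmark_core

# Contributors must run this extra step after cloning the framework repo
git submodule update --init --recursive

# Later, pull in upstream updates by moving the submodule pointer
git submodule update --remote benchmarks/your_benchmark/benchmark_core
```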

- **System Exam Benchmark** ([benchmarks/course_exam_bench/](benchmarks/course_exam_bench/)) - Tests LLM understanding of system concepts through university course exams (54 questions across 4 exams)
- **System Lab Benchmark** ([benchmarks/course_lab_bench/](benchmarks/course_lab_bench/)) - Assesses AI capability on practical system course labs and projects
- **System Artifact Benchmark** ([benchmarks/arteval_bench/](benchmarks/arteval_bench/)) - Evaluates AI performance on artifact evaluation
- **System Modeling Benchmark** ([benchmarks/sysmobench/](benchmarks/sysmobench/)) - Evaluates an agent's ability to produce correct TLA+ models for real-world concurrent and distributed systems, covering system capabilities across system comprehension, abstraction, and potentially tool fluency.
- **Example Benchmark** ([benchmarks/example_bench/](benchmarks/example_bench/)) - Template and reference implementation for creating new benchmarks

## Quick Start
### Repo Structure
## Step 2: Add Upstream as Git Subtree

- **Benchmarks** (`benchmarks/`) - Contains individual benchmark implementations, each with its own source code, tests, and configuration
- **CLI Tools** (`cli/`) - Command-line interface for running benchmarks and managing evaluations
- **SDK** (`sdk/`) - Software development kit providing evaluators, LLM interfaces, and utility functions
- **Documentation** (`doc/`) - Guides and documentation for using and contributing to System Intelligence Benchmark
```bash
# Add remote
git remote add benchmark-upstream https://github.com/upstream/repo.git

### Prerequisites
# Add as subtree
git subtree add --prefix benchmarks/your_benchmark/benchmark_core \
benchmark-upstream main --squash
```

- Python 3.9+
- Docker (optional, for containerized execution)

> Docker images currently only support the x86_64/AMD64 architecture. ARM64 (Apple Silicon M1/M2/M3) is not yet supported.
## Step 3: Create Directory Structure

### Installation
```
benchmarks/your_benchmark/
├── benchmark_core/      # Git Subtree (DO NOT manually edit)
├── src/                 # Bridge layer
│   ├── main.py          # Wraps your benchmark's entry point and the benchmark-driving logic
│   ├── executor.py
│   └── evaluator.py
├── data/benchmark/
│   └── tasks.jsonl
├── env.toml             # Config template with "XXX" placeholders
├── requirements.txt     # -r benchmark_core/requirements.txt
├── install.sh
├── run.sh
└── README.md
```

1. Clone the repository:
## Step 4: Write Adapter Layer

```bash
git clone https://github.com/sys-intelligence/system-intelligence-benchmark.git
cd system-intelligence-benchmark
```
Mature benchmarks already have end-to-end execution pipelines and SDKs. However, to unify LLM/agent configuration management across the framework and improve maintainability (see [Benchmark Abstraction](benchmark_abstract.md)), we need an **adapter layer** in `benchmarks/your_benchmark/src/` to bridge the upstream benchmark with the framework.

2. Install dependencies for a specific benchmark:
### 4.1 Integrate Model Config Manager

```bash
cd cli
./install.sh
```
3. Each benchmark includes an `env.toml` file for configuration. Add your own LLM endpoint URL and key there.
The framework provides a centralized model configuration manager. There are two options for integrating it:

### Running Benchmarks
**Option 1: Replace upstream config manager (Recommended)**

#### Run All Benchmarks
Directly inject the framework's config into the upstream benchmark's initialization. This is the simplest and most stable approach.

To run all benchmarks sequentially:
```python
# src/main.py
import sys
from pathlib import Path

```bash
cd cli
./run_all_local.sh <model_name>
# Add paths
SDK_ROOT = Path(__file__).parent.parent.parent.parent
BENCHMARK_CORE = Path(__file__).parent.parent / "benchmark_core"
sys.path.insert(0, str(SDK_ROOT))
sys.path.insert(0, str(BENCHMARK_CORE))

# Inject framework config
from sdk.utils import set_llm_endpoint_from_config
set_llm_endpoint_from_config(str(Path(__file__).parent.parent / 'env.toml'))

# Now import upstream - it will use framework's LLM config
from tla_eval.config import get_configured_model
```

#### Run a Single Benchmark
**SysMoBench example**: Directly replaces upstream's `models.yaml` by setting environment variables before importing upstream modules.

To run just one benchmark locally:
**Option 2: Map framework config to upstream config**

```bash
cd benchmarks/<benchmark_name>
./install.sh # Only needed the first time
./run.sh <model_name>
If the upstream config system cannot be replaced, map the framework's config to upstream's format at runtime. Implementation depends on your specific benchmark.
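
One possible shape for this mapping, assuming `env.toml` carries an `[llm]` table with `endpoint`, `api_key`, and `model` keys (the real schema and the upstream variable names may differ), is a small bridge module that exports the framework config as environment variables before any upstream import:

```python
# src/config_bridge.py -- hypothetical helper; module, function, and
# environment-variable names are illustrative, not framework or upstream APIs.
import os
from pathlib import Path

try:
    import tomllib  # Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # older interpreters: pip install tomli


def export_llm_env(env_toml: Path) -> None:
    """Read the framework's env.toml and expose its LLM settings as
    environment variables that the upstream benchmark reads at import time."""
    with env_toml.open("rb") as f:
        cfg = tomllib.load(f)
    llm = cfg.get("llm", {})
    # Adjust the variable names to whatever benchmark_core actually expects.
    os.environ.setdefault("UPSTREAM_LLM_ENDPOINT", llm.get("endpoint", ""))
    os.environ.setdefault("UPSTREAM_LLM_API_KEY", llm.get("api_key", ""))
    os.environ.setdefault("UPSTREAM_LLM_MODEL", llm.get("model", ""))
```

Call such a helper in `src/main.py` before importing any upstream modules, mirroring the import order shown in Option 1.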


### 4.2 Separate Executor and Evaluator (Recommended)

Any benchmark can be abstracted into two sequential modules: **Executor** (generation/interaction) and **Evaluator** (scoring). Separating them improves code clarity and extensibility, and enables integrating more sophisticated executors without modifying evaluation logic.

**Executor**: Handles the generation or iterative correction workflow
- Example: SysMoBench runs multi-phase generation with iterative error correction
- Encapsulates retry logic, model calls, and intermediate outputs

**Evaluator**: Performs the final evaluation
- Example: SysMoBench runs TLC checks and verification
- Returns standardized scores and diagnostic information

> If your benchmark already has a decoupled design, you can skip this step; instead, clarify in the README.md how the framework's agent/environment/evaluator map onto your components, and add a thin wrapper in `main.py` to expose the model/agent as parameters.
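
As a rough illustration of the split (the class and method names below are illustrative, not part of the framework SDK):

```python
# src/executor.py and src/evaluator.py -- illustrative interfaces only
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class ExecutionResult:
    task_id: str
    artifact: str                          # e.g., a generated TLA+ model
    metadata: Dict[str, Any] = field(default_factory=dict)


class Executor:
    """Wraps generation and iterative correction around benchmark_core."""

    def __init__(self, model_name: str):
        self.model_name = model_name

    def run(self, task: Dict[str, Any]) -> ExecutionResult:
        # Call into the upstream generation pipeline here: retries, model
        # calls, and intermediate outputs all stay behind this interface.
        raise NotImplementedError


class Evaluator:
    """Scores an ExecutionResult, e.g., by running upstream TLC checks."""

    def evaluate(self, task: Dict[str, Any], result: ExecutionResult) -> Dict[str, Any]:
        # Return standardized scores plus diagnostic information.
        raise NotImplementedError
```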

### 4.3 Define Task Format

Convert the upstream task format to the framework's standard `tasks.jsonl` schema. This decouples task definitions from execution logic, enabling `main.py` to iterate over tasks programmatically without hardcoding task-specific details.

```jsonl
{"task_id": "task_1", "description": "...", "metadata": {}}
{"task_id": "task_2", "description": "...", "metadata": {}}
```
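
With tasks in this shape, the driving loop in `src/main.py` can stay generic. A minimal sketch, reusing the illustrative Executor/Evaluator interfaces above (the exact fields written to `result.jsonl` and `summary.json` are up to your benchmark):

```python
# src/main.py -- driving-loop sketch built on the illustrative classes above
import json
from pathlib import Path

from executor import Executor
from evaluator import Evaluator


def run_benchmark(model_name: str, tasks_path: Path, out_dir: Path) -> None:
    out_dir.mkdir(parents=True, exist_ok=True)
    executor, evaluator = Executor(model_name), Evaluator()

    results = []
    with tasks_path.open() as f:
        for line in f:
            task = json.loads(line)
            results.append(evaluator.evaluate(task, executor.run(task)))

    # Per-task details plus an aggregated summary
    with (out_dir / "result.jsonl").open("w") as f:
        for record in results:
            f.write(json.dumps(record) + "\n")
    (out_dir / "summary.json").write_text(json.dumps({"num_tasks": len(results)}))
```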

#### Output Format

Benchmarks generate standardized outputs in `cli/outputs/{benchmark_name}__{model_name}__{agent}_{timestamp}/`:
## Step 5: Complete Integration

- `result.jsonl`: Detailed evaluation results
- `summary.json`: Aggregated performance metrics
- Test-specific breakdowns and comparisons
Most remaining steps (testing, documentation, root-level integration) are identical to creating a custom benchmark. See [Creating New Benchmarks](creating_benchmark.md) for detailed guidelines.

You can find more detailed usage guides in the CLI [README.md](cli/README.md).
**Porting-specific considerations:**

## Contribute to Benchmarks
### 5.1 Manage Dependencies

We welcome community contributions to enrich existing benchmarks (e.g., by adding more exam problems to the System Exam benchmark and more system artifacts to the System Artifact and System Modeling benchmarks), to port your existing benchmarks, and, more importantly, to create new system intelligence benchmarks with our framework. See below for detailed instructions. We believe that such collective community efforts will advance AI to its next level and help realize System Intelligence, unlocking the potential of AI-driven computing system innovations. If you are interested in contributing or already have good system benchmarks, please let us know. We have set up a [Slack channel](https://join.slack.com/t/sys-intelligence/shared_invite/zt-3hpkgr2aa-NnuPxUbyHr45S89DFi_N1A) at sys-intelligence.slack.com.
Reference upstream dependencies in `requirements.txt`:

> [!NOTE]
> We suggest getting started by walking through the basic concept of an AI benchmark: [Benchmark Abstraction](doc/benchmark_abstract.md). After understanding the basic concept, you can decide whether to contribute to existing benchmarks, port an existing benchmark, or create a new benchmark.
```txt
# requirements.txt
-r benchmark_core/requirements.txt
```

### Contribute to Existing Benchmarks
The easiest way to contribute is to add more tasks to existing benchmarks. Currently, the following two are highly recommended. You can simply follow the provided guidelines to submit your data—once that’s done, you’re all set.
- **SystemExam**: If you are a professor teaching one or more courses, we highly recommend contributing **more exam problems** to SystemExam (see [this doc](https://github.com/sys-intelligence/system-intelligence-benchmark/tree/main/benchmarks/course_exam_bench#how-to-extend-the-benchmark) for step-by-step guidance).
- **SystemArtifact**: If you are a researcher submitting artifacts, or an AE chair involved in artifact evaluation, we highly recommend contributing **more system artifacts** to SystemArtifact (see [this doc](https://github.com/sys-intelligence/system-intelligence-benchmark/blob/main/benchmarks/arteval_bench/README.md) for step-by-step guidance).
### 5.2 Create install.sh

In addition, you can also help review the existing benchmarks to propose improvement ideas or directly enhance them—for example, by adding more advanced evaluators or incorporating improved metrics.
Install upstream dependencies:

### Porting Existing Benchmarks
> [!NOTE]
> See [porting_benchmark.md](doc/porting_benchmark.md) for step-by-step guidelines.
```bash
#!/bin/bash
set -e

# Install upstream system dependencies (e.g., Java for SysMoBench)
# ...

python3 -m venv .venv
source .venv/bin/activate

pip install -r requirements.txt

# Run upstream setup scripts
python3 benchmark_core/scripts/setup.py

For integrating existing, independently-developed benchmark projects while maintaining synchronization with upstream:
deactivate
```

### 5.3 Configure .gitignore

- Use Git Subtree/Submodule to incorporate upstream code
- Write a bridge layer to connect upstream evaluators with framework SDK
- Configure bidirectional sync for pulling updates and contributing fixes
Exclude upstream-generated files:

**Example:** [SysMoBench](benchmarks/sysmobench/) - ported from [SysSpecBench](https://github.com/specula-org/SysSpecBench)
```gitignore
# Exclude upstream runtime artifacts
benchmark_core/lib/
benchmark_core/output/
benchmark_core/.venv/
```

### Creating New Benchmarks
> [!NOTE]
> See [creating_benchmark.md](doc/creating_benchmark.md) for step-by-step guidelines.

To create a new benchmark, follow these steps:
1. Create a new benchmark directory in `benchmarks/`
2. Based on your specific requirements, select and copy an example benchmark as a starting point
3. Update the `src/main.py` file with your specific evaluation logic (your executor and evaluator)
4. Add test cases in the `tests/` directory
5. Update the README.md with benchmark-specific details
6. Implement `install.sh` and `run.sh` scripts
7. Update the benchmark list in `run_all_local.sh` and `run_docker.sh` if needed
### 5.4 Other Steps

## Contributing
- **Tests**: See [creating_benchmark.md - Testing](creating_benchmark.md#testing)
- **README**: Document upstream source, version, and attribution
- **Root integration**: Update `cli/run_all_local.sh`, `README.md`

This project welcomes contributions and suggestions. Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
### 5.5 Test the Integration

When you submit a pull request, a CLA bot will automatically determine whether you need to provide
a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions
provided by the bot. You will only need to do this once across all repos using our CLA.
Make sure the benchmark can be run locally in at least the two ways below:

```bash
./run.sh <model_name>
```

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.
```bash
cd cli
./run_all_local.sh <model_name>
```

## Trademarks
> Be careful with path configuration when porting the benchmark.
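
For reference, a minimal `run.sh` could look like the sketch below, assuming a `.venv` created by `install.sh` and an adapter entry point at `src/main.py` (the `--model` flag is illustrative, not a fixed interface):

```bash
#!/bin/bash
set -e

MODEL_NAME="$1"

# Reuse the environment created by install.sh
source .venv/bin/activate

# Drive the adapter layer; the flag name is illustrative
python3 src/main.py --model "$MODEL_NAME"

deactivate
```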
## Sync with Upstream

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft
trademarks or logos is subject to and must follow
[Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).
Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.
Any use of third-party trademarks or logos is subject to those third parties' policies.
**Update:**
```bash
git subtree pull --prefix benchmarks/your_benchmark/benchmark_core \
benchmark-upstream main --squash
```

**Contribute back:**
```bash
git subtree push --prefix benchmarks/your_benchmark/benchmark_core \
benchmark-upstream feature-branch
```