9 changes: 5 additions & 4 deletions README.md
@@ -17,10 +17,11 @@ The benchmark framework is **still under development**. If you have any question

System Intelligence Benchmark currently includes the following example benchmarks. Each benchmark assesses specific capabilities across multiple levels within a given research direction. Some benchmarks are still under development — we're actively updating them. Stay tuned!

- **System Exam Benchmark** ([benchmarks/course_exam_bench/](benchmarks/course_exam_bench/)) - Tests LLM understanding of system concepts through university course exams (54 questions across 4 exams)
- **System Lab Benchmark** ([benchmarks/course_lab_bench/](benchmarks/course_lab_bench/)) - Assesses AI capability on practical system course labs and projects
- **System Artifact Benchmark** ([benchmarks/arteval_bench/](benchmarks/arteval_bench/)) - Evaluates AI performance on artifact evaluation
- **System Modeling Benchmark** ([benchmarks/sysmobench/](benchmarks/sysmobench/)) - Evaluates an agent's ability to produce correct TLA+ models for real-world concurrent and distributed systems, covering system capabilities across system comprehension, abstraction, and potentially tool fluency.
- **System Exam Benchmark** ([benchmarks/course_exam_bench/](benchmarks/course_exam_bench/)) [[WHY](benchmarks/course_exam_bench/WHY.md)] - Tests LLM understanding of system concepts through university course exams (54 questions across 4 exams)
- **System Lab Benchmark** ([benchmarks/course_lab_bench/](benchmarks/course_lab_bench/)) [[WHY](benchmarks/course_lab_bench/WHY.md)] - Assesses AI capability on practical system course labs and projects
- **System Artifact Benchmark** ([benchmarks/arteval_bench/](benchmarks/arteval_bench/)) [[WHY](benchmarks/arteval_bench/WHY.md)] - Evaluates AI performance on artifact evaluation
- **System Modeling Benchmark** ([benchmarks/sysmobench/](benchmarks/sysmobench/)) [[WHY](benchmarks/sysmobench/WHY.md)] - Evaluates an agent's ability to produce correct TLA+ models for real-world concurrent and distributed systems, covering system capabilities across system comprehension, abstraction, and potentially tool fluency.
- **Cache Algorithm Benchmark** ([benchmarks/cache_algo_bench/](benchmarks/cache_algo_bench/)) [[WHY](benchmarks/cache_algo_bench/WHY.md)] - Evaluates AI ability to design and implement efficient cache replacement policies optimized for diverse real-world workloads
- **Example Benchmark** ([benchmarks/example_bench/](benchmarks/example_bench/)) - Template and reference implementation for creating new benchmarks

## Quick Start
64 changes: 64 additions & 0 deletions benchmarks/cache_algo_bench/WHY.md
@@ -0,0 +1,64 @@
# Why Cache Algorithm Benchmark?

The Cache Algorithm Benchmark evaluates AI agents on their ability to design and implement efficient cache replacement policies—a fundamental optimization problem in storage systems, distributed computing, and system architecture. Unlike benchmarks that test implementation of known algorithms, this benchmark challenges agents to discover novel caching strategies optimized for diverse real-world workloads, testing both algorithmic reasoning and performance optimization capabilities.

## Goals and Objectives

Cache replacement policies directly impact system performance across databases, web servers, CDNs, and distributed storage. Traditional policies (LRU, LFU, ARC) work well for common access patterns but may perform poorly on specialized workloads. This benchmark evaluates whether AI agents can:

1. **Analyze Workload Characteristics**: Understand access patterns from real-world traces (Alibaba Storage, TencentBlock, Zipf distributions)
2. **Design Custom Eviction Strategies**: Create policies that minimize miss rates by exploiting workload-specific patterns
3. **Balance Trade-offs**: Optimize for cache hit rate while maintaining reasonable computational overhead
4. **Iterate and Refine**: Improve policies through multiple rounds of feedback, mimicking real algorithm development

The benchmark provides six diverse workload traces representing different access patterns:
- **alibaba-storage**: Production cloud storage workload
- **tencentblock-storage**: Block-level storage access patterns
- **ra-fwe** / **ra-multikey**: Research artifacts with specific access characteristics
- **zipf**: Synthetic workload following heavy-tailed distributions
- **tmp**: Temporal locality patterns

Success requires implementing four key functions (`evict`, `update_after_hit`, `update_after_insert`, `update_after_evict`) that collectively define a coherent caching policy.
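
To make the required interface concrete, the sketch below shows how the four hooks might fit together in a simple recency-based (LRU-style) policy. This is an illustrative Python sketch only: the class structure, method signatures, and harness integration are assumptions, not the benchmark's actual API.

```python
from collections import OrderedDict

class LRUPolicy:
    """Minimal recency-based policy sketch (assumed interface, not the real harness API)."""

    def __init__(self):
        # Keys ordered from least recently used (front) to most recently used (back).
        self._order = OrderedDict()

    def evict(self):
        # Choose a victim: the least recently used key.
        victim, _ = next(iter(self._order.items()))
        return victim

    def update_after_hit(self, key):
        # A hit makes the key the most recently used.
        self._order.move_to_end(key)

    def update_after_insert(self, key):
        # Newly inserted keys start as most recently used.
        self._order[key] = True

    def update_after_evict(self, key):
        # Drop bookkeeping for the evicted key.
        self._order.pop(key, None)
```

A workload-specific policy would replace the victim choice in `evict` (and the bookkeeping in the other hooks) with logic tuned to the trace, for example frequency counters for Zipf-like workloads.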

## How This Fits Into System Intelligence

Cache algorithm design tests a unique aspect of system intelligence: **data-driven performance optimization**. This differs from other benchmarks in important ways:

- **Versus System Exam**: Moves beyond understanding existing algorithms to discovering new ones
- **Versus System Lab**: Focuses on algorithmic optimization rather than implementing specified protocols
- **Versus ArtEvalBench**: Requires designing solutions rather than reproducing existing work
- **Versus SysMoBench**: Emphasizes performance optimization over correctness verification

The benchmark specifically targets capabilities essential for practical system optimization:

**Pattern Recognition**: Identifying regularities in access traces (sequential scans, temporal locality, frequency distributions) that can be exploited

**Algorithm Design**: Translating observed patterns into concrete eviction strategies, such as:
- Recency-based policies for temporal locality
- Frequency-based policies for skewed distributions
- Hybrid approaches balancing multiple criteria
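
As one sketch of the hybrid direction, an eviction score might combine a frequency estimate with recency age. The metadata fields (`hits`, `last_access`) and the weighting below are assumptions chosen purely for illustration.

```python
import time

def eviction_score(meta, now=None, recency_weight=0.5):
    """Lower score = better eviction candidate (illustrative heuristic only).

    `meta` is assumed to track `hits` and `last_access` for each cached key.
    """
    now = now or time.monotonic()
    age = now - meta["last_access"]      # seconds since last access
    frequency = meta["hits"] + 1         # smoothed hit count
    # Hot, recently touched keys score high (kept); cold, stale keys score low.
    return frequency / (1.0 + recency_weight * age)

def pick_victim(metadata):
    # Evict the key with the lowest combined score.
    return min(metadata, key=lambda k: eviction_score(metadata[k]))
```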

**Empirical Validation**: Evaluating policies against real workloads rather than theoretical analysis, accounting for implementation complexity and runtime overhead

**Iterative Refinement**: The benchmark's three-round feedback loop mimics real algorithm development, where initial designs undergo refinement based on performance measurements

## Practical Impact

Achieving strong performance on this benchmark would demonstrate agent capabilities directly applicable to:

- **Storage System Tuning**: Customizing cache policies for specific application workloads (databases, filesystems, object stores)
- **CDN Optimization**: Designing eviction strategies tailored to content popularity distributions
- **Memory Management**: Developing page replacement algorithms adapted to application memory access patterns
- **Distributed Caching**: Optimizing cache coherence and replacement in multi-tier architectures

The benchmark's use of real production traces (Alibaba, Tencent) ensures that successful policies have immediate practical value beyond academic optimization.

## Research Connections

This benchmark also connects to broader system intelligence research themes:

- **AutoML for Systems**: Treating system optimization as a machine learning problem
- **Workload-Adaptive Systems**: Building systems that automatically tune themselves based on observed behavior
- **Performance Engineering**: Applying data-driven methods to traditional systems problems

By requiring agents to discover effective policies through experimentation rather than implementing textbook algorithms, the Cache Algorithm Benchmark tests creative problem-solving and empirical reasoning—key components of advanced system intelligence.
25 changes: 25 additions & 0 deletions benchmarks/course_exam_bench/WHY.md

@@ -0,0 +1,25 @@
# Why System Exam Benchmark?

The System Exam Benchmark evaluates whether AI agents possess foundational knowledge of core system concepts—the theoretical underpinnings necessary to reason about distributed systems, operating systems, concurrency, and fault tolerance. By testing models on real university course exams, we measure their ability to understand fundamental principles before attempting practical implementation tasks.

## Goals and Objectives

System intelligence requires more than pattern matching or code completion; it demands a deep understanding of how computing systems operate, fail, and scale. The System Exam Benchmark targets this foundational layer by presenting questions that require:

1. **Conceptual Reasoning**: Understanding distributed consensus protocols (e.g., Raft, Paxos), consistency models, and synchronization primitives
2. **Analytical Thinking**: Diagnosing failure scenarios, reasoning about race conditions, and evaluating trade-offs in system design
3. **Theoretical Knowledge**: Grasping correctness properties, performance characteristics, and fundamental limitations of system architectures

By using actual MIT course exams (6.5840 Distributed Systems, 6.1810 Operating Systems), we ensure questions reflect real educational standards and cover topics systems engineers must master. The benchmark includes single-choice, multiple-choice, true/false, and short-answer questions, allowing us to evaluate both factual recall and deeper analytical capabilities.

## How This Fits Into System Intelligence

The exam benchmark serves as a **prerequisite check** within the broader system intelligence vision. An AI agent that cannot explain why two-phase commit differs from Raft, or identify race conditions in concurrent code, will struggle with more complex tasks like debugging distributed systems, evaluating research artifacts, or designing fault-tolerant architectures.

This benchmark complements practical benchmarks (e.g., System Lab, ArtEvalBench) by:

- **Establishing Baseline Knowledge**: Verifying the model understands core concepts before applying them
- **Measuring Depth vs. Breadth**: Short-answer questions reveal whether models truly comprehend underlying mechanisms or merely memorize surface patterns
- **Providing Calibrated Comparison**: Real student performance data lets us contextualize AI capabilities against human learners

Ultimately, passing system exams demonstrates that an AI agent has internalized the conceptual foundation needed to tackle real-world system challenges—making it a critical stepping stone toward full system intelligence.
31 changes: 31 additions & 0 deletions benchmarks/course_lab_bench/WHY.md
@@ -0,0 +1,31 @@
# Why System Lab Benchmark?

The System Lab Benchmark evaluates AI agents on their ability to complete realistic, hands-on system programming assignments from university courses. Unlike exam questions that test conceptual understanding, labs require end-to-end implementation: reading complex codebases, designing concurrent algorithms, writing race-free code, and passing comprehensive test suites. This benchmark measures whether agents can translate theoretical knowledge into working system components.

## Goals and Objectives

Building real systems demands capabilities far beyond answering conceptual questions. The System Lab Benchmark targets practical system intelligence by requiring agents to:

1. **Navigate Complex Codebases**: Understand existing Go implementations of distributed systems (MapReduce, Raft, key-value stores) spanning thousands of lines
2. **Implement Distributed Algorithms**: Write correct implementations of consensus protocols, replication strategies, and fault-tolerant services
3. **Handle Concurrency**: Reason about race conditions, design thread-safe data structures, and use synchronization primitives correctly
4. **Pass Rigorous Tests**: Satisfy comprehensive test suites covering normal operation, concurrent execution, and crash recovery scenarios

By using actual MIT 6.5840 Distributed Systems labs, we ensure tasks reflect real-world complexity students encounter when learning to build production-grade systems. Success requires not just generating syntactically correct code, but producing implementations that are correct, efficient, and robust under adversarial conditions.
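
To make the concurrency requirement above concrete: the labs themselves are in Go, but the hazard they guard against is language-agnostic. The Python sketch below (purely illustrative, not lab code) shows the kind of unsynchronized read-modify-write that race detectors and stress tests catch, alongside the lock-based fix.

```python
import threading

class Counter:
    """Toy shared state; the labs' real state machines are in Go, this is only an illustration."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment_racy(self):
        # Unsynchronized read-modify-write: two threads can both read the same
        # value and both write value + 1, losing one increment.
        self.value = self.value + 1

    def increment_safe(self):
        # Holding a lock makes the read-modify-write atomic with respect to
        # other threads that use the same lock.
        with self._lock:
            self.value = self.value + 1
```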

## How This Fits Into System Intelligence

The lab benchmark represents the **bridge from theory to practice** in system intelligence. While the System Exam Benchmark tests whether agents understand distributed consensus conceptually, the System Lab Benchmark tests whether they can actually implement Raft correctly—a significantly harder challenge requiring:

- **Code Comprehension**: Reading and understanding starter code, existing interfaces, and test harnesses
- **Algorithmic Precision**: Translating protocol specifications into correct, debuggable implementations
- **Systems Thinking**: Managing state machines, handling asynchronous events, and reasoning about partial failures
- **Iterative Debugging**: Diagnosing test failures, fixing race conditions, and ensuring correctness under stress

This benchmark complements other system intelligence tasks:

- **Versus System Exam**: Moves from "can you explain Raft?" to "can you build Raft?"
- **Versus ArtEvalBench**: Focuses on creating new implementations rather than evaluating existing artifacts
- **Versus SysMoBench**: Emphasizes executable code in Go rather than formal TLA+ specifications

Completing system labs demonstrates an agent can work as a practical systems engineer—turning designs into reliable, tested implementations that handle real-world complexity. This makes it essential for achieving full system intelligence.
47 changes: 47 additions & 0 deletions benchmarks/sysmobench/WHY.md
@@ -0,0 +1,47 @@
# Why Formal System Modeling?

SysMoBench evaluates whether AI agents can translate complex, real-world concurrent and distributed systems into rigorous formal specifications using TLA+. Formal modeling is essential for verifying correctness properties of critical systems, but writing and maintaining such specifications is notoriously difficult and time-consuming. This benchmark tests whether agents can bridge the gap between implementation code and mathematical models—a key capability for building trustworthy systems.

## Goals and Objectives

Formal verification provides the strongest guarantees of system correctness, yet remains underutilized because writing formal specifications requires deep expertise. SysMoBench targets this challenge by evaluating AI agents on their ability to:

1. **Comprehend Complex Systems**: Analyze real-world source code (Rust, Go, C) implementing concurrent primitives, consensus protocols, and distributed services
2. **Abstract Critical Properties**: Identify essential behaviors while omitting implementation details irrelevant to correctness
3. **Generate Executable Specifications**: Produce syntactically correct TLA+ code that passes compilation (SANY), runs successfully (TLC), and satisfies invariants
4. **Validate Against Real Behavior**: Ensure generated specifications conform to actual system execution traces and maintain specified safety/liveness properties

The benchmark includes eleven diverse systems spanning concurrency primitives (Asterinas spinlock/mutex/rwmutex/ringbuffer), consensus protocols (Etcd Raft, Redis Raft, Xline CURP, ZooKeeper), and distributed services (PGo dqueue/locksvc/raftkvs). Success requires agents to handle systems ranging from 175 to 5,360 lines of source code and produce TLA+ specifications from 75 to 508 lines.

## How This Fits Into System Intelligence

Formal modeling represents the **highest level of system abstraction**—moving from executable code to mathematical reasoning about correctness. This capability is crucial for system intelligence because:

- **Verification at Scale**: As systems grow more complex, manual testing cannot provide exhaustive correctness guarantees; formal methods can
- **Design Before Implementation**: Modeling systems in TLA+ before writing code can catch design flaws early, when they're cheapest to fix
- **Understanding Existing Systems**: Reverse-engineering formal models from legacy code helps document assumptions, invariants, and subtle correctness properties

SysMoBench complements other system benchmarks by testing a unique combination of capabilities:

- **Versus System Exam**: Moves beyond conceptual understanding to producing executable formal specifications
- **Versus System Lab**: Requires abstraction and mathematical reasoning rather than concrete implementation
- **Versus ArtEvalBench**: Focuses on specification and verification rather than artifact reproduction
- **Versus Cache Algorithm Benchmark**: Emphasizes correctness properties over performance optimization

The benchmark's four-phase evaluation pipeline (syntax → runtime → trace conformance → invariant verification) ensures agents don't just generate plausible-looking TLA+ code, but produce specifications that:

1. **Compile Successfully**: Pass SANY type-checking and syntax validation
2. **Execute Correctly**: Run without errors or deadlocks in TLC model checker
3. **Match Real Behavior**: Conform to execution traces collected from actual system implementations
4. **Preserve Invariants**: Satisfy safety and liveness properties specific to each system
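
As a rough illustration of how the first two phases can be driven, the sketch below shells out to the standard `tla2tools.jar` entry points (SANY for parsing, TLC for model checking). The file names, harness structure, and error handling here are assumptions; the benchmark's actual pipeline, including trace conformance and invariant checks, is more involved.

```python
import subprocess

TLA_TOOLS = "tla2tools.jar"  # assumed path to the standard TLA+ tools jar

def sany_parses(spec_path: str) -> bool:
    """Phase 1: does the generated spec pass SANY parsing and level checking?"""
    result = subprocess.run(
        ["java", "-cp", TLA_TOOLS, "tla2sany.SANY", spec_path],
        capture_output=True, text=True,
    )
    # Exit-code conventions may differ across tla2tools versions.
    return result.returncode == 0

def tlc_runs(spec_path: str, config_path: str) -> bool:
    """Phase 2: does TLC explore the model without errors or deadlocks?"""
    result = subprocess.run(
        ["java", "-cp", TLA_TOOLS, "tlc2.TLC", "-config", config_path, spec_path],
        capture_output=True, text=True,
    )
    return result.returncode == 0
```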

## System Intelligence Impact

Achieving competence on SysMoBench would mark a significant milestone for AI-assisted system development. An agent that can reliably translate system implementations into TLA+ specifications could:

- **Accelerate Verification**: Reduce months of manual modeling effort to hours or days
- **Democratize Formal Methods**: Make rigorous verification accessible to engineers without specialized training
- **Improve System Reliability**: Enable verification of critical systems (filesystems, databases, distributed protocols) that currently rely primarily on testing
- **Support Incremental Development**: As systems evolve, automatically update specifications to match implementation changes

By testing agents on real-world systems rather than toy examples, SysMoBench ensures progress toward practical formal verification assistance—a critical component of building trustworthy, verifiable computing systems at scale.