diff --git a/README.md b/README.md
index 8ea3103..045c56e 100644
--- a/README.md
+++ b/README.md
@@ -17,10 +17,11 @@ The benchmark framework is **still under development**. If you have any question
 
 System Intelligence Benchmark currently includes the following example benchmarks. Each benchmark assesses specific capabilities across multiple levels within a given research direction. Some benchmarks are still under development — we're actively updating them. Stay tuned!
 
-- **System Exam Benchmark** ([benchmarks/course_exam_bench/](benchmarks/course_exam_bench/)) - Tests LLM understanding of system concepts through university course exams (54 questions across 4 exams)
-- **System Lab Benchmark** ([benchmarks/course_lab_bench/](benchmarks/course_lab_bench/)) - Assesses AI capability on practical system course labs and projects
-- **System Artifact Benchmark** ([benchmarks/arteval_bench/](benchmarks/arteval_bench/)) - Evaluates AI performance on artifact evaluation
-- **System Modeling Benchmark** ([benchmarks/sysmobench/](benchmarks/sysmobench/)) - Evaluates an agent's ability to produce correct TLA+ models for real-world concurrent and distributed systems, covering system capabilities across system comprehension, abstraction, and potentially tool fluency.
+- **System Exam Benchmark** ([benchmarks/course_exam_bench/](benchmarks/course_exam_bench/)) [[WHY](benchmarks/course_exam_bench/WHY.md)] - Tests LLM understanding of system concepts through university course exams (54 questions across 4 exams)
+- **System Lab Benchmark** ([benchmarks/course_lab_bench/](benchmarks/course_lab_bench/)) [[WHY](benchmarks/course_lab_bench/WHY.md)] - Assesses AI capability on practical system course labs and projects
+- **System Artifact Benchmark** ([benchmarks/arteval_bench/](benchmarks/arteval_bench/)) [[WHY](benchmarks/arteval_bench/WHY.md)] - Evaluates AI performance on artifact evaluation
+- **System Modeling Benchmark** ([benchmarks/sysmobench/](benchmarks/sysmobench/)) [[WHY](benchmarks/sysmobench/WHY.md)] - Evaluates an agent's ability to produce correct TLA+ models for real-world concurrent and distributed systems, covering capabilities in system comprehension, abstraction, and, potentially, tool fluency.
+- **Cache Algorithm Benchmark** ([benchmarks/cache_algo_bench/](benchmarks/cache_algo_bench/)) [[WHY](benchmarks/cache_algo_bench/WHY.md)] - Evaluates an agent's ability to design and implement efficient cache replacement policies optimized for diverse real-world workloads
 - **Example Benchmark** ([benchmarks/example_bench/](benchmarks/example_bench/)) - Template and reference implementation for creating new benchmarks
 
 ## Quick Start
diff --git a/benchmarks/cache_algo_bench/WHY.md b/benchmarks/cache_algo_bench/WHY.md
new file mode 100644
index 0000000..8452980
--- /dev/null
+++ b/benchmarks/cache_algo_bench/WHY.md
@@ -0,0 +1,64 @@
+# Why Cache Algorithm Benchmark?
+
+The Cache Algorithm Benchmark evaluates AI agents on their ability to design and implement efficient cache replacement policies—a fundamental optimization problem in storage systems, distributed computing, and system architecture. Unlike benchmarks that test the implementation of known algorithms, this benchmark challenges agents to discover novel caching strategies optimized for diverse real-world workloads, testing both algorithmic reasoning and performance optimization capabilities.
+
+## Goals and Objectives
+
+Cache replacement policies directly impact system performance across databases, web servers, CDNs, and distributed storage. Traditional policies (LRU, LFU, ARC) work well for common access patterns but may perform poorly on specialized workloads. This benchmark evaluates whether AI agents can:
+
+1. **Analyze Workload Characteristics**: Understand access patterns from real-world traces (Alibaba Storage, TencentBlock, Zipf distributions)
+2. **Design Custom Eviction Strategies**: Create policies that minimize miss rates by exploiting workload-specific patterns
+3. **Balance Trade-offs**: Optimize for cache hit rate while maintaining reasonable computational overhead
+4. **Iterate and Refine**: Improve policies through multiple rounds of feedback, mimicking real algorithm development
+
+The benchmark provides six diverse workload traces representing different access patterns:
+- **alibaba-storage**: Production cloud storage workload
+- **tencentblock-storage**: Block-level storage access patterns
+- **ra-fwe** / **ra-multikey**: Research artifacts with specific access characteristics
+- **zipf**: Synthetic workload following heavy-tailed distributions
+- **tmp**: Temporal locality patterns
+
+Success requires implementing four key functions (`evict`, `update_after_hit`, `update_after_insert`, `update_after_evict`) that collectively define a coherent caching policy.
+
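+To make this interface concrete, the sketch below implements the four hooks as a plain LRU baseline. It is only an illustration: the `cache_snapshot` and `obj` parameters, the `obj.key` attribute, and the convention that `evict` returns a single victim key are assumptions, not the benchmark's actual harness.
+
+```python
+# Hypothetical hook signatures; the real harness may pass different arguments.
+from collections import OrderedDict
+
+_recency = OrderedDict()  # key -> None, ordered least- to most-recently used
+
+def evict(cache_snapshot, obj):
+    """Choose a victim: here, the least-recently-used key."""
+    victim_key = next(iter(_recency))
+    return victim_key
+
+def update_after_hit(cache_snapshot, obj):
+    """On a hit, mark the object as most recently used."""
+    _recency.move_to_end(obj.key)
+
+def update_after_insert(cache_snapshot, obj):
+    """After inserting a new object, start tracking its recency."""
+    _recency[obj.key] = None
+
+def update_after_evict(cache_snapshot, obj):
+    """After eviction, drop the victim's bookkeeping state."""
+    _recency.pop(obj.key, None)
+```
+
+A workload-tuned policy would swap the recency bookkeeping for whatever signal the trace rewards, for example frequency counters on Zipf-like workloads or scan detection on sequential block traces.
+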
+## How This Fits Into System Intelligence
+
+Cache algorithm design tests a unique aspect of system intelligence: **data-driven performance optimization**. This differs from other benchmarks in important ways:
+
+- **Versus System Exam**: Moves beyond understanding existing algorithms to discovering new ones
+- **Versus System Lab**: Focuses on algorithmic optimization rather than implementing specified protocols
+- **Versus ArtEvalBench**: Requires designing solutions rather than reproducing existing work
+- **Versus SysMoBench**: Emphasizes performance optimization over correctness verification
+
+The benchmark specifically targets capabilities essential for practical system optimization:
+
+**Pattern Recognition**: Identifying regularities in access traces (sequential scans, temporal locality, frequency distributions) that can be exploited
+
+**Algorithm Design**: Translating observed patterns into concrete eviction strategies, such as:
+- Recency-based policies for temporal locality
+- Frequency-based policies for skewed distributions
+- Hybrid approaches balancing multiple criteria
+
+**Empirical Validation**: Evaluating policies against real workloads rather than theoretical analysis, accounting for implementation complexity and runtime overhead
+
+**Iterative Refinement**: The benchmark's three-round feedback loop mimics real algorithm development, where initial designs undergo refinement based on performance measurements
+
+## Practical Impact
+
+Achieving strong performance on this benchmark would demonstrate agent capabilities directly applicable to:
+
+- **Storage System Tuning**: Customizing cache policies for specific application workloads (databases, filesystems, object stores)
+- **CDN Optimization**: Designing eviction strategies tailored to content popularity distributions
+- **Memory Management**: Developing page replacement algorithms adapted to application memory access patterns
+- **Distributed Caching**: Optimizing cache coherence and replacement in multi-tier architectures
+
+The benchmark's use of real production traces (Alibaba, Tencent) ensures that successful policies have immediate practical value beyond academic optimization.
+
+## Research Connections
+
+This benchmark also connects to broader system intelligence research themes:
+
+- **AutoML for Systems**: Treating system optimization as a machine learning problem
+- **Workload-Adaptive Systems**: Building systems that automatically tune themselves based on observed behavior
+- **Performance Engineering**: Applying data-driven methods to traditional systems problems
+
+By requiring agents to discover effective policies through experimentation rather than implementing textbook algorithms, the Cache Algorithm Benchmark tests creative problem-solving and empirical reasoning—key components of advanced system intelligence.
diff --git a/benchmarks/course_exam_bench/WHY.md b/benchmarks/course_exam_bench/WHY.md
new file mode 100644
index 0000000..04a2e28
--- /dev/null
+++ b/benchmarks/course_exam_bench/WHY.md
@@ -0,0 +1,25 @@
+# Why System Exam Benchmark?
+
+The System Exam Benchmark evaluates whether AI agents possess foundational knowledge of core system concepts—the theoretical underpinnings necessary to reason about distributed systems, operating systems, concurrency, and fault tolerance. By testing models on real university course exams, we measure their ability to understand fundamental principles before attempting practical implementation tasks.
+
+## Goals and Objectives
+
+System intelligence requires more than pattern matching or code completion; it demands a deep understanding of how computing systems operate, fail, and scale. The System Exam Benchmark targets this foundational layer by presenting questions that require:
+
+1. **Conceptual Reasoning**: Understanding distributed consensus protocols (e.g., Raft, Paxos), consistency models, and synchronization primitives
+2. **Analytical Thinking**: Diagnosing failure scenarios, reasoning about race conditions, and evaluating trade-offs in system design
+3. **Theoretical Knowledge**: Grasping correctness properties, performance characteristics, and fundamental limitations of system architectures
+
+By using actual MIT course exams (6.5840 Distributed Systems, 6.1810 Operating Systems), we ensure questions reflect real educational standards and cover topics systems engineers must master. The benchmark includes single-choice, multiple-choice, true/false, and short-answer questions, allowing us to evaluate both factual recall and deeper analytical capabilities.
+
+## How This Fits Into System Intelligence
+
+The exam benchmark serves as a **prerequisite check** within the broader system intelligence vision. An AI agent that cannot explain why two-phase commit differs from Raft, or identify race conditions in concurrent code, will struggle with more complex tasks like debugging distributed systems, evaluating research artifacts, or designing fault-tolerant architectures.
+
+This benchmark complements practical benchmarks (e.g., System Lab, ArtEvalBench) by:
+
+- **Establishing Baseline Knowledge**: Verifying the model understands core concepts before applying them
+- **Measuring Depth vs. Breadth**: Short-answer questions reveal whether models truly comprehend underlying mechanisms or merely memorize surface patterns
+- **Providing Calibrated Comparison**: Real student performance data lets us contextualize AI capabilities against human learners
+
+Ultimately, passing system exams demonstrates that an AI agent has internalized the conceptual foundation needed to tackle real-world system challenges—making it a critical stepping stone toward full system intelligence.
diff --git a/benchmarks/course_lab_bench/WHY.md b/benchmarks/course_lab_bench/WHY.md
new file mode 100644
index 0000000..b08869b
--- /dev/null
+++ b/benchmarks/course_lab_bench/WHY.md
@@ -0,0 +1,31 @@
+# Why System Lab Benchmark?
+
+The System Lab Benchmark evaluates AI agents on their ability to complete realistic, hands-on system programming assignments from university courses. Unlike exam questions that test conceptual understanding, labs require end-to-end implementation: reading complex codebases, designing concurrent algorithms, writing race-free code, and passing comprehensive test suites. This benchmark measures whether agents can translate theoretical knowledge into working system components.
+
+## Goals and Objectives
+
+Building real systems demands capabilities far beyond answering conceptual questions. The System Lab Benchmark targets practical system intelligence by requiring agents to:
+
+1. **Navigate Complex Codebases**: Understand existing Go implementations of distributed systems (MapReduce, Raft, key-value stores) spanning thousands of lines
+2. **Implement Distributed Algorithms**: Write correct implementations of consensus protocols, replication strategies, and fault-tolerant services
+3. **Handle Concurrency**: Reason about race conditions, design thread-safe data structures, and use synchronization primitives correctly
+4. **Pass Rigorous Tests**: Satisfy comprehensive test suites covering normal operation, concurrent execution, and crash recovery scenarios
+
+By using actual MIT 6.5840 Distributed Systems labs, we ensure tasks reflect the real-world complexity that students encounter when learning to build production-grade systems. Success requires not just generating syntactically correct code, but producing implementations that are correct, efficient, and robust under adversarial conditions.
+
+## How This Fits Into System Intelligence
+
+The lab benchmark represents the **bridge from theory to practice** in system intelligence. While the System Exam Benchmark tests whether agents understand distributed consensus conceptually, the System Lab Benchmark tests whether they can actually implement Raft correctly—a significantly harder challenge requiring:
+
+- **Code Comprehension**: Reading and understanding starter code, existing interfaces, and test harnesses
+- **Algorithmic Precision**: Translating protocol specifications into correct, debuggable implementations
+- **Systems Thinking**: Managing state machines, handling asynchronous events, and reasoning about partial failures
+- **Iterative Debugging**: Diagnosing test failures, fixing race conditions, and ensuring correctness under stress
+
+This benchmark complements other system intelligence tasks:
+
+- **Versus System Exam**: Moves from "can you explain Raft?" to "can you build Raft?"
+- **Versus ArtEvalBench**: Focuses on creating new implementations rather than evaluating existing artifacts
+- **Versus SysMoBench**: Emphasizes executable code in Go rather than formal TLA+ specifications
+
+Completing system labs demonstrates that an agent can work as a practical systems engineer—turning designs into reliable, tested implementations that handle real-world complexity. This makes it essential for achieving full system intelligence.
diff --git a/benchmarks/sysmobench/WHY.md b/benchmarks/sysmobench/WHY.md
new file mode 100644
index 0000000..4d0a720
--- /dev/null
+++ b/benchmarks/sysmobench/WHY.md
@@ -0,0 +1,47 @@
+# Why Formal System Modeling?
+
+SysMoBench evaluates whether AI agents can translate complex, real-world concurrent and distributed systems into rigorous formal specifications using TLA+. Formal modeling is essential for verifying correctness properties of critical systems, but writing and maintaining such specifications is notoriously difficult and time-consuming. This benchmark tests whether agents can bridge the gap between implementation code and mathematical models—a key capability for building trustworthy systems.
+
+## Goals and Objectives
+
+Formal verification provides the strongest guarantees of system correctness, yet remains underutilized because writing formal specifications requires deep expertise. SysMoBench targets this challenge by evaluating AI agents on their ability to:
+
+1. **Comprehend Complex Systems**: Analyze real-world source code (Rust, Go, C) implementing concurrent primitives, consensus protocols, and distributed services
+2. **Abstract Critical Properties**: Identify essential behaviors while omitting implementation details irrelevant to correctness
+3. **Generate Executable Specifications**: Produce syntactically correct TLA+ code that passes compilation (SANY), runs successfully (TLC), and satisfies invariants
+4. **Validate Against Real Behavior**: Ensure generated specifications conform to actual system execution traces and maintain specified safety/liveness properties
+
+The benchmark includes eleven diverse systems spanning concurrency primitives (Asterinas spinlock/mutex/rwmutex/ringbuffer), consensus protocols (Etcd Raft, Redis Raft, Xline CURP, ZooKeeper), and distributed services (PGo dqueue/locksvc/raftkvs). Success requires agents to handle systems ranging from 175 to 5,360 lines of source code and produce TLA+ specifications from 75 to 508 lines.
+
+## How This Fits Into System Intelligence
+
+Formal modeling represents the **highest level of system abstraction**—moving from executable code to mathematical reasoning about correctness. This capability is crucial for system intelligence because:
+
+- **Verification at Scale**: As systems grow more complex, manual testing cannot provide exhaustive correctness guarantees; formal methods can
+- **Design Before Implementation**: Modeling systems in TLA+ before writing code can catch design flaws early, when they're cheapest to fix
+- **Understanding Existing Systems**: Reverse-engineering formal models from legacy code helps document assumptions, invariants, and subtle correctness properties
+
+SysMoBench complements other system benchmarks by testing a unique combination of capabilities:
+
+- **Versus System Exam**: Moves beyond conceptual understanding to producing executable formal specifications
+- **Versus System Lab**: Requires abstraction and mathematical reasoning rather than concrete implementation
+- **Versus ArtEvalBench**: Focuses on specification and verification rather than artifact reproduction
+- **Versus Cache Algorithm Benchmark**: Emphasizes correctness properties over performance optimization
+
+The benchmark's four-phase evaluation pipeline (syntax → runtime → trace conformance → invariant verification) ensures agents don't just generate plausible-looking TLA+ code, but produce specifications that:
+
+1. **Compile Successfully**: Pass SANY type-checking and syntax validation
+2. **Execute Correctly**: Run without errors or deadlocks in the TLC model checker
+3. **Match Real Behavior**: Conform to execution traces collected from actual system implementations
+4. **Preserve Invariants**: Satisfy safety and liveness properties specific to each system
+
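+As a rough illustration of what the first two phases check, the sketch below drives the standard TLA+ command-line tools from Python. It is not SysMoBench's actual harness: the file names, jar location, and pass/fail handling are assumptions, and the trace-conformance and invariant phases involve benchmark-specific tooling beyond what is shown here.
+
+```python
+# Hypothetical driver for phases 1-2: parse with SANY, then model-check with TLC.
+# Assumes tla2tools.jar (the standard TLA+ tools) and Spec.tla/Spec.cfg exist locally.
+import subprocess
+
+TLA_TOOLS = "tla2tools.jar"  # assumed path to the TLA+ tools jar
+
+def syntax_ok(spec: str = "Spec.tla") -> bool:
+    """Phase 1: SANY parses and semantically checks the specification."""
+    result = subprocess.run(["java", "-cp", TLA_TOOLS, "tla2sany.SANY", spec],
+                            capture_output=True, text=True)
+    return result.returncode == 0
+
+def model_check_ok(spec: str = "Spec.tla", cfg: str = "Spec.cfg") -> bool:
+    """Phase 2: TLC explores the state space; invariants listed in the .cfg are checked by TLC."""
+    result = subprocess.run(["java", "-cp", TLA_TOOLS, "tlc2.TLC", "-config", cfg, spec],
+                            capture_output=True, text=True)
+    return result.returncode == 0
+
+if __name__ == "__main__":
+    if syntax_ok() and model_check_ok():
+        print("Specification parses and model-checks; trace conformance is evaluated separately.")
+```
+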
+## System Intelligence Impact
+
+Achieving competence on SysMoBench would mark a significant milestone for AI-assisted system development. An agent that can reliably translate system implementations into TLA+ specifications could:
+
+- **Accelerate Verification**: Reduce months of manual modeling effort to hours or days
+- **Democratize Formal Methods**: Make rigorous verification accessible to engineers without specialized training
+- **Improve System Reliability**: Enable verification of critical systems (filesystems, databases, distributed protocols) that currently rely primarily on testing
+- **Support Incremental Development**: As systems evolve, automatically update specifications to match implementation changes
+
+By testing agents on real-world systems rather than toy examples, SysMoBench ensures progress toward practical formal verification assistance—a critical component of building trustworthy, verifiable computing systems at scale.