Merged
5 changes: 5 additions & 0 deletions benchmarks/sregym/.gitignore
@@ -0,0 +1,5 @@
# Ignore generated artifacts (HTML files, agent graph, CSV results)
*.html
agent_graph.png
*.csv

14 changes: 14 additions & 0 deletions benchmarks/sregym/Dockerfile
**Collaborator:** I think this container might fail because the entry point is `./test.sh`. Docker runs this script as the first process, and when it finishes (yours simply terminates without running anything) the container exits as well. Also, my understanding is that any `docker run image <...>` command becomes an argument to `test.sh`, which in your case doesn't process any (again, because it exits immediately). You might want to check out an example from ArtEvalBench: https://github.com/sys-intelligence/system-intelligence-benchmark/blob/main/benchmarks/arteval_bench/Dockerfile

**Collaborator:** @Jackcuii @xuafeng we might actually want to run the Docker image to make sure? What do you think?

@@ -0,0 +1,14 @@
FROM ubuntu:24.04

WORKDIR /usr/src
COPY . .
RUN apt-get update && apt-get install -y \
build-essential \
git \
wget \
python3-pip \
python3-venv

RUN chmod +x install.sh test.sh && ./install.sh

# ENTRYPOINT ["./test.sh"]
97 changes: 97 additions & 0 deletions benchmarks/sregym/README.md
@@ -0,0 +1,97 @@
<h1>SREGym: A Benchmarking Platform for SRE Agents</h1>

[🔍Overview](#overview) | [🚀Run SREGym](#run-sregym) | [⚙️Extending](#how-to-extend-the-benchmark) | [🤝Contributing](./CONTRIBUTING.md) | [📖Docs](https://sregym.com/docs) | [![Slack](https://img.shields.io/badge/-Slack-4A154B?style=flat-square&logo=slack&logoColor=white)](https://join.slack.com/t/SREGym/shared_invite/zt-3gvqxpkpc-RvCUcyBEMvzvXaQS9KtS_w)

## Overview
SREGym is an AI-native platform for designing, developing, and evaluating AI agents for Site Reliability Engineering (SRE). The core idea is to create live system environments in which SRE agents solve real-world SRE problems. SREGym provides a comprehensive SRE benchmark suite with a wide variety of problems for evaluating SRE agents, and also for training next-generation AI agents.
<br><br>

![SREGym Overview](./sregym_core/assets/SREGymFigure.png)

SREGym is inspired by our prior work on AIOpsLab and ITBench. It is architected with AI-native usability and extensibility as first-class principles. The SREGym benchmark suite contains 86 different SRE problems: it supports all the problems from AIOpsLab and ITBench, and adds new ones such as OS-level faults, metastable failures, and concurrent failures. See our [problem set](https://sregym.com/problems) for a complete list.


This README explains how to run SREGym within the System Intelligence Framework.

For advanced use of *System Intelligence* and *SREGym*, please refer to the docs of [*System Intelligence*](https://github.com/sys-intelligence/system-intelligence-benchmark/tree/main/doc) and [*SREGym*](https://sregym.com/docs).

## Architecture Explanation

### Abstraction

SREGym has a decoupled design that complies with the *System Intelligence* philosophy.
The components of *System Intelligence* and *SREGym* correspond as follows:

The `Executor` is the agent in *SREGym*, which is decoupled from the framework functionality. We provide a baseline agent implementation in `sregym_core/clients/stratus/stratus_agent/`, which runs by default. If you want to bring your own agent, please follow the [Running Your Own Agent](https://sregym.com/docs/running-your-own-agent) guide.

The `Evaluator` corresponds to the evaluation oracles in *SREGym*, which are decoupled from the agent implementation.

*SREGym*'s `Conductor` serves as the `Environment` in *System Intelligence*.

### Task Details

- **Environment Setup**: The SREGym Conductor injects faults into the environment to induce failures.
- **Diagnosis**: The agent is asked to diagnose the root cause of the failure.
- **Mitigation**: The agent is asked to mitigate the failure.
- **Evaluation**: The RCA result is scored by an LLM-as-a-judge oracle, and the mitigation result is checked by task-specific mitigation oracles.
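The lifecycle above can be sketched roughly as follows. This is an illustrative sketch only: the names (`run_task`, `inject_fault`, `judge_rca`, `mitigation_oracle`, and the stub conductor/agent) are hypothetical and do not reflect SREGym's actual API.

```python
from dataclasses import dataclass

@dataclass
class Result:
    diagnosis_score: float  # from the LLM-as-a-judge oracle
    mitigated: bool         # from the task-specific mitigation oracle

def run_task(conductor, agent, task_name: str) -> Result:
    """One benchmark episode: setup -> diagnosis -> mitigation -> evaluation."""
    conductor.inject_fault(task_name)                 # environment setup
    diagnosis = agent.diagnose(conductor.observe())   # root-cause analysis
    agent.mitigate(conductor)                         # mitigation actions
    return Result(
        diagnosis_score=conductor.judge_rca(diagnosis),
        mitigated=conductor.mitigation_oracle(task_name),
    )
```

The key design point, mirrored here, is that the agent only touches the environment through the conductor, so agents and oracles can evolve independently.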


## Run SREGym

1. Prepare `sregym_core/.env` with your configuration. Copy `sregym_core/.env.example` to `sregym_core/.env` and set the keys in the `.env` file. For System Intelligence, you need to set API keys for all the models you want to test, like below:

``` shell
GEMINI_API_KEY="XXXXXX"
OPENAI_API_KEY="XXXXXX"
ANTHROPIC_API_KEY="XXXXXX"
MOONSHOT_API_KEY="XXXXXX"
AZURE_API_KEY="XXXXXX"
AZURE_API_BASE="XXXXXX"
```

**Collaborator:** How about AzureOpenAI and our own hosted open-source model? Do we support them? I think we need to set an endpoint_url?

**Collaborator (Author, @Jackcuii, Dec 5, 2025):** I think we can solve this together with the env file issue. I discussed it a bit with the team. They tend to offer more direct exposure of the LLM backend, so I need to do a bit of work on the SREGym side 😃
> If you want more pre-defined model configurations, refer to the `sregym_core/llm_backend/configs.yaml` file and add your own configurations there. You can then select the backend with the CLI argument `--model <model_id>`.

> For MS Azure and AWS Bedrock, you may need more configurations.
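As a rough illustration of how a `<provider>/<model>` id maps onto the keys above, consider the sketch below. The `resolve_api_key` helper and its mapping are hypothetical; SREGym's real resolution logic lives in `sregym_core/llm_backend` and may differ.

```python
import os

# Illustrative provider -> env-var mapping, mirroring the .env example above.
PROVIDER_ENV_KEYS = {
    "gemini": "GEMINI_API_KEY",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "moonshot": "MOONSHOT_API_KEY",
    "azure": "AZURE_API_KEY",
}

def resolve_api_key(model_id: str) -> str:
    """Map a '<provider>/<model>' id to the API key it requires."""
    provider, _, model = model_id.partition("/")
    if not model:
        raise ValueError(f"expected '<provider>/<model>', got {model_id!r}")
    if provider not in PROVIDER_ENV_KEYS:
        raise ValueError(f"unknown provider: {provider}")
    value = os.environ.get(PROVIDER_ENV_KEYS[provider])
    if not value:
        raise RuntimeError(
            f"{PROVIDER_ENV_KEYS[provider]} is not set; add it to sregym_core/.env"
        )
    return value
```

This also shows why Azure needs two entries: the key alone is not enough, and `AZURE_API_BASE` supplies the endpoint.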


2. Make an `inventory.yml` file in the `sregym_core/scripts/ansible` directory. Copy `inventory.yml.example` to `inventory.yml` and set the hosts. You can follow the instructions [here](https://github.com/SREGym/SREGym?tab=readme-ov-file#a-kubernetes-cluster-recommended) to get a cluster and set up the inventory file.

3. Install the dependencies:
``` shell
cd benchmarks/sregym
./install.sh
```

4. Run the benchmark:
``` shell
cd benchmarks/sregym
./run.sh <model_name> <agent_name>
```
> Some tested model names are: "gemini/gemini-2.5-flash", "openai/gpt-4o", "anthropic/claude-sonnet-4-20250514", "moonshot/moonshot-v1-32k".

The wrapper executes `python src/main.py --agent "stratus" --model_name "${MODEL_NAME}"` to run the benchmark.

The results will be saved in the `outputs/` directory.
``` shell
outputs/sregym__<model>__<agent>__<timestamp>/
├── avg_score.json # Average score
└── result.jsonl # Detailed results
```
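A minimal sketch of how `avg_score.json` could be derived from `result.jsonl`, assuming each line carries a numeric `score` field (the actual schema may differ):

```python
import json
from pathlib import Path

def average_score(result_path: Path) -> float:
    """Average the 'score' field across all lines of a JSONL results file."""
    scores = []
    with result_path.open() as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                scores.append(float(json.loads(line)["score"]))
    return sum(scores) / len(scores) if scores else 0.0
```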

## Use the System Intelligence CLI (optional)

To orchestrate SREGym alongside other benchmarks:

```bash
cd cli
./run_all_local.sh <model_name> <agent_name>
```

**Collaborator:** Can we add a section highlighting how to add/test new agents? I saw you have a sentence about it in Section 2.

**Collaborator (Author):** Okay!


## How to Extend the Benchmark

Please refer to the [Adding New Components](https://sregym.com/docs/contributing#adding-new-components) guide in the SREGym documentation.

## Contribution

We strongly welcome contributions to SREGym.
You can report bugs, suggest features, or contribute code to SREGym in the upstream repository [SREGym](https://github.com/SREGym/SREGym).
28 changes: 28 additions & 0 deletions benchmarks/sregym/WHY.md
@@ -0,0 +1,28 @@
# Why Site Reliability Engineering as an AI Training Task?

`SREGym` treats the Site Reliability Engineering (SRE) incident management process as a training ground for AI agents to develop core [system intelligence capabilities](https://www.sigops.org/2025/defining-system-intelligence/). During an incident, SREs must diagnose root causes in complex distributed systems, mitigate failures, and resolve the underlying fault. This makes SRE a rich, realistic testbed for AI: agents must reason across system boundaries, interpret noisy signals (logs, metrics, traces), and execute safe remediation actions, and we believe they can be trained to reliably assist with, or autonomously handle, critical incidents.

## Goals and Objectives

Site Reliability Engineering has become the standard for operating large-scale software systems. Despite best practices, the practical work of incident response remains stressful and high-stakes. To alleviate this burden, we envision automated SRE agents that perform reliable diagnosis and mitigation; startups and cloud providers alike are pursuing this vision. An agent's capability in incident response also demonstrates how well it understands the system it operates.


## Background

#### » The SRE Incident Lifecycle

SREGym focuses on the core phases of the incident lifecycle, mirroring the critical tasks performed by human Site Reliability Engineers during production outages:

* **Diagnosis (Root Cause Analysis).** In a real-world incident, SREs must rapidly identify *why* a system is failing under pressure. This involves navigating complex distributed systems, correlating noisy signals (logs, metrics, traces) across the stack, and verifying hypotheses to pinpoint the underlying fault (e.g., code bug, configuration drift, or infrastructure failure).

* **Mitigation.** Once the issue is understood (or sometimes even before, to stop bleeding), SREs must determine *how* to restore service health. This requires executing safe, decisive actions—such as rolling back deployments, draining traffic, or restarting services—while carefully managing the risk of collateral damage to ensure the system returns to a healthy state and meets SLAs.

#### » What makes AISRE challenging in practice?

Reliability engineering is complicated by multiple factors that make it a formidable AI challenge:

1. **Complexity and Scale**: Modern microservice architectures generate vast amounts of data. Finding a signal in the noise of terabytes of logs and metrics is non-trivial. SREGym provides a noise generator to evaluate the agents' ability to handle noise.
2. **Partial Observability**: Failures often occur in "blind spots" where instrumentation is missing or misleading (e.g., silent failures, heisenbugs). In text-based RCA benchmarks, agents are exposed to the problem description directly; in the live scenarios SREGym creates, agents must infer the system's state from logs and metrics and identify the problem themselves.
3. **Fail-slow**: Some failures do not cause immediate system failure, but they can cause the system to degrade over time. SREGym includes fail-slow faults to evaluate the agents' ability to find and solve these sorts of problems.
4. **Time-to-mitigate**: SREGym enables the evaluation of the agents' efficiency in mitigating the faults, which is a critical metric for SRE.

89 changes: 89 additions & 0 deletions benchmarks/sregym/data/benchmark/tasks.jsonl
**Collaborator:** I noticed your tasks are missing a "test method", namely a command that the framework can run to validate whether the agent solved the task correctly or not. You may want to take a look at course_lab_bench for a simple example: https://github.com/SREGym/system-intelligence-benchmark/blob/main/benchmarks/course_lab_bench/data/benchmark/course_lab_tasks_mit_65840_2024.jsonl . Or, for a more complex example, check out the evaluator JSON field in arteval_bench: https://github.com/SREGym/system-intelligence-benchmark/blob/main/benchmarks/arteval_bench/data/benchmark/arteval_tasks.jsonl

@@ -0,0 +1,89 @@
{"id": "sregym_001", "task_name": "faulty_image_correlated"}
{"id": "sregym_002", "task_name": "update_incompatible_correlated"}
{"id": "sregym_003", "task_name": "incorrect_image"}
{"id": "sregym_004", "task_name": "incorrect_port_assignment"}
{"id": "sregym_005", "task_name": "misconfig_app_hotel_res"}
{"id": "sregym_006", "task_name": "missing_env_variable_astronomy_shop"}
{"id": "sregym_007", "task_name": "revoke_auth_mongodb-1"}
{"id": "sregym_008", "task_name": "revoke_auth_mongodb-2"}
{"id": "sregym_009", "task_name": "storage_user_unregistered-1"}
{"id": "sregym_010", "task_name": "storage_user_unregistered-2"}
{"id": "sregym_011", "task_name": "valkey_auth_disruption"}
{"id": "sregym_012", "task_name": "valkey_memory_disruption"}
{"id": "sregym_013", "task_name": "capacity_decrease_rpc_retry_storm"}
{"id": "sregym_014", "task_name": "gc_capacity_degradation"}
{"id": "sregym_015", "task_name": "load_spike_rpc_retry_storm"}
{"id": "sregym_016", "task_name": "assign_to_non_existent_node"}
{"id": "sregym_017", "task_name": "auth_miss_mongodb"}
{"id": "sregym_018", "task_name": "configmap_drift_hotel_reservation"}
{"id": "sregym_019", "task_name": "duplicate_pvc_mounts_astronomy_shop"}
{"id": "sregym_020", "task_name": "duplicate_pvc_mounts_hotel_reservation"}
{"id": "sregym_021", "task_name": "duplicate_pvc_mounts_social_network"}
{"id": "sregym_022", "task_name": "env_variable_shadowing_astronomy_shop"}
{"id": "sregym_023", "task_name": "k8s_target_port-misconfig"}
{"id": "sregym_024", "task_name": "liveness_probe_misconfiguration_astronomy_shop"}
{"id": "sregym_025", "task_name": "liveness_probe_misconfiguration_hotel_reservation"}
{"id": "sregym_026", "task_name": "liveness_probe_misconfiguration_social_network"}
{"id": "sregym_027", "task_name": "liveness_probe_too_aggressive_astronomy_shop"}
{"id": "sregym_028", "task_name": "liveness_probe_too_aggressive_hotel_reservation"}
{"id": "sregym_029", "task_name": "liveness_probe_too_aggressive_social_network"}
{"id": "sregym_030", "task_name": "missing_configmap_hotel_reservation"}
{"id": "sregym_031", "task_name": "missing_configmap_social_network"}
{"id": "sregym_032", "task_name": "missing_service_astronomy_shop"}
{"id": "sregym_033", "task_name": "missing_service_hotel_reservation"}
{"id": "sregym_034", "task_name": "missing_service_social_network"}
{"id": "sregym_035", "task_name": "namespace_memory_limit"}
{"id": "sregym_036", "task_name": "pod_anti_affinity_deadlock"}
{"id": "sregym_037", "task_name": "persistent_volume_affinity_violation"}
{"id": "sregym_038", "task_name": "pvc_claim_mismatch"}
{"id": "sregym_039", "task_name": "rbac_misconfiguration"}
{"id": "sregym_040", "task_name": "readiness_probe_misconfiguration_astronomy_shop"}
{"id": "sregym_041", "task_name": "readiness_probe_misconfiguration_hotel_reservation"}
{"id": "sregym_042", "task_name": "readiness_probe_misconfiguration_social_network"}
{"id": "sregym_043", "task_name": "resource_request_too_large"}
{"id": "sregym_044", "task_name": "resource_request_too_small"}
{"id": "sregym_045", "task_name": "rolling_update_misconfigured_hotel_reservation"}
{"id": "sregym_046", "task_name": "rolling_update_misconfigured_social_network"}
{"id": "sregym_047", "task_name": "scale_pod_zero_social_net"}
{"id": "sregym_048", "task_name": "service_dns_resolution_failure_astronomy_shop"}
{"id": "sregym_049", "task_name": "service_dns_resolution_failure_social_network"}
{"id": "sregym_050", "task_name": "sidecar_port_conflict_astronomy_shop"}
{"id": "sregym_051", "task_name": "sidecar_port_conflict_hotel_reservation"}
{"id": "sregym_052", "task_name": "sidecar_port_conflict_social_network"}
{"id": "sregym_053", "task_name": "stale_coredns_config_astronomy_shop"}
{"id": "sregym_054", "task_name": "stale_coredns_config_social_network"}
{"id": "sregym_055", "task_name": "taint_no_toleration_social_network"}
{"id": "sregym_056", "task_name": "wrong_bin_usage"}
{"id": "sregym_057", "task_name": "wrong_dns_policy_astronomy_shop"}
{"id": "sregym_058", "task_name": "wrong_dns_policy_hotel_reservation"}
{"id": "sregym_059", "task_name": "wrong_dns_policy_social_network"}
{"id": "sregym_060", "task_name": "wrong_service_selector_astronomy_shop"}
{"id": "sregym_061", "task_name": "wrong_service_selector_hotel_reservation"}
{"id": "sregym_062", "task_name": "wrong_service_selector_social_network"}
{"id": "sregym_063", "task_name": "astronomy_shop_ad_service_failure"}
{"id": "sregym_064", "task_name": "astronomy_shop_ad_service_high_cpu"}
{"id": "sregym_065", "task_name": "astronomy_shop_ad_service_manual_gc"}
{"id": "sregym_066", "task_name": "astronomy_shop_cart_service_failure"}
{"id": "sregym_067", "task_name": "astronomy_shop_ad_service_image_slow_load"}
{"id": "sregym_068", "task_name": "astronomy_shop_payment_service_failure"}
{"id": "sregym_069", "task_name": "astronomy_shop_payment_service_unreachable"}
{"id": "sregym_070", "task_name": "astronomy_shop_product_catalog_service_failure"}
{"id": "sregym_071", "task_name": "astronomy_shop_recommendation_service_cache_failure"}
{"id": "sregym_072", "task_name": "kafka_queue_problems"}
{"id": "sregym_073", "task_name": "loadgenerator_flood_homepage"}
{"id": "sregym_074", "task_name": "trainticket_f17_nested_sql_select_clause_error"}
{"id": "sregym_075", "task_name": "trainticket_f22_sql_column_name_mismatch_error"}
{"id": "sregym_076", "task_name": "read_error"}
{"id": "sregym_077", "task_name": "latent_sector_error"}
{"id": "sregym_078", "task_name": "silent_data_corruption"}
{"id": "sregym_079", "task_name": "ingress_misroute"}
{"id": "sregym_080", "task_name": "network_policy_block"}
{"id": "sregym_081", "task_name": "social_net_hotel_res_astro_shop_concurrent_failures"}
{"id": "sregym_082", "task_name": "kubelet_crash"}
{"id": "sregym_083", "task_name": "workload_imbalance"}
{"id": "sregym_084", "task_name": "operator_overload_replicas"}
{"id": "sregym_085", "task_name": "operator_non_existent_storage"}
{"id": "sregym_086", "task_name": "operator_invalid_affinity_toleration"}
{"id": "sregym_087", "task_name": "operator_security_context_fault"}
{"id": "sregym_088", "task_name": "operator_wrong_update_strategy_fault"}
