Merged
5 changes: 5 additions & 0 deletions benchmarks/sregym/.gitignore
@@ -0,0 +1,5 @@
# Ignore generated artifacts (HTML files, agent graph, CSV results)
*.html
agent_graph.png
*.csv

14 changes: 14 additions & 0 deletions benchmarks/sregym/Dockerfile
**Collaborator:** I think this container might fail because the entry point is `./test.sh`. Docker runs this script as the first process, and when it finishes (yours simply terminates without running anything) the container exits as well. Also, my understanding is that any `docker run image <...>` command becomes an argument to `test.sh`, which in your case doesn't process any (again, because it exits immediately). You might want to check out an example from ArtEvalBench: https://github.com/sys-intelligence/system-intelligence-benchmark/blob/main/benchmarks/arteval_bench/Dockerfile

**Collaborator:** @Jackcuii @xuafeng we might actually want to run the Docker image to make sure? What do you think?

@@ -0,0 +1,14 @@
FROM ubuntu:24.04

WORKDIR /usr/src
COPY . .
RUN apt-get update && apt-get install -y \
build-essential \
git \
wget \
python3-pip \
python3-venv

RUN chmod +x install.sh test.sh && ./install.sh

# ENTRYPOINT ["./test.sh"]
97 changes: 97 additions & 0 deletions benchmarks/sregym/README.md
@@ -0,0 +1,97 @@
<h1>SREGym: A Benchmarking Platform for SRE Agents</h1>

[🔍Overview](#overview) | [🚀Run SREGym](#run-sregym) | [⚙️Extending](#how-to-extend-the-benchmark) | [🤝Contributing](./CONTRIBUTING.md) | [📖Docs](https://sregym.com/docs) | [![Slack](https://img.shields.io/badge/-Slack-4A154B?style=flat-square&logo=slack&logoColor=white)](https://join.slack.com/t/SREGym/shared_invite/zt-3gvqxpkpc-RvCUcyBEMvzvXaQS9KtS_w)

## Overview
SREGym is an AI-native platform for designing, developing, and evaluating AI agents for Site Reliability Engineering (SRE). The core idea is to create live system environments in which SRE agents solve real-world SRE problems. SREGym provides a comprehensive SRE benchmark suite with a wide variety of problems for evaluating SRE agents, and also for training next-generation AI agents.
<br><br>

![SREGym Overview](./sregym_core/assets/SREGymFigure.png)

SREGym is inspired by our prior work on AIOpsLab and ITBench. It is architected with AI-native usability and extensibility as first-class principles. The SREGym benchmark suite contains 86 different SRE problems: it supports all the problems from AIOpsLab and ITBench, and adds new ones such as OS-level faults, metastable failures, and concurrent failures. See our [problem set](https://sregym.com/problems) for a complete list.


This README explains how to run SREGym within the System Intelligence Framework.

For advanced use of *System Intelligence* and *SREGym*, please refer to the docs of [*System Intelligence*](https://github.com/sys-intelligence/system-intelligence-benchmark/tree/main/doc) and [*SREGym*](https://sregym.com/docs).

## Architecture Explanation

### Abstraction

SREGym has a decoupled design that complies with the *System Intelligence* philosophy.
The components of *System Intelligence* and *SREGym* correspond as follows:

The `Executor` is the agent in *SREGym*, which is decoupled from the framework functionality. We provide a baseline agent implementation in `sregym_core/clients/stratus/stratus_agent/`, which runs by default. If you want to bring your own agent, please follow the [Running Your Own Agent](https://sregym.com/docs/running-your-own-agent) guide.

The `Evaluator` corresponds to the evaluation oracles in *SREGym*, which are decoupled from the agent implementation.

*SREGym*'s `Conductor` serves as the `Environment` in *System Intelligence*.

### Task Details

- **Environment Setup**: The SREGym Conductor injects faults into the environment to induce failures.
- **Diagnosis**: The agent is asked to diagnose the root cause of the failure.
- **Mitigation**: The agent is asked to mitigate the failure.
- **Evaluation**: The RCA result is scored by an LLM-as-a-judge oracle, and the mitigation result is checked by task-specific mitigation oracles.
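The lifecycle above can be sketched roughly as follows. This is an illustrative sketch only: the names (`run_task`, `inject_fault`, `judge_rca`, `mitigation_oracle`, and the stub conductor/agent) are hypothetical and do not reflect SREGym's actual API.

```python
from dataclasses import dataclass

@dataclass
class Result:
    diagnosis_score: float  # from the LLM-as-a-judge oracle
    mitigated: bool         # from the task-specific mitigation oracle

def run_task(conductor, agent, task_name: str) -> Result:
    """One benchmark episode: setup -> diagnosis -> mitigation -> evaluation."""
    conductor.inject_fault(task_name)                 # environment setup
    diagnosis = agent.diagnose(conductor.observe())   # root-cause analysis
    agent.mitigate(conductor)                         # mitigation actions
    return Result(
        diagnosis_score=conductor.judge_rca(diagnosis),
        mitigated=conductor.mitigation_oracle(task_name),
    )
```

The key design point, mirrored here, is that the agent only touches the environment through the conductor, so agents and oracles can evolve independently.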


## Run SREGym

1. Prepare `sregym_core/.env` with your configuration. Copy `sregym_core/.env.example` to `sregym_core/.env` and set the keys in the `.env` file. For System Intelligence, you need to set API keys for all the models you want to test, like below:

``` shell
GEMINI_API_KEY="XXXXXX"
OPENAI_API_KEY="XXXXXX"
ANTHROPIC_API_KEY="XXXXXX"
MOONSHOT_API_KEY="XXXXXX"
AZURE_API_KEY="XXXXXX"
AZURE_API_BASE="XXXXXX"
```

**Collaborator:** How about AzureOpenAI and our own hosted open-source model? Do we support them? I think we need to set an endpoint_url?

**Collaborator (Author, @Jackcuii, Dec 5, 2025):** I think we can solve this together with the env file issue. I discussed it a bit with the team. They tend to offer more direct exposure of the LLM backend, so I need to do a bit of work on the SREGym side 😃
> If you want more pre-defined model configurations, refer to the `sregym_core/llm_backend/configs.yaml` file and add your own configurations there. You can then select the backend with the CLI argument `--model <model_id>`.

> For MS Azure and AWS Bedrock, you may need more configurations.
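As a rough illustration of how a `<provider>/<model>` id maps onto the keys above, consider the sketch below. The `resolve_api_key` helper and its mapping are hypothetical; SREGym's real resolution logic lives in `sregym_core/llm_backend` and may differ.

```python
import os

# Illustrative provider -> env-var mapping, mirroring the .env example above.
PROVIDER_ENV_KEYS = {
    "gemini": "GEMINI_API_KEY",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "moonshot": "MOONSHOT_API_KEY",
    "azure": "AZURE_API_KEY",
}

def resolve_api_key(model_id: str) -> str:
    """Map a '<provider>/<model>' id to the API key it requires."""
    provider, _, model = model_id.partition("/")
    if not model:
        raise ValueError(f"expected '<provider>/<model>', got {model_id!r}")
    if provider not in PROVIDER_ENV_KEYS:
        raise ValueError(f"unknown provider: {provider}")
    value = os.environ.get(PROVIDER_ENV_KEYS[provider])
    if not value:
        raise RuntimeError(
            f"{PROVIDER_ENV_KEYS[provider]} is not set; add it to sregym_core/.env"
        )
    return value
```

This also shows why Azure needs two entries: the key alone is not enough, and `AZURE_API_BASE` supplies the endpoint.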


2. Make an `inventory.yml` file in the `sregym_core/scripts/ansible` directory. Copy `inventory.yml.example` to `inventory.yml` and set the hosts. You can follow the instructions [here](https://github.com/SREGym/SREGym?tab=readme-ov-file#a-kubernetes-cluster-recommended) to get a cluster and set up the inventory file.

3. Install the dependencies:
``` shell
cd benchmarks/sregym
./install.sh
```

4. Run the benchmark:
``` shell
cd benchmarks/sregym
./run.sh <model_name> <agent_name>
```
> Some tested model names are: "gemini/gemini-2.5-flash", "openai/gpt-4o", "anthropic/claude-sonnet-4-20250514", "moonshot/moonshot-v1-32k".

The wrapper executes `python src/main.py --agent "stratus" --model_name "${MODEL_NAME}"` to run the benchmark.

The results will be saved in the `outputs/` directory.
``` shell
outputs/sregym__<model>__<agent>__<timestamp>/
├── avg_score.json # Average score
└── result.jsonl # Detailed results
```
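A minimal sketch of how `avg_score.json` could be derived from `result.jsonl`, assuming each line carries a numeric `score` field (the actual schema may differ):

```python
import json
from pathlib import Path

def average_score(result_path: Path) -> float:
    """Average the 'score' field across all lines of a JSONL results file."""
    scores = []
    with result_path.open() as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                scores.append(float(json.loads(line)["score"]))
    return sum(scores) / len(scores) if scores else 0.0
```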

## Use the System Intelligence CLI (optional)

To orchestrate SREGym alongside other benchmarks:

```bash
cd cli
./run_all_local.sh <model_name> <agent_name>
```

**Collaborator:** Can we add a section highlighting how to add/test new agents? I saw you have a sentence about it in Section 2.

**Collaborator (Author):** Okay!


## How to Extend the Benchmark

Please refer to the [Adding New Components](https://sregym.com/docs/contributing#adding-new-components) guide in the SREGym documentation.

## Contribution

We strongly welcome contributions to SREGym.
You can report bugs, suggest features, or contribute code to SREGym in the upstream repository [SREGym](https://github.com/SREGym/SREGym).
28 changes: 28 additions & 0 deletions benchmarks/sregym/WHY.md
@@ -0,0 +1,28 @@
# Why Site Reliability Engineering as an AI Training Task?

`SREGym` treats the Site Reliability Engineering (SRE) incident management process as a training ground for AI agents to develop core [system intelligence capabilities](https://www.sigops.org/2025/defining-system-intelligence/). During an incident, SREs must diagnose root causes in complex distributed systems, mitigate failures, and resolve the underlying fault. This makes SRE a rich, realistic testbed for AI: agents must reason across system boundaries, interpret noisy signals (logs, metrics, traces), and execute safe remediation actions, and we believe they can be trained to reliably assist with, or autonomously handle, critical incidents.

## Goals and Objectives

Site Reliability Engineering has become the standard for operating large-scale software systems. Despite best practices, the practical work of incident response remains stressful and high-stakes. To alleviate this burden, we envision automated SRE agents that perform reliable diagnosis and mitigation; startups and cloud providers alike are pursuing this vision. An agent's capability in incident response also demonstrates how well it understands the system it operates.


## Background

#### » The SRE Incident Lifecycle

SREGym focuses on the core phases of the incident lifecycle, mirroring the critical tasks performed by human Site Reliability Engineers during production outages:

* **Diagnosis (Root Cause Analysis).** In a real-world incident, SREs must rapidly identify *why* a system is failing under pressure. This involves navigating complex distributed systems, correlating noisy signals (logs, metrics, traces) across the stack, and verifying hypotheses to pinpoint the underlying fault (e.g., code bug, configuration drift, or infrastructure failure).

* **Mitigation.** Once the issue is understood (or sometimes even before, to stop bleeding), SREs must determine *how* to restore service health. This requires executing safe, decisive actions—such as rolling back deployments, draining traffic, or restarting services—while carefully managing the risk of collateral damage to ensure the system returns to a healthy state and meets SLAs.

#### » What makes AISRE challenging in practice?

Reliability engineering is complicated by multiple factors that make it a formidable AI challenge:

1. **Complexity and Scale**: Modern microservice architectures generate vast amounts of data. Finding a signal in the noise of terabytes of logs and metrics is non-trivial. SREGym provides a noise generator to evaluate the agents' ability to handle noise.
2. **Partial Observability**: Failures often occur in "blind spots" where instrumentation is missing or misleading (e.g., silent failures, heisenbugs). In text-based RCA benchmarks, agents are exposed to the problem description directly; in the live scenarios SREGym creates, agents must infer the system's state from logs and metrics and identify the problem themselves.
3. **Fail-slow**: Some failures do not cause immediate system failure, but they can cause the system to degrade over time. SREGym includes fail-slow faults to evaluate the agents' ability to find and solve these sorts of problems.
4. **Time-to-mitigate**: SREGym enables the evaluation of the agents' efficiency in mitigating the faults, which is a critical metric for SRE.

89 changes: 89 additions & 0 deletions benchmarks/sregym/data/benchmark/tasks.jsonl
**Collaborator:** I noticed your tasks are missing a "test method", namely a command that the framework can run to validate whether the agent solved the task correctly or not. You may want to take a look at course_lab_bench for a simple example: https://github.com/SREGym/system-intelligence-benchmark/blob/main/benchmarks/course_lab_bench/data/benchmark/course_lab_tasks_mit_65840_2024.jsonl . Or, for a more complex example, check out the evaluator JSON field in arteval_bench: https://github.com/SREGym/system-intelligence-benchmark/blob/main/benchmarks/arteval_bench/data/benchmark/arteval_tasks.jsonl

@@ -0,0 +1,89 @@
{"id": "sregym_001", "task_name": "faulty_image_correlated"}
{"id": "sregym_002", "task_name": "update_incompatible_correlated"}
{"id": "sregym_003", "task_name": "incorrect_image"}
{"id": "sregym_004", "task_name": "incorrect_port_assignment"}
{"id": "sregym_005", "task_name": "misconfig_app_hotel_res"}
{"id": "sregym_006", "task_name": "missing_env_variable_astronomy_shop"}
{"id": "sregym_007", "task_name": "revoke_auth_mongodb-1"}
{"id": "sregym_008", "task_name": "revoke_auth_mongodb-2"}
{"id": "sregym_009", "task_name": "storage_user_unregistered-1"}
{"id": "sregym_010", "task_name": "storage_user_unregistered-2"}
{"id": "sregym_011", "task_name": "valkey_auth_disruption"}
{"id": "sregym_012", "task_name": "valkey_memory_disruption"}
{"id": "sregym_013", "task_name": "capacity_decrease_rpc_retry_storm"}
{"id": "sregym_014", "task_name": "gc_capacity_degradation"}
{"id": "sregym_015", "task_name": "load_spike_rpc_retry_storm"}
{"id": "sregym_016", "task_name": "assign_to_non_existent_node"}
{"id": "sregym_017", "task_name": "auth_miss_mongodb"}
{"id": "sregym_018", "task_name": "configmap_drift_hotel_reservation"}
{"id": "sregym_019", "task_name": "duplicate_pvc_mounts_astronomy_shop"}
{"id": "sregym_020", "task_name": "duplicate_pvc_mounts_hotel_reservation"}
{"id": "sregym_021", "task_name": "duplicate_pvc_mounts_social_network"}
{"id": "sregym_022", "task_name": "env_variable_shadowing_astronomy_shop"}
{"id": "sregym_023", "task_name": "k8s_target_port-misconfig"}
{"id": "sregym_024", "task_name": "liveness_probe_misconfiguration_astronomy_shop"}
{"id": "sregym_025", "task_name": "liveness_probe_misconfiguration_hotel_reservation"}
{"id": "sregym_026", "task_name": "liveness_probe_misconfiguration_social_network"}
{"id": "sregym_027", "task_name": "liveness_probe_too_aggressive_astronomy_shop"}
{"id": "sregym_028", "task_name": "liveness_probe_too_aggressive_hotel_reservation"}
{"id": "sregym_029", "task_name": "liveness_probe_too_aggressive_social_network"}
{"id": "sregym_030", "task_name": "missing_configmap_hotel_reservation"}
{"id": "sregym_031", "task_name": "missing_configmap_social_network"}
{"id": "sregym_032", "task_name": "missing_service_astronomy_shop"}
{"id": "sregym_033", "task_name": "missing_service_hotel_reservation"}
{"id": "sregym_034", "task_name": "missing_service_social_network"}
{"id": "sregym_035", "task_name": "namespace_memory_limit"}
{"id": "sregym_036", "task_name": "pod_anti_affinity_deadlock"}
{"id": "sregym_037", "task_name": "persistent_volume_affinity_violation"}
{"id": "sregym_038", "task_name": "pvc_claim_mismatch"}
{"id": "sregym_039", "task_name": "rbac_misconfiguration"}
{"id": "sregym_040", "task_name": "readiness_probe_misconfiguration_astronomy_shop"}
{"id": "sregym_041", "task_name": "readiness_probe_misconfiguration_hotel_reservation"}
{"id": "sregym_042", "task_name": "readiness_probe_misconfiguration_social_network"}
{"id": "sregym_043", "task_name": "resource_request_too_large"}
{"id": "sregym_044", "task_name": "resource_request_too_small"}
{"id": "sregym_045", "task_name": "rolling_update_misconfigured_hotel_reservation"}
{"id": "sregym_046", "task_name": "rolling_update_misconfigured_social_network"}
{"id": "sregym_047", "task_name": "scale_pod_zero_social_net"}
{"id": "sregym_048", "task_name": "service_dns_resolution_failure_astronomy_shop"}
{"id": "sregym_049", "task_name": "service_dns_resolution_failure_social_network"}
{"id": "sregym_050", "task_name": "sidecar_port_conflict_astronomy_shop"}
{"id": "sregym_051", "task_name": "sidecar_port_conflict_hotel_reservation"}
{"id": "sregym_052", "task_name": "sidecar_port_conflict_social_network"}
{"id": "sregym_053", "task_name": "stale_coredns_config_astronomy_shop"}
{"id": "sregym_054", "task_name": "stale_coredns_config_social_network"}
{"id": "sregym_055", "task_name": "taint_no_toleration_social_network"}
{"id": "sregym_056", "task_name": "wrong_bin_usage"}
{"id": "sregym_057", "task_name": "wrong_dns_policy_astronomy_shop"}
{"id": "sregym_058", "task_name": "wrong_dns_policy_hotel_reservation"}
{"id": "sregym_059", "task_name": "wrong_dns_policy_social_network"}
{"id": "sregym_060", "task_name": "wrong_service_selector_astronomy_shop"}
{"id": "sregym_061", "task_name": "wrong_service_selector_hotel_reservation"}
{"id": "sregym_062", "task_name": "wrong_service_selector_social_network"}
{"id": "sregym_063", "task_name": "astronomy_shop_ad_service_failure"}
{"id": "sregym_064", "task_name": "astronomy_shop_ad_service_high_cpu"}
{"id": "sregym_065", "task_name": "astronomy_shop_ad_service_manual_gc"}
{"id": "sregym_066", "task_name": "astronomy_shop_cart_service_failure"}
{"id": "sregym_067", "task_name": "astronomy_shop_ad_service_image_slow_load"}
{"id": "sregym_068", "task_name": "astronomy_shop_payment_service_failure"}
{"id": "sregym_069", "task_name": "astronomy_shop_payment_service_unreachable"}
{"id": "sregym_070", "task_name": "astronomy_shop_product_catalog_service_failure"}
{"id": "sregym_071", "task_name": "astronomy_shop_recommendation_service_cache_failure"}
{"id": "sregym_072", "task_name": "kafka_queue_problems"}
{"id": "sregym_073", "task_name": "loadgenerator_flood_homepage"}
{"id": "sregym_074", "task_name": "trainticket_f17_nested_sql_select_clause_error"}
{"id": "sregym_075", "task_name": "trainticket_f22_sql_column_name_mismatch_error"}
{"id": "sregym_076", "task_name": "read_error"}
{"id": "sregym_077", "task_name": "latent_sector_error"}
{"id": "sregym_078", "task_name": "silent_data_corruption"}
{"id": "sregym_079", "task_name": "ingress_misroute"}
{"id": "sregym_080", "task_name": "network_policy_block"}
{"id": "sregym_081", "task_name": "social_net_hotel_res_astro_shop_concurrent_failures"}
{"id": "sregym_082", "task_name": "kubelet_crash"}
{"id": "sregym_083", "task_name": "workload_imbalance"}
{"id": "sregym_084", "task_name": "operator_overload_replicas"}
{"id": "sregym_085", "task_name": "operator_non_existent_storage"}
{"id": "sregym_086", "task_name": "operator_invalid_affinity_toleration"}
{"id": "sregym_087", "task_name": "operator_security_context_fault"}
{"id": "sregym_088", "task_name": "operator_wrong_update_strategy_fault"}
