Integrate SREGym #30
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,5 @@ | ||
| # ignore all html files in current directory | ||
| *.html | ||
| agent_graph.png | ||
| *.csv | ||
|
|
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,14 @@ | ||
| FROM ubuntu:24.04 | ||
|
|
||
| WORKDIR /usr/src | ||
| COPY . . | ||
| RUN apt-get update && apt-get install -y \ | ||
| build-essential \ | ||
| git \ | ||
| wget \ | ||
| python3-pip \ | ||
| python3-venv | ||
|
|
||
| RUN chmod +x install.sh test.sh && ./install.sh | ||
|
|
||
| # ENTRYPOINT ["./test.sh"] |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,97 @@ | ||
| <h1>SREGym: A Benchmarking Platform for SRE Agents</h1> | ||
|
|
||
| [🔍Overview](#🤖overview) | [📦Installation](#📦installation) | [🚀Quick Start](#🚀quickstart) | [⚙️Usage](#⚙️usage) | [🤝Contributing](./CONTRIBUTING.md) | [📖Docs](https://sregym.com/docs) | [](https://join.slack.com/t/SREGym/shared_invite/zt-3gvqxpkpc-RvCUcyBEMvzvXaQS9KtS_w) | ||
|
|
||
| ## Overview | ||
| SREGym is an AI-native platform that enables the design, development, and evaluation of AI agents for Site Reliability Engineering (SRE). The core idea is to create live system environments in which SRE agents solve real-world SRE problems. SREGym provides a comprehensive SRE benchmark suite with a wide variety of problems for evaluating SRE agents and for training next-generation AI agents. | ||
| <br><br> | ||
|
|
||
|  | ||
|
|
||
| SREGym is inspired by our prior work on AIOpsLab and ITBench. It is architected with AI-native usability and extensibility as first-class principles. The SREGym benchmark suite contains 86 different SRE problems. It supports all the problems from AIOpsLab and ITBench, and adds new ones such as OS-level faults, metastable failures, and concurrent failures. See our [problem set](https://sregym.com/problems) for the complete list of problems. | ||
|
|
||
|
|
||
| This README explains how to run SREGym within the System Intelligence Framework. | ||
|
|
||
| For advanced use of *System Intelligence* and *SREGym*, please refer to the docs of [*System Intelligence*](https://github.com/sys-intelligence/system-intelligence-benchmark/tree/main/doc) and [*SREGym*](https://sregym.com/docs). | ||
|
|
||
| ## Architecture Explanation | ||
|
|
||
| ### Abstraction | ||
|
|
||
| SREGym has a decoupled design that follows the *System Intelligence* philosophy. | ||
| The components of *System Intelligence* and *SREGym* correspond as follows: | ||
|
|
||
| The `Executor` corresponds to the agent in *SREGym*, which is decoupled from the framework functionality. A baseline agent implementation lives in `sregym_core/clients/stratus/stratus_agent/` and is run by default. If you want to bring your own agent, please follow the [Running Your Own Agent](https://sregym.com/docs/running-your-own-agent) guide. | ||
|
|
||
| The `Evaluator` corresponds to the evaluation oracles in *SREGym*, which are decoupled from the agent implementation. | ||
|
|
||
| *SREGym*'s `Conductor` serves as the `Environment` in *System Intelligence*. | ||
|
|
||
| ### Task Details | ||
|
|
||
| - **Environment Setup**: The SREGym Conductor injects faults into the environment, causing failures. | ||
| - **Diagnosis**: The agent is asked to diagnose the root cause of the failure. | ||
| - **Mitigation**: The agent is asked to mitigate the failure. | ||
| - **Evaluation**: The RCA result is evaluated by an LLM-as-a-judge oracle, and the mitigation result is evaluated by purpose-built mitigation oracles. | ||
|
|
||
|
|
||
| ## Run SREGym | ||
|
|
||
| 1. Prepare `sregym_core/.env` for the configuration. Copy `sregym_core/.env.example` to `sregym_core/.env` and set the keys in the `.env` file. For System Intelligence, you need to set the API keys for every model you want to test, like below: | ||
|
|
||
| ``` shell | ||
| GEMINI_API_KEY="XXXXXX" | ||
| OPENAI_API_KEY="XXXXXX" | ||
|
Collaborator
How about AzureOpenAI and our own hosted open-source model? Do we support them? I think we need to set an endpoint_url?
Collaborator
Author
I think we can solve this together with the .env file issue. I discussed it a bit with the team; they prefer to expose the LLM backend more directly, so I need to do some work on the SREGym side 😃 |
||
| ANTHROPIC_API_KEY="XXXXXX" | ||
| MOONSHOT_API_KEY="XXXXXX" | ||
| AZURE_API_KEY="XXXXXX" | ||
| AZURE_API_BASE="XXXXXX" | ||
|
|
||
| ``` | ||
| > For more pre-defined model configurations, refer to the `sregym_core/llm_backend/configs.yaml` file and add your own configurations there. You can then select the backend with the CLI argument `--model <model_id>`. | ||
|
|
||
| > For MS Azure and AWS Bedrock, you may need additional configuration. | ||
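| A minimal sketch of this first step, assuming the paths above are relative to `benchmarks/sregym` (adjust paths and keys to your environment): | ||
| ``` shell | ||
| # create your local env file from the template, then fill in the API keys shown above | ||
| cp sregym_core/.env.example sregym_core/.env | ||
| ``` | ||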
|
|
||
|
|
||
| 2. Create an `inventory.yml` file in the `sregym_core/scripts/ansible` directory. You can copy `inventory.yml.example` to `inventory.yml` and set the hosts in it, as shown in the sketch below. Follow the instructions [here](https://github.com/SREGym/SREGym?tab=readme-ov-file#a-kubernetes-cluster-recommended) to get a cluster and set up the inventory file. | ||
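| A sketch of this step, following the paths above (fill in your own cluster hosts): | ||
| ``` shell | ||
| cd sregym_core/scripts/ansible | ||
| cp inventory.yml.example inventory.yml | ||
| # edit inventory.yml and set the hosts of your Kubernetes cluster | ||
| ``` | ||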
|
|
||
| 3. Install the dependencies | ||
| ``` shell | ||
| cd benchmarks/sregym | ||
| ./install.sh | ||
| ``` | ||
|
|
||
| 4. Run the benchmark | ||
| ``` shell | ||
| cd benchmarks/sregym | ||
| ./run.sh <model_name> <agent_name> | ||
| ``` | ||
| > Some tested model names are: "gemini/gemini-2.5-flash", "openai/gpt-4o", "anthropic/claude-sonnet-4-20250514", "moonshot/moonshot-v1-32k". | ||
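| For example, to run the baseline `stratus` agent with one of the tested models (pick any model/agent pair supported by your setup): | ||
| ``` shell | ||
| cd benchmarks/sregym | ||
| ./run.sh "openai/gpt-4o" "stratus" | ||
| ``` | ||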
|
|
||
| The wrapper executes `python src/main.py --agent "stratus" --model_name "${MODEL_NAME}"` to run the benchmark. | ||
|
|
||
| The results will be saved in the `outputs/` directory. | ||
| ``` shell | ||
| outputs/sregym__<model>__<agent>__<timestamp>/ | ||
| ├── avg_score.json # Average score | ||
| └── result.jsonl # Detailed results | ||
| ``` | ||
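| To inspect a finished run (the directory name is a placeholder following the pattern above): | ||
| ``` shell | ||
| ls outputs/ | ||
| cat outputs/sregym__<model>__<agent>__<timestamp>/avg_score.json | ||
| ``` | ||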
|
|
||
| ## Use the System Intelligence CLI (optional) | ||
|
|
||
| To orchestrate SREGym alongside other benchmarks: | ||
|
|
||
| ```bash | ||
| cd cli | ||
| ./run_all_local.sh <model_name> <agent_name> | ||
| ``` | ||
|
|
||
|
Collaborator
Can we add a section highlighting how to add/test new agents? I saw you have a sentence about it in Section 2.
Collaborator
Author
Okay!
Collaborator
Piggy-backing on @xuafeng's comment, could you also show an example? You can follow the template here: https://github.com/SREGym/system-intelligence-benchmark/tree/main/benchmarks/course_lab_bench#how-to-extend-the-benchmark ("Adding a new artifact" subsection). |
||
| ## How to Extend the Benchmark | ||
|
|
||
| Please refer to the [Adding New Components](https://sregym.com/docs/contributing#adding-new-components) guide in the SREGym documentation. | ||
|
|
||
| ## Contribution | ||
|
|
||
| We warmly welcome contributions to SREGym. | ||
| You can report bugs, suggest features, or contribute code in the upstream [SREGym](https://github.com/SREGym/SREGym) repository. | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,28 @@ | ||
| # Why Site Reliability Engineering as an AI Training Task? | ||
|
|
||
| `SREGym` treats the Site Reliability Engineering (SRE) incident management process as a training ground for AI agents to form core [system intelligence capabilities](https://www.sigops.org/2025/defining-system-intelligence/). During an incident, SREs must diagnose root causes in complex distributed systems, mitigate failures, and fix the underlying cause. This makes SRE a rich, realistic testbed for AI: agents must reason across system boundaries, interpret noisy signals (logs, metrics, traces), and execute safe remediation actions, and we believe they can be trained to reliably assist with, or autonomously handle, critical incidents. | ||
|
|
||
| ## Goals and Objectives | ||
|
|
||
| Site Reliability Engineering has become the standard for operating large-scale software systems. Despite best practices, the practical work of incident response remains stressful and high-stakes. To alleviate this burden, we envision automated SRE agents that perform reliable diagnosis and mitigation, a direction that startups and cloud providers are actively pursuing. An agent's performance in incident response also reveals how well it understands the underlying system, which is a critical capability in its own right. | ||
|
|
||
|
|
||
| ## Background | ||
|
|
||
| #### » The SRE Incident Lifecycle | ||
|
|
||
| SREGym focuses on the core phases of the incident lifecycle, mirroring the critical tasks performed by human Site Reliability Engineers during production outages: | ||
|
|
||
| * **Diagnosis (Root Cause Analysis).** In a real-world incident, SREs must rapidly identify *why* a system is failing under pressure. This involves navigating complex distributed systems, correlating noisy signals (logs, metrics, traces) across the stack, and verifying hypotheses to pinpoint the underlying fault (e.g., code bug, configuration drift, or infrastructure failure). | ||
|
|
||
| * **Mitigation.** Once the issue is understood (or sometimes even before, to stop bleeding), SREs must determine *how* to restore service health. This requires executing safe, decisive actions—such as rolling back deployments, draining traffic, or restarting services—while carefully managing the risk of collateral damage to ensure the system returns to a healthy state and meets SLAs. | ||
|
|
||
| #### » What makes AISRE challenging in practice? | ||
|
|
||
| Reliability engineering is obstructed by multiple factors that make it a formidable AI challenge: | ||
|
|
||
| 1. **Complexity and Scale**: Modern microservice architectures generate vast amounts of data. Finding a signal in the noise of terabytes of logs and metrics is non-trivial. SREGym provides a noise generator to evaluate the agents' ability to handle noise. | ||
| 2. **Partial Observability**: Failures often occur in "blind spots" where instrumentation is missing or misleading (e.g., silent failures, heisenbugs). In text-based RCA problems, agents are exposed to the problem description directly; in the live scenarios SREGym creates, agents must infer the system's state from logs and metrics and identify the problem themselves. | ||
| 3. **Fail-slow**: Some faults do not cause an immediate outage but degrade the system over time. SREGym includes fail-slow faults to evaluate the agents' ability to find and resolve these sorts of problems. | ||
| 4. **Time-to-mitigate**: SREGym enables the evaluation of the agents' efficiency in mitigating the faults, which is a critical metric for SRE. | ||
|
|
|
Collaborator
I noticed your tasks are missing a "test method", namely a command that the framework can run to validate whether the agent solved the task correctly or not. You may want to take a look at |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,89 @@ | ||
| {"id": "sregym_001", "task_name": "faulty_image_correlated"} | ||
| {"id": "sregym_002", "task_name": "update_incompatible_correlated"} | ||
| {"id": "sregym_003", "task_name": "incorrect_image"} | ||
| {"id": "sregym_004", "task_name": "incorrect_port_assignment"} | ||
| {"id": "sregym_005", "task_name": "misconfig_app_hotel_res"} | ||
| {"id": "sregym_006", "task_name": "missing_env_variable_astronomy_shop"} | ||
| {"id": "sregym_007", "task_name": "revoke_auth_mongodb-1"} | ||
| {"id": "sregym_008", "task_name": "revoke_auth_mongodb-2"} | ||
| {"id": "sregym_009", "task_name": "storage_user_unregistered-1"} | ||
| {"id": "sregym_010", "task_name": "storage_user_unregistered-2"} | ||
| {"id": "sregym_011", "task_name": "valkey_auth_disruption"} | ||
| {"id": "sregym_012", "task_name": "valkey_memory_disruption"} | ||
| {"id": "sregym_013", "task_name": "capacity_decrease_rpc_retry_storm"} | ||
| {"id": "sregym_014", "task_name": "gc_capacity_degradation"} | ||
| {"id": "sregym_015", "task_name": "load_spike_rpc_retry_storm"} | ||
| {"id": "sregym_016", "task_name": "assign_to_non_existent_node"} | ||
| {"id": "sregym_017", "task_name": "auth_miss_mongodb"} | ||
| {"id": "sregym_018", "task_name": "configmap_drift_hotel_reservation"} | ||
| {"id": "sregym_019", "task_name": "duplicate_pvc_mounts_astronomy_shop"} | ||
| {"id": "sregym_020", "task_name": "duplicate_pvc_mounts_hotel_reservation"} | ||
| {"id": "sregym_021", "task_name": "duplicate_pvc_mounts_social_network"} | ||
| {"id": "sregym_022", "task_name": "env_variable_shadowing_astronomy_shop"} | ||
| {"id": "sregym_023", "task_name": "k8s_target_port-misconfig"} | ||
| {"id": "sregym_024", "task_name": "liveness_probe_misconfiguration_astronomy_shop"} | ||
| {"id": "sregym_025", "task_name": "liveness_probe_misconfiguration_hotel_reservation"} | ||
| {"id": "sregym_026", "task_name": "liveness_probe_misconfiguration_social_network"} | ||
| {"id": "sregym_027", "task_name": "liveness_probe_too_aggressive_astronomy_shop"} | ||
| {"id": "sregym_028", "task_name": "liveness_probe_too_aggressive_hotel_reservation"} | ||
| {"id": "sregym_029", "task_name": "liveness_probe_too_aggressive_social_network"} | ||
| {"id": "sregym_030", "task_name": "missing_configmap_hotel_reservation"} | ||
| {"id": "sregym_031", "task_name": "missing_configmap_social_network"} | ||
| {"id": "sregym_032", "task_name": "missing_service_astronomy_shop"} | ||
| {"id": "sregym_033", "task_name": "missing_service_hotel_reservation"} | ||
| {"id": "sregym_034", "task_name": "missing_service_social_network"} | ||
| {"id": "sregym_035", "task_name": "namespace_memory_limit"} | ||
| {"id": "sregym_036", "task_name": "pod_anti_affinity_deadlock"} | ||
| {"id": "sregym_037", "task_name": "persistent_volume_affinity_violation"} | ||
| {"id": "sregym_038", "task_name": "pvc_claim_mismatch"} | ||
| {"id": "sregym_039", "task_name": "rbac_misconfiguration"} | ||
| {"id": "sregym_040", "task_name": "readiness_probe_misconfiguration_astronomy_shop"} | ||
| {"id": "sregym_041", "task_name": "readiness_probe_misconfiguration_hotel_reservation"} | ||
| {"id": "sregym_042", "task_name": "readiness_probe_misconfiguration_social_network"} | ||
| {"id": "sregym_043", "task_name": "resource_request_too_large"} | ||
| {"id": "sregym_044", "task_name": "resource_request_too_small"} | ||
| {"id": "sregym_045", "task_name": "rolling_update_misconfigured_hotel_reservation"} | ||
| {"id": "sregym_046", "task_name": "rolling_update_misconfigured_social_network"} | ||
| {"id": "sregym_047", "task_name": "scale_pod_zero_social_net"} | ||
| {"id": "sregym_048", "task_name": "service_dns_resolution_failure_astronomy_shop"} | ||
| {"id": "sregym_049", "task_name": "service_dns_resolution_failure_social_network"} | ||
| {"id": "sregym_050", "task_name": "sidecar_port_conflict_astronomy_shop"} | ||
| {"id": "sregym_051", "task_name": "sidecar_port_conflict_hotel_reservation"} | ||
| {"id": "sregym_052", "task_name": "sidecar_port_conflict_social_network"} | ||
| {"id": "sregym_053", "task_name": "stale_coredns_config_astronomy_shop"} | ||
| {"id": "sregym_054", "task_name": "stale_coredns_config_social_network"} | ||
| {"id": "sregym_055", "task_name": "taint_no_toleration_social_network"} | ||
| {"id": "sregym_056", "task_name": "wrong_bin_usage"} | ||
| {"id": "sregym_057", "task_name": "wrong_dns_policy_astronomy_shop"} | ||
| {"id": "sregym_058", "task_name": "wrong_dns_policy_hotel_reservation"} | ||
| {"id": "sregym_059", "task_name": "wrong_dns_policy_social_network"} | ||
| {"id": "sregym_060", "task_name": "wrong_service_selector_astronomy_shop"} | ||
| {"id": "sregym_061", "task_name": "wrong_service_selector_hotel_reservation"} | ||
| {"id": "sregym_062", "task_name": "wrong_service_selector_social_network"} | ||
| {"id": "sregym_063", "task_name": "astronomy_shop_ad_service_failure"} | ||
| {"id": "sregym_064", "task_name": "astronomy_shop_ad_service_high_cpu"} | ||
| {"id": "sregym_065", "task_name": "astronomy_shop_ad_service_manual_gc"} | ||
| {"id": "sregym_066", "task_name": "astronomy_shop_cart_service_failure"} | ||
| {"id": "sregym_067", "task_name": "astronomy_shop_ad_service_image_slow_load"} | ||
| {"id": "sregym_068", "task_name": "astronomy_shop_payment_service_failure"} | ||
| {"id": "sregym_069", "task_name": "astronomy_shop_payment_service_unreachable"} | ||
| {"id": "sregym_070", "task_name": "astronomy_shop_product_catalog_service_failure"} | ||
| {"id": "sregym_071", "task_name": "astronomy_shop_recommendation_service_cache_failure"} | ||
| {"id": "sregym_072", "task_name": "kafka_queue_problems"} | ||
| {"id": "sregym_073", "task_name": "loadgenerator_flood_homepage"} | ||
| {"id": "sregym_074", "task_name": "trainticket_f17_nested_sql_select_clause_error"} | ||
| {"id": "sregym_075", "task_name": "trainticket_f22_sql_column_name_mismatch_error"} | ||
| {"id": "sregym_076", "task_name": "read_error"} | ||
| {"id": "sregym_077", "task_name": "latent_sector_error"} | ||
| {"id": "sregym_078", "task_name": "silent_data_corruption"} | ||
| {"id": "sregym_079", "task_name": "ingress_misroute"} | ||
| {"id": "sregym_080", "task_name": "network_policy_block"} | ||
| {"id": "sregym_081", "task_name": "social_net_hotel_res_astro_shop_concurrent_failures"} | ||
| {"id": "sregym_082", "task_name": "kubelet_crash"} | ||
| {"id": "sregym_083", "task_name": "workload_imbalance"} | ||
| {"id": "sregym_084", "task_name": "operator_overload_replicas"} | ||
| {"id": "sregym_085", "task_name": "operator_non_existent_storage"} | ||
| {"id": "sregym_086", "task_name": "operator_invalid_affinity_toleration"} | ||
| {"id": "sregym_087", "task_name": "operator_security_context_fault"} | ||
| {"id": "sregym_088", "task_name": "operator_wrong_update_strategy_fault"} | ||
|
|
I think this container might fail because the entry point is `./test.sh`. Docker runs this script as the first process, and when it finishes (yours simply terminates without running anything) the container exits as well. Also, my understanding is that any `docker run image <...>` command becomes an argument to `test.sh`, which, in your case, doesn't process any (again, because it exits immediately). You might want to check out an example from ArtEvalBench: https://github.com/sys-intelligence/system-intelligence-benchmark/blob/main/benchmarks/arteval_bench/Dockerfile

@Jackcuii @xuafeng we might actually want to run the Docker image to make sure? What do you think?
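A small illustration of the ENTRYPOINT behavior described above (the image tag and flag are hypothetical):
``` shell
# With ENTRYPOINT ["./test.sh"], the script runs as PID 1 and any extra
# arguments passed to `docker run` are appended to it; the container
# exits as soon as the script exits.
docker build -t sregym-bench .           # hypothetical tag
docker run sregym-bench --some-flag      # runs ./test.sh --some-flag inside the container
```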