Frontier Agents Score Below 50% on SRE Task Benchmark

On May 27, 2026, Artificial Analysis and IBM Research released ITBench-AA, an evaluation suite for agentic enterprise IT tasks. The benchmark measures how well AI agents resolve complex, real-world operational challenges inside live computing environments. The inaugural release focuses on Site Reliability Engineering (SRE). In the initial runs, every frontier model scored below 50%, which points to a clear ceiling on what current autonomous agents can do in operational settings.

If you evaluate and test AI agents for production workloads, this benchmark marks a structural shift in how they are measured. ITBench-AA moves beyond static multiple-choice questions by using executable scripts and rubric-based scoring across 59 tasks. Agents must navigate live Kubernetes incident snapshots, read system logs, and trace dependencies to locate root causes. To limit data contamination, 19 of the tasks are held out from public datasets.

SRE Task Performance

The current leaderboard shows a narrow spread among the most advanced models. Claude Opus 4.7 leads when run with its Adaptive Reasoning and Max Effort configurations, followed closely by OpenAI’s GPT-5.5.

Model	Score
Claude Opus 4.7 (Adaptive Reasoning, Max Effort)	47%
GPT-5.5 (xhigh)	46%
Qwen3.7 Max	42%
GLM-5.1 (Reasoning)	40%

The GLM-5.1 score sets the high-water mark for open-weights models and matches the performance of Google’s proprietary Gemini 3.5 Flash in its high configuration. All of these models easily clear simpler coding evaluations like Terminal-Bench, which highlights the specific difficulty of multi-step diagnostic workflows.

Trajectory Efficiency and False Positives

A core finding from the research is that longer task trajectories do not yield higher success rates. In the reported results, step count and diagnostic accuracy move in opposite directions. GPT-5.5 (xhigh) averages 31 turns per task to achieve its 46% score. In contrast, Gemini 3.1 Pro Preview averages 83 turns per task but scores only 30%.
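As a rough illustration of that gap, the two reported data points can be reduced to a score-per-turn figure. This metric is an assumption introduced here for illustration only and is not part of ITBench-AA's scoring.

```python
# Figures reported in the article: (score in percent, average turns per task).
# "Score per turn" is an illustrative metric, not an official ITBench-AA measure.
reported = {
    "GPT-5.5 (xhigh)": (46, 31),
    "Gemini 3.1 Pro Preview": (30, 83),
}


def score_per_turn(score_pct: float, avg_turns: float) -> float:
    """Percentage points of benchmark score earned per average turn."""
    return score_pct / avg_turns


efficiency = {name: score_per_turn(s, t) for name, (s, t) in reported.items()}

for name, value in efficiency.items():
    print(f"{name}: {value:.2f} points per turn")
# GPT-5.5 (xhigh): 1.48 points per turn
# Gemini 3.1 Pro Preview: 0.36 points per turn
```

By this crude measure, GPT-5.5 (xhigh) gets roughly four times as much score out of each turn. Two data points cannot establish a trend on their own, but they match the pattern the research describes.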

Models that take excessive steps fall victim to over-investigation. These agents frequently report false positives as root causes, such as co-occurring symptoms or the explicit fault-injection mechanisms used to create the test environment. Instead of finding the underlying configuration error, the agent gets distracted by the noise of the live system. If you build multi-agent systems, this behavior suggests a need for tighter limits on how long each agent may investigate and for narrowly scoped diagnostic sub-agents.

The AI Operating Model

IBM framed this release alongside its broader AI Operating Model strategy at Think 2026. The goal is to shift enterprise infrastructure away from fragmented pilots toward integrated, core operations. ITBench-AA functions as the verification layer for this shift, providing a standardized way to vet models before they touch live production environments.

The SRE suite represents just the first phase of this evaluation framework. Artificial Analysis and IBM plan to expand the benchmark into Financial Operations (FinOps) and Chief Information Security Officer (CISO) workflows.

When deploying AI for operational infrastructure today, cap the maximum step count in your agent loops to prevent runaway troubleshooting. Enforce strict human-in-the-loop review for any write operations: with every model scoring below 50%, fully autonomous incident resolution is a significant risk to system stability.
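A minimal sketch of both guardrails, assuming a generic agent loop. Every name here (`Action`, `run_bounded_agent`, `MAX_STEPS`, the three callbacks) is hypothetical and does not come from ITBench-AA or any specific agent framework. The cap of 30 is a placeholder to tune per workload.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical types and names for illustration only.


@dataclass
class Action:
    name: str
    is_write: bool = False   # True for anything that mutates the system
    is_final: bool = False   # True when the agent reports its diagnosis


MAX_STEPS = 30  # assumed cap, not a recommended value


def run_bounded_agent(
    next_action: Callable[[List[Tuple[str, str]]], Action],
    execute: Callable[[Action], str],
    approve: Callable[[Action], bool],
    max_steps: int = MAX_STEPS,
) -> Tuple[Optional[Action], List[Tuple[str, str]]]:
    """Run an agent loop with a hard step cap and a human gate on writes."""
    history: List[Tuple[str, str]] = []
    for _ in range(max_steps):
        action = next_action(history)
        if action.is_final:
            return action, history
        if action.is_write and not approve(action):
            # The write is never executed; the agent sees the rejection.
            history.append((action.name, "rejected by reviewer"))
            continue
        history.append((action.name, execute(action)))
    return None, history  # cap reached without a diagnosis: escalate to a human


# Demo with a scripted stand-in for a model: two reads, one write, a diagnosis.
script = [
    Action("read_pod_logs"),
    Action("describe_deployment"),
    Action("restart_deployment", is_write=True),
    Action("report_root_cause", is_final=True),
]

result, history = run_bounded_agent(
    next_action=lambda h: script[len(h)],
    execute=lambda a: f"ran {a.name}",
    approve=lambda a: False,  # the reviewer declines every write in this demo
)
print(result.name, history)
```

In a real deployment, `approve` would block on an actual reviewer, for example through a ticket or chat prompt, and a `None` result would page an on-call engineer rather than retry.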
