Frontier Agents Score Below 50% on SRE Task Benchmark
IBM Research and Artificial Analysis launched ITBench-AA, revealing that top frontier AI models score below 50% on complex enterprise SRE tasks.
On May 27, 2026, Artificial Analysis and IBM Research released ITBench-AA, a rigorous evaluation suite for agentic enterprise IT tasks. The benchmark measures how well AI agents resolve complex, real-world operational challenges inside live computing environments. The inaugural release focuses on Site Reliability Engineering (SRE). Data from the initial runs shows every frontier model scoring below 50%, exposing a clear ceiling on what current autonomous agents can do.
If you evaluate and test AI agents for production workloads, this benchmark changes what you can measure. ITBench-AA moves beyond static multiple-choice questions: its 59 tasks are graded with executable scripts and rubric-based scoring. Agents must navigate live Kubernetes incident snapshots, read system logs, and trace dependencies to locate root causes. To guard against data contamination, 19 of the 59 tasks are held out from public datasets.
SRE Task Performance
The current leaderboard highlights a narrow spread among the most advanced models. Claude Opus 4.7 leads when utilizing its Adaptive Reasoning and Max Effort configurations, followed closely by OpenAI’s GPT-5.5.
| Model | Score |
|---|---|
| Claude Opus 4.7 (Adaptive Reasoning, Max Effort) | 47% |
| GPT-5.5 (xhigh) | 46% |
| Qwen3.7 Max | 42% |
| GLM-5.1 (Reasoning) | 40% |
The GLM-5.1 score sets the high-water mark for open-weights models, matching the performance of Google’s proprietary Gemini 3.5 Flash in its high configuration. All of these models perform well on simpler coding evaluations such as Terminal-Bench, which underscores how much harder multi-step diagnostic workflows are.
Trajectory Efficiency and False Positives
A core finding from the research is an inverse relationship between step count and diagnostic accuracy: longer task trajectories do not yield higher success rates. GPT-5.5 (xhigh) averages 31 turns per task to achieve its 46% score. In contrast, Gemini 3.1 Pro Preview averages 83 turns per task but scores only 30%.
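To make the comparison concrete, the short sketch below divides each quoted score by its average turn count. The "points per turn" figure is an illustrative derived metric assumed here for comparison; ITBench-AA itself does not report it, and the function and variable names are hypothetical.

```python
# Figures quoted in this article. "Points per turn" is an illustrative
# derived metric, not something ITBench-AA reports.
runs = {
    "GPT-5.5 (xhigh)": {"score_pct": 46, "avg_turns": 31},
    "Gemini 3.1 Pro Preview": {"score_pct": 30, "avg_turns": 83},
}


def score_per_turn(score_pct: float, avg_turns: float) -> float:
    """Benchmark score (percentage points) earned per average turn."""
    return score_pct / avg_turns


for name, run in runs.items():
    print(f"{name}: {score_per_turn(**run):.2f} points per turn")
```

On these two data points, the shorter trajectory earns roughly four times as many points per turn, which is consistent with the claim that extra steps are not buying extra accuracy.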
Models that take excessive steps fall victim to over-investigation. These agents frequently report false positives, such as co-occurring symptoms or the explicit fault-injection mechanisms used to create the test environment. Instead of finding the underlying configuration error, the agent gets distracted by the noise of the live system. If you build multi-agent systems, this behavior suggests a need for stricter limits on steps and scope, and for focused diagnostic sub-agents.
The AI Operating Model
IBM framed this release alongside its broader AI Operating Model strategy at Think 2026. The goal is to shift enterprise infrastructure away from fragmented pilots toward integrated, core operations. ITBench-AA functions as the verification layer for this shift, providing a standardized way to vet models before they touch live production environments.
The SRE suite represents just the first phase of this evaluation framework. Artificial Analysis and IBM plan to expand the benchmark into Financial Operations (FinOps) and Chief Information Security Officer (CISO) workflows.
When deploying AI for operational infrastructure today, cap the maximum step count in your agent loops to prevent recursive troubleshooting. You must enforce strict human-in-the-loop review for any write operations, as the current sub-50% accuracy rate makes fully autonomous incident resolution a significant risk to system stability.
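The two guardrails above can be sketched as a small agent loop. This is a minimal illustration under stated assumptions, not part of ITBench-AA or any agent framework: `Action`, `run_agent`, the callback signatures, and the default budget of 30 steps are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Action:
    """Hypothetical action record emitted by an agent's planner."""
    name: str
    is_write: bool      # True for anything that mutates the system
    done: bool = False  # True when the agent declares a final diagnosis


def run_agent(
    next_action: Callable[[List[Tuple[str, str]]], Action],
    execute: Callable[[Action], None],
    approve: Callable[[Action], bool],
    max_steps: int = 30,  # arbitrary illustrative budget
) -> List[Tuple[str, str]]:
    """Run an agent loop with a hard step cap and a human gate on writes."""
    log: List[Tuple[str, str]] = []
    for _ in range(max_steps):
        action = next_action(log)
        if action.done:
            log.append(("final", action.name))
            return log
        # Write operations run only after explicit human approval.
        if action.is_write and not approve(action):
            log.append(("blocked", action.name))
            continue
        log.append(("ran", action.name))
        execute(action)
    # Budget exhausted: stop instead of continuing to investigate.
    log.append(("halted", "step budget exhausted"))
    return log
```

In practice `approve` might open a ticket or prompt an on-call engineer, and `next_action` would wrap the model call. The point of the design is that both limits live outside the model, so a distracted agent cannot talk its way past them.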