arXiv Study Finds Frontier AI Agents Are Rapidly Improving at Multi-Step Cyberattacks

An arXiv paper posted on March 11, 2026, and revised on March 13, reports a sharp increase in frontier AI agents’ ability to execute multi-step cyberattacks in controlled environments. In “Measuring AI Agents’ Progress on Multi-Step Cyber Attack Scenarios,” the authors evaluate seven models on a 32-step corporate network attack and a 7-step industrial control system (ICS) attack. Average progress on the corporate range rose from 1.7 steps for GPT-4o to 9.8 steps for Opus 4.6 at the same 10 million token budget. The study measures autonomous, chained attack execution: whether current models can maintain state, select tools, recover from errors, and continue through long offensive sequences.

Benchmark Results

The paper evaluates seven frontier models released between August 2024 and February 2026. The headline result is the pace of improvement on the corporate network scenario.

Scenario	Metric	Baseline or scope	Comparison point	Result
Corporate network	Average steps at 10M tokens	GPT-4o (Aug 2024)	Opus 4.6 (Feb 2026)	1.7 → 9.8
Corporate network	Best single run	Various	Best overall	22 / 32 steps
Corporate network	Improvement from more compute	10M → 100M token budget	Best observed	Up to 59%
ICS	Average steps completed	Newest models	Best recent	1.2–1.4 / 7

The best single run, 22 of 32 steps, corresponds to roughly 6 hours of a task estimated to take a human expert 14 hours. The ICS scenario remains substantially harder: the newest models were the first to reliably complete any ICS steps at all.
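
To put the reported figures on a common footing, the short sketch below recomputes a few ratios from the numbers quoted above. It adds no new data; the variable names are illustrative and do not come from the paper.

```python
# Figures quoted in this article; the names are illustrative, not the paper's.
GPT4O_AVG_STEPS = 1.7     # corporate range, 10M-token budget
OPUS46_AVG_STEPS = 9.8    # corporate range, 10M-token budget
BEST_RUN_STEPS = 22
TOTAL_STEPS = 32
BEST_RUN_HOURS = 6        # approximate expert hours covered by the best run
EXPERT_TASK_HOURS = 14    # estimated expert time for the full scenario

# How many times further the newer model got, on average.
improvement_factor = OPUS46_AVG_STEPS / GPT4O_AVG_STEPS

# Share of the scenario covered by the best run, by steps and by hours.
best_run_step_share = BEST_RUN_STEPS / TOTAL_STEPS
best_run_hour_share = BEST_RUN_HOURS / EXPERT_TASK_HOURS

print(f"Average progress grew about {improvement_factor:.1f}x")
print(f"Best run covered {best_run_step_share:.0%} of steps")
print(f"Best run covered about {best_run_hour_share:.0%} of expert hours")
```

By this arithmetic, average progress grew roughly 5.8x, and the best run covered about 69% of the steps but only about 43% of the estimated expert time, which suggests the remaining steps account for a disproportionate share of the work.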

Inference-Time Compute

The strongest technical signal is the relationship between token budget and attack progress. Increasing inference-time compute from 10 million to 100 million tokens improved performance by up to 59%, with no plateau in that range. Model weights matter, but deployment-time compute allocation also matters. If a model gets materially better at offensive task execution simply by being allowed to spend more tokens, security review cannot stop at “which base model are you using.”

For teams shipping agents, longer context, more retries, more tool calls, and larger search budgets can unlock capabilities that were not visible in shallow testing. Context engineering and execution policy are now part of the security surface. See “Context Engineering: The Most Important AI Skill in 2026” alongside your eval design.

Evaluation Design and Adjacent Signals

The environments are purpose-built cyber ranges designed to require multi-step planning and cross-domain actions rather than one-shot solutions. METR’s public time horizon work, updated March 3, 2026, argues that frontier agents are getting better at longer software tasks over time. The cyber paper applies a similar long-horizon framing to offensive operations. The same scaffolding patterns that help an agent debug a codebase can also help it persist through reconnaissance, credential use, lateral movement, and post-exploitation steps.

The paper does not show autonomous real-world compromise at scale. It shows meaningful progress in controlled environments on tasks that require extended planning and execution. Partial autonomy can reduce operator workload even when the system cannot fully complete the mission. Microsoft’s March 9 security post discusses protections against agent-based attacks, model tampering, and prompt manipulation. IBM’s 2026 X-Force Threat Index points to AI-accelerated attacker workflows. Frontier agents are improving on longer tasks, and security organizations are starting to treat autonomous attack-chain execution as an operational problem that needs controls.

Implications for Agent Builders

Execution budgets need governance. Token caps, retry limits, tool-use quotas, and session duration are security controls. They affect capability, not just cost. See “Anthropic Makes Claude’s 1M Token Context Generally Available” for the practical side of long-context rollout.
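
As a rough illustration of treating these limits as enforced controls rather than billing settings, here is a minimal Python sketch. The `ExecutionBudget` class, its field names, and the example limits are all hypothetical; a real agent framework would wire equivalent checks into its own loop.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when a session hits one of its configured limits."""


@dataclass
class ExecutionBudget:
    # All names and limits here are hypothetical; tune them per deployment.
    max_tokens: int
    max_retries: int
    max_tool_calls: int
    max_session_seconds: float
    tokens_used: int = 0
    retries_used: int = 0
    tool_calls_used: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def _check_clock(self) -> None:
        # Session duration is checked on every charge.
        if time.monotonic() - self.started_at > self.max_session_seconds:
            raise BudgetExceeded("session duration limit reached")

    def charge_tokens(self, count: int) -> None:
        self._check_clock()
        if self.tokens_used + count > self.max_tokens:
            raise BudgetExceeded("token cap reached")
        self.tokens_used += count

    def charge_tool_call(self) -> None:
        self._check_clock()
        if self.tool_calls_used + 1 > self.max_tool_calls:
            raise BudgetExceeded("tool-use quota reached")
        self.tool_calls_used += 1

    def charge_retry(self) -> None:
        self._check_clock()
        if self.retries_used + 1 > self.max_retries:
            raise BudgetExceeded("retry limit reached")
        self.retries_used += 1


budget = ExecutionBudget(
    max_tokens=1_000, max_retries=2, max_tool_calls=3, max_session_seconds=60.0
)
budget.charge_tokens(600)
budget.charge_tool_call()
try:
    budget.charge_tokens(600)  # would bring the total to 1,200, above the cap
except BudgetExceeded as exc:
    stop_reason = str(exc)
    print(f"session stopped: {stop_reason}")
```

Failing closed with an exception, instead of logging and continuing, is the design choice that makes the budget a control: the agent loop has to handle the stop explicitly.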

Evals need long-horizon adversarial cases. Many agent test suites still focus on happy-path task completion. That misses the category this paper measures: stateful, multi-step, tool-mediated persistence toward a risky objective. Extend evaluation discipline to agent behavior over many steps. “How to Evaluate AI Output (LLM-as-Judge Explained)” is a starting point; adapt the process for tool-use traces and attack-path simulations.
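
One way to score long-horizon behavior is to count how far each run gets through an ordered list of milestones, then average across runs. The sketch below is an assumption about how such scoring could look, not the paper's harness; the milestone labels and trace format are hypothetical placeholders for events your own evaluation environment would emit.

```python
from typing import List, Sequence


def steps_completed(trace: Sequence[str], milestones: Sequence[str]) -> int:
    """Count milestones reached in order; stop at the first one missed."""
    events = list(trace)
    completed = 0
    position = 0
    for milestone in milestones:
        try:
            # Only look at events after the previous milestone was reached.
            position = events.index(milestone, position) + 1
        except ValueError:
            break
        completed += 1
    return completed


def average_progress(runs: Sequence[Sequence[str]], milestones: Sequence[str]) -> float:
    """Mean number of in-order milestones completed across runs."""
    return sum(steps_completed(run, milestones) for run in runs) / len(runs)


# Hypothetical milestone labels for a sandboxed test range.
MILESTONES: List[str] = ["recon", "initial_access", "credential_use", "lateral_movement"]

# Hypothetical traces: each is the ordered list of events one run produced.
runs = [
    ["recon", "initial_access", "error", "credential_use"],
    ["recon", "error"],
    ["initial_access", "recon"],  # out of order: only "recon" counts
]

per_run = [steps_completed(run, MILESTONES) for run in runs]
mean_steps = average_progress(runs, MILESTONES)
print(per_run, round(mean_steps, 2))
```

Requiring milestones in order keeps the metric focused on chained progress, so a run that stumbles onto a late step without the earlier ones does not inflate the score.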

Capabilities can rise after model release through scaffolding and inference-time compute, even when the underlying weights do not change. Deployment policy has to be versioned and reviewed like code. A model approval record without execution-policy constraints is incomplete.
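
A minimal sketch of what "incomplete without execution-policy constraints" could mean in practice: an approval record that fails validation unless a versioned policy is attached. The record shape, field names, and example values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ExecutionPolicy:
    # Hypothetical fields; version the policy alongside application code.
    version: str
    max_tokens: int
    max_retries: int
    allowed_tools: Tuple[str, ...]


@dataclass(frozen=True)
class ModelApproval:
    model_name: str
    approved_by: str
    execution_policy: Optional[ExecutionPolicy] = None


def validate_approval(record: ModelApproval) -> List[str]:
    """Return a list of problems; an empty list means the record is complete."""
    policy = record.execution_policy
    if policy is None:
        return ["missing execution policy"]
    problems: List[str] = []
    if not policy.version:
        problems.append("policy has no version")
    if policy.max_tokens <= 0:
        problems.append("token cap must be positive")
    if policy.max_retries < 0:
        problems.append("retry limit cannot be negative")
    if not policy.allowed_tools:
        problems.append("no tool allowlist")
    return problems


bare = ModelApproval(model_name="example-model", approved_by="security-review")
complete = ModelApproval(
    model_name="example-model",
    approved_by="security-review",
    execution_policy=ExecutionPolicy(
        version="2026-03-01", max_tokens=10_000_000, max_retries=3,
        allowed_tools=("read_file",),
    ),
)
print(validate_approval(bare))
print(validate_approval(complete))
```

A check like this could run in CI so that raising a token cap or adding a tool shows up as a reviewed diff rather than a silent configuration change.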

If your application grants agents access to shells, browsers, internal APIs, or code repositories, test the system under higher token budgets and longer action horizons than you currently allow in production. Add approval gates, tool-level permissions, and traceable execution logs before you expand context windows or retry budgets.
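
The three controls named above can be combined in a single wrapper around tool calls. The following is a sketch under stated assumptions: the `ToolGate` class, the permission labels, and the tool names are hypothetical, and the approver is a stand-in for whatever human or policy review a real system would use.

```python
import time
from typing import Any, Callable, Dict, List

# Hypothetical permission table: anything not listed is denied by default.
TOOL_PERMISSIONS: Dict[str, str] = {
    "read_file": "allow",
    "run_shell": "needs_approval",
}


class ToolGate:
    """Checks tool-level permissions, asks for approval, and logs every call."""

    def __init__(
        self,
        permissions: Dict[str, str],
        approver: Callable[[str, Dict[str, Any]], bool],
    ) -> None:
        self.permissions = permissions
        self.approver = approver
        self.log: List[Dict[str, Any]] = []

    def call(self, tool_name: str, tool_fn: Callable[..., Any], **kwargs: Any) -> Any:
        rule = self.permissions.get(tool_name, "deny")
        if rule == "allow":
            decision = "allow"
        elif rule == "needs_approval" and self.approver(tool_name, kwargs):
            decision = "allow"
        else:
            decision = "deny"
        # Record the attempt before acting, so denied calls are traceable too.
        self.log.append(
            {"ts": time.time(), "tool": tool_name, "args": kwargs,
             "rule": rule, "decision": decision}
        )
        if decision != "allow":
            raise PermissionError(f"{tool_name} blocked by policy ({rule})")
        return tool_fn(**kwargs)


# Stand-in approver that rejects everything; a real one would ask a reviewer.
gate = ToolGate(TOOL_PERMISSIONS, approver=lambda tool, args: False)

result = gate.call("read_file", lambda path: f"contents of {path}", path="notes.txt")

blocked = []
for name in ("run_shell", "unknown_tool"):
    try:
        gate.call(name, lambda **_: None, command="whoami")
    except PermissionError:
        blocked.append(name)

print(result, blocked, len(gate.log))
```

Defaulting unknown tools to "deny" means that adding a new tool requires an explicit policy entry before an agent can use it.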
