Multi-Turn Attacks Erode Safety Guardrails in 15 AI Models

In a newly published analysis, Cisco researchers Nicholas Conley and Amy Chang show that 15 leading proprietary AI models are substantially more vulnerable to iterative, multi-turn jailbreaks than to single-prompt attacks. The study finds that safety filters degrade sharply when attackers use extended conversations to wear down model defenses, exposing a fundamental gap between standard single-turn safety benchmarks and realistic production usage.

The findings indicate that multi-turn vulnerability is a structural property of current frontier models. They extend similar patterns observed in late 2025 across open-weight models and suggest that closed-source alignment techniques are also susceptible to conversational pressure.

Benchmark Results Across Frontier Models

The researchers compared single-turn and multi-turn Attack Success Rates (ASR), finding substantial vulnerabilities across models from vendors including OpenAI, Anthropic, Google, xAI, and Amazon. Across the cohort, success rates for malicious instructions rose from a single-turn range of 2.19%–64.91% to a multi-turn range of 7.89%–88.30%.

Model	Single-Turn ASR	Multi-Turn ASR
xAI Grok 4.1 Fast	34.20%	88.30%
Google Gemini 3 Pro	18.10%	73.35%
OpenAI GPT-5.4	2.74%	24.68%
Anthropic Claude 4.6 (Opus/Sonnet)	3.64% (max)	16.20% (max)

Under iterative pressure, OpenAI GPT-5.4's multi-turn ASR was roughly 9 times its single-turn rate (2.74% to 24.68%), and Google Gemini 3 Pro's was roughly 4 times (18.10% to 73.35%). Amazon Nova 2 Lite recorded the lowest multi-turn ASR at 7.89%, while xAI Grok 4.1 Fast proved the most susceptible overall at 88.30%.
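The multipliers quoted above can be checked directly against the table. The following is a minimal sketch in Python; the dictionary layout and the `fold_increase` helper are illustrative names, not part of the Cisco study, and the figures are copied from the table as published here.

```python
# Single-turn and multi-turn ASR (in percent), copied from the table above.
# The Claude row is omitted because its figures are reported as maxima.
ASR = {
    "xAI Grok 4.1 Fast": (34.20, 88.30),
    "Google Gemini 3 Pro": (18.10, 73.35),
    "OpenAI GPT-5.4": (2.74, 24.68),
}


def fold_increase(single_turn: float, multi_turn: float) -> float:
    """Return multi-turn ASR expressed as a multiple of single-turn ASR."""
    return multi_turn / single_turn


# Round to two decimals for display.
fold = {model: round(fold_increase(s, m), 2) for model, (s, m) in ASR.items()}

for model, value in fold.items():
    print(f"{model}: {value}x")
```

Note that the relative and absolute views rank the models differently: Grok 4.1 Fast has the highest multi-turn ASR but the smallest multiplier of the three, because its single-turn rate was already high.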

Five Strategies for Iterative Bypasses

The researchers used the Cisco Integrated AI Security and Safety Framework to categorize the attacks that bypass standard static filters. The methodology relies on five distinct attack families designed to circumvent alignment over multiple interactions.

Role-Play / Persona Adoption: Convincing the system it operates as an entity free from safety constraints.
Contextual Ambiguity / Misdirection: Hiding malicious intent within a complex, seemingly benign scenario.
Refusal Reframe / Redirection: Modifying a request slightly after a refusal until the system complies.
Information Decomposition & Reassembly: Breaking harmful tasks into safe steps that only become dangerous in aggregate.
Crescendo / Incremental Escalation: Gradually steering the context toward harmful topics over several turns.

Mitigating these vectors requires proactive prompt injection defenses that can evaluate the trajectory of a full conversation rather than scanning isolated requests.
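To illustrate the difference between scanning isolated requests and evaluating a conversation's trajectory, here is a minimal sketch in Python. It assumes some upstream classifier has already assigned each turn a risk score between 0 and 1; the scores, thresholds, decay factor, and function names are all hypothetical and are not taken from the Cisco study or any specific product.

```python
from typing import List

PER_TURN_LIMIT = 0.8    # hypothetical threshold for an isolated-request filter
TRAJECTORY_LIMIT = 1.5  # hypothetical threshold on accumulated risk
DECAY = 0.9             # older turns count slightly less than recent ones


def isolated_filter(scores: List[float]) -> bool:
    """Flag the conversation only if a single turn exceeds the per-turn limit."""
    return any(s > PER_TURN_LIMIT for s in scores)


def trajectory_filter(scores: List[float]) -> bool:
    """Flag the conversation once the decayed running sum crosses the limit."""
    running = 0.0
    for s in scores:
        running = running * DECAY + s
        if running > TRAJECTORY_LIMIT:
            return True
    return False


# Hypothetical per-turn scores: a slow escalation versus an ordinary chat.
crescendo = [0.2, 0.3, 0.4, 0.5, 0.6]
benign = [0.1, 0.0, 0.2, 0.1, 0.0]

print(isolated_filter(crescendo), trajectory_filter(crescendo))
print(isolated_filter(benign), trajectory_filter(benign))
```

In this toy example no single turn of the escalating conversation crosses the per-turn limit, so the isolated filter passes it, while the accumulated score trips the trajectory filter on the fifth turn. A real defense would need a far richer signal than a scalar score, but the sketch shows why per-request scanning alone can miss incremental escalation.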

Deployment Implications

Standard model cards rely heavily on single-shot refusal scores, which give an incomplete picture of enterprise risk. For teams evaluating and testing AI agents, relying entirely on single-turn benchmarks risks falling short of the adversarial robustness expectations in the NIST AI Risk Management Framework and the EU AI Act.

Cisco recommends triggering a mandatory manual security review for any enterprise deployment using a model with an absolute gap greater than 15 percentage points between single-turn and multi-turn ASR. You should integrate multi-turn conversational stress tests into your evaluation pipelines before routing untrusted user input to any conversational AI system.
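As a rough illustration of the 15-percentage-point rule, the sketch below applies it to the four rows published in the table above. The `needs_manual_review` helper and the data layout are illustrative assumptions. The Claude 4.6 figures are reported as maxima across Opus and Sonnet, so its computed gap may not correspond to a single model.

```python
REVIEW_THRESHOLD_PP = 15.0  # absolute gap, in percentage points

# (single-turn ASR, multi-turn ASR) in percent, copied from the table above.
ASR = {
    "xAI Grok 4.1 Fast": (34.20, 88.30),
    "Google Gemini 3 Pro": (18.10, 73.35),
    "OpenAI GPT-5.4": (2.74, 24.68),
    "Anthropic Claude 4.6 (Opus/Sonnet)": (3.64, 16.20),  # maxima, see note
}


def needs_manual_review(single_turn: float, multi_turn: float,
                        threshold: float = REVIEW_THRESHOLD_PP) -> bool:
    """Return True when the absolute ASR gap exceeds the threshold."""
    return (multi_turn - single_turn) > threshold


gaps = {model: round(m - s, 2) for model, (s, m) in ASR.items()}
flagged = [model for model, (s, m) in ASR.items() if needs_manual_review(s, m)]

for model, gap in gaps.items():
    status = "manual review" if model in flagged else "below threshold"
    print(f"{model}: gap {gap} pp -> {status}")
```

On these figures, three of the four listed models exceed the threshold, including GPT-5.4, whose multi-turn ASR is comparatively low in absolute terms but sits about 22 points above its single-turn rate.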
