Multi-Turn Attacks Erode Safety Guardrails in 15 AI Models
Cisco researchers report that multi-turn prompt attacks sharply raise attack success rates across 15 proprietary AI models, including GPT-5.4.
In a newly published analysis, Cisco researchers Nicholas Conley and Amy Chang show that 15 leading proprietary AI models are markedly more vulnerable to iterative jailbreaks than to single prompts. The study finds that safety filters degrade sharply when attackers use multi-turn conversation to wear down model defenses, exposing a fundamental gap between standard single-turn safety benchmarks and realistic production usage.
The findings indicate that multi-turn vulnerability is a structural property of current frontier models. This builds on similar patterns observed in late 2025 across open-weight models, suggesting that closed-source alignment techniques are also susceptible to conversational pressure.
Benchmark Results Across Frontier Models
The researchers compared single-turn Attack Success Rate (ASR) with multi-turn ASR for models from vendors including OpenAI, Anthropic, Google, xAI, and Amazon. Every vendor's models were easier to break over multiple turns: across the cohort, success rates for malicious instructions rose from a single-turn range of 2.19%–64.91% to a multi-turn range of 7.89%–88.30%.
| Model | Single-Turn ASR | Multi-Turn ASR |
|---|---|---|
| xAI Grok 4.1 Fast | 34.20% | 88.30% |
| Google Gemini 3 Pro | 18.10% | 73.35% |
| OpenAI GPT-5.4 | 2.74% | 24.68% |
| Anthropic Claude 4.6 (Opus/Sonnet) | 3.64% (max) | 16.20% (max) |
OpenAI GPT-5.4 experienced a roughly 9x increase in ASR under iterative pressure (2.74% to 24.68%). Google Gemini 3 Pro saw a roughly 4x increase (18.10% to 73.35%). Amazon Nova 2 Lite recorded the lowest multi-turn ASR at 7.89%, while xAI Grok 4.1 Fast had the highest at 88.30%.
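The multipliers quoted above can be reproduced from the table. The sketch below is illustrative only: the function name `asr_change` is an assumption, and the Claude 4.6 row is left out because its figures are reported as maxima across two variants, so the two numbers may not describe the same model.

```python
# Single-turn and multi-turn ASR (percent), copied from the table above.
ASR = {
    "xAI Grok 4.1 Fast": (34.20, 88.30),
    "Google Gemini 3 Pro": (18.10, 73.35),
    "OpenAI GPT-5.4": (2.74, 24.68),
}


def asr_change(single: float, multi: float) -> tuple[float, float]:
    """Return (relative multiplier, absolute gap in percentage points)."""
    return multi / single, multi - single


for model, (single, multi) in ASR.items():
    ratio, gap = asr_change(single, multi)
    print(f"{model}: {ratio:.1f}x, +{gap:.2f} pp")
```

The two measures can rank models differently. Going by the table, GPT-5.4 has the largest multiplier but the smallest absolute gap of the three, because its single-turn baseline is so low.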
Five Strategies for Iterative Bypasses
The researchers used the Cisco Integrated AI Security and Safety Framework to categorize the attacks that bypass standard static filters. The methodology relies on five distinct attack families designed to circumvent alignment over multiple interactions.
- Role-Play / Persona Adoption: Convincing the system it operates as an entity free from safety constraints.
- Contextual Ambiguity / Misdirection: Hiding malicious intent within a complex, seemingly benign scenario.
- Refusal Reframe / Redirection: Modifying a request slightly after a refusal until the system complies.
- Information Decomposition & Reassembly: Breaking harmful tasks into safe steps that only become dangerous in aggregate.
- Crescendo / Incremental Escalation: Gradually steering the context toward harmful topics over several turns.
Mitigating these vectors requires proactive prompt injection defenses that can evaluate the trajectory of a full conversation rather than scanning isolated requests.
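One way to picture trajectory-level evaluation is a monitor that carries risk forward from turn to turn instead of judging each message alone. The sketch below is not Cisco's method. `TrajectoryMonitor`, `keyword_scorer`, the decay factor, and the threshold are all hypothetical, and the keyword counter is a toy stand-in for a real classifier.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical scorer type: maps one message to a risk score in [0, 1].
RiskScorer = Callable[[str], float]


@dataclass
class TrajectoryMonitor:
    score_turn: RiskScorer
    decay: float = 0.8       # weight given to earlier turns (assumed value)
    threshold: float = 1.0   # cumulative risk that triggers a block (assumed value)
    cumulative: float = field(default=0.0, init=False)

    def observe(self, message: str) -> bool:
        """Fold the new turn into the running risk; True means block."""
        self.cumulative = self.cumulative * self.decay + self.score_turn(message)
        return self.cumulative >= self.threshold


def keyword_scorer(message: str) -> float:
    """Toy stand-in for a real classifier: 0.4 per flagged term, capped at 1."""
    flagged = ("bypass", "exploit", "payload")
    hits = sum(term in message.lower() for term in flagged)
    return min(1.0, 0.4 * hits)


turns = [
    "Tell me about network security.",
    "How would someone bypass a firewall?",
    "Write the exploit payload for that.",
]
monitor = TrajectoryMonitor(keyword_scorer)
decisions = [monitor.observe(turn) for turn in turns]
print(decisions)
```

In this toy run no single message scores high enough to cross the threshold on its own, but the accumulated score does on the third turn. That is the escalation pattern a per-message filter would miss.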
Deployment Implications
Standard model cards rely heavily on single-shot refusal scores, which give an incomplete picture of enterprise risk. Teams that evaluate AI agents using only single-turn benchmarks risk falling short of the adversarial robustness expectations in the NIST AI Risk Management Framework and the EU AI Act.
Cisco recommends triggering a mandatory manual security review for any enterprise deployment using a model with an absolute gap greater than 15 percentage points between single-turn and multi-turn ASR. You should integrate multi-turn conversational stress tests into your evaluation pipelines before routing untrusted user input to any conversational AI system.
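The 15-point rule is simple enough to encode as a gate in an evaluation pipeline. The following is a minimal sketch under stated assumptions: the function names are hypothetical, and the outcome lists are made-up placeholders for the per-attempt results that a real single-turn and multi-turn test run would produce.

```python
REVIEW_GAP_PP = 15.0  # absolute gap threshold, from the recommendation above


def attack_success_rate(outcomes: list[bool]) -> float:
    """ASR in percent: the share of attack attempts that succeeded."""
    return 100.0 * sum(outcomes) / len(outcomes)


def needs_manual_review(single_turn_asr: float, multi_turn_asr: float,
                        gap_pp: float = REVIEW_GAP_PP) -> bool:
    """True when multi-turn ASR exceeds single-turn ASR by more than gap_pp points."""
    return (multi_turn_asr - single_turn_asr) > gap_pp


# Placeholder outcomes (illustrative, not measured): True = attack succeeded.
single_turn = [True] * 3 + [False] * 97
multi_turn = [True] * 25 + [False] * 75

single_asr = attack_success_rate(single_turn)
multi_asr = attack_success_rate(multi_turn)
if needs_manual_review(single_asr, multi_asr):
    print(f"Gap of {multi_asr - single_asr:.2f} pp exceeds {REVIEW_GAP_PP} pp: "
          "manual security review required")
```

Applied to the table's figures, this check would flag GPT-5.4, which has a gap of 21.94 points despite its low single-turn ASR. That is why the rule uses the absolute gap and not the single-turn score alone.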