Google’s ConvApparel Fixes Unrealistically Polite AI Simulators
Google Research introduces ConvApparel, a benchmark dataset designed to bridge the realism gap by training LLM user simulators to act more like real humans.
On April 9, 2026, Google Research released ConvApparel, a new benchmark dataset and evaluation framework designed to measure the realism gap in Large Language Model user simulators. Simulated users frequently act with unnatural patience and encyclopedic knowledge, creating an artificial environment where agents succeed in testing but fail when they encounter real human frustration or ambiguity.
The Realism Gap in User Simulation
Modern Conversational Recommender Systems require extensive testing before deployment, but live human testing is difficult to scale, so developers use LLMs to simulate user traffic instead.
These simulators default to highly cooperative behavior: they answer questions perfectly, never misunderstand prompts, and never grow annoyed. This “easy mode” environment prevents conversational agents from learning how to handle the suboptimal interactions they will face in the real world.
ConvApparel Dataset Specifications
ConvApparel provides ground-truth data on how humans actually react to system failures. The dataset contains over 4,000 multi-turn human-AI conversations focused on the apparel domain, totaling nearly 15,000 conversational turns. The corpus extends the Amazon Reviews ’23 dataset across categories such as “Clothing, Shoes and Jewelry,” “Sports and Outdoors,” and “Video Games.”
Dual-Agent Protocol: Human participants were routed to either a helpful “Good” agent or a deliberately unhelpful “Bad” agent. This forced the collection of realistic negative reactions. The dataset includes turn-by-turn annotations of internal user states. These annotations provide explicit labels for user satisfaction and frustration levels.
The Three-Pillar Evaluation Framework
ConvApparel utilizes a hybrid protocol to quantify how closely a synthetic user mirrors a human. If you build and test AI agents, this methodology provides a blueprint for stress-testing conversational flows.
Statistical Alignment: This metric measures the quantitative overlap between simulator behavior and real human data.
Human-Likeness Score: A discriminator-based metric distinguishes between generated synthetic responses and actual human text.
Counterfactual Validation: Simulators trained exclusively on optimal interactions are tested against suboptimal system responses. This tests the simulator’s ability to extrapolate annoyance and adapt to unseen conversational failures.
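The first pillar, statistical alignment, can be illustrated with a toy distance between behavior distributions. The article does not specify the exact metric used, so this sketch uses total variation distance over frustration labels purely as a stand-in: 0.0 means the simulator's label distribution matches the human one exactly, 1.0 means they never overlap.

```python
from collections import Counter

def total_variation_distance(a, b):
    """Total variation distance between two empirical distributions.

    Illustrative stand-in for a statistical-alignment metric; the
    benchmark's actual metric may differ.
    """
    ca, cb = Counter(a), Counter(b)
    keys = set(ca) | set(cb)
    pa = {k: ca[k] / len(a) for k in keys}
    pb = {k: cb[k] / len(b) for k in keys}
    return 0.5 * sum(abs(pa[k] - pb[k]) for k in keys)

# Toy data: frustration labels (1-5) from human logs vs. a simulator
# that is "too patient" and skews toward low frustration.
human_frustration = [1, 1, 2, 3, 5, 5, 4, 2]
simulated_frustration = [1, 1, 1, 2, 2, 2, 3, 1]

print(round(total_variation_distance(human_frustration,
                                     simulated_frustration), 3))  # → 0.375
```

A high distance here is exactly the "unrealistically polite" failure mode: the simulator rarely produces the high-frustration responses that humans actually exhibit.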
Training Methods and Simulator Fidelity
Data-driven simulators trained on human interaction data outperform purely prompt-based simulators, which struggle to adapt to novel, frustrating conversational dynamics.
The highest-fidelity simulators use Reinforcement Learning with iterative critique, a training approach that significantly increases the human-likeness of the simulated user. Data-driven simulators successfully mirror human annoyance when confronted with unexpected agent failures, though a measurable gap between the top-performing simulators and human baselines persists.
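Counterfactual validation, as described above, boils down to one check: probe a simulator with a deliberately unhelpful reply it never saw in training and verify that its expressed frustration rises. The sketch below uses a hypothetical `simulate_user_turn` interface with a trivial keyword heuristic in place of an LLM; it is not an API from the ConvApparel release.

```python
def simulate_user_turn(stated_need, agent_reply):
    """Stand-in user simulator: returns (response_text, frustration 1-5).

    A real simulator would be an LLM trained on interaction data; this
    toy version only checks whether the agent's reply acknowledges any
    word of the user's stated need.
    """
    if any(word in agent_reply.lower() for word in stated_need.lower().split()):
        return ("Great, tell me more.", 1)
    return ("That's not what I asked for.", 4)

stated_need = "waterproof hiking boots"
good_reply = "Here are three waterproof hiking boots in your size."
bad_reply = "Have you considered our new swimwear line?"

_, frustration_good = simulate_user_turn(stated_need, good_reply)
_, frustration_bad = simulate_user_turn(stated_need, bad_reply)

# A simulator that extrapolates annoyance should score the bad reply higher.
print(frustration_bad > frustration_good)  # → True
```

In the benchmark's framing, a simulator trained only on cooperative "good-agent" data passes this test only if it extrapolates: it must produce a frustrated reaction to a failure mode absent from its training distribution.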
The ConvApparel dataset is available under a CC BY-SA 4.0 license. If you rely on LLMs to simulate user traffic for your conversational agents, pure prompt-based simulation is likely masking critical failure states. Integrating human frustration data and counterfactual validation into your testing pipeline ensures your agents learn to handle uncooperative users before reaching production.