
How to Automate Agent Evaluation With Google Quality Flywheel

Learn how to configure Google's new Agent Quality Flywheel skill to automate evaluation, grading, and prompt optimization for your AI coding agents.

Google’s new Agent Quality Flywheel automates the evaluation and optimization loops for coding agents. Instead of manually inspecting traces, you can install a skill that pulls OpenTelemetry data, grades trajectories using DeepMind-trained AutoRaters, and executes logic tweaks to fix regressions. This guide covers how to set up the flywheel, configure the five-stage pipeline, and integrate it with your existing development environment.

Installation and Setup

You deploy the flywheel by adding a specific skill to your agent environment. Google provides two installation paths depending on whether you are using their native tooling or a third-party framework.

For developers building with the Google Agent Development Kit (ADK) and the Antigravity platform, add the skill via the agents-cli:

    npx skills add https://github.com/google/agents-cli --skill google-agents-cli-eval

If you are using a different framework but still want to route evaluations through the Gemini Enterprise Agent Platform, use the generic SDK integration:

    npx skills add https://github.com/google/skills --skill agent-platform-eval-flywheel

Both commands inject the evaluation primitives into your agent’s manifest. It helps to understand what agent skills are before you run them, because this installation grants the agent permission to read its own historical traces and write prompt optimizations back to its configuration.

The Five-Stage Evaluation Pipeline

Once installed, the flywheel executes a continuous five-stage loop. You can trigger this loop manually during local testing or set it to run continuously against production traffic.

1. Prepare Data

The system builds evaluation datasets automatically. It ingests your OpenTelemetry (OTel) traces and combines them with hand-crafted test cases. For edge cases, it uses a built-in User Simulator to synthesize adversarial or highly complex scenarios that your agent rarely encounters in standard traffic.
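To picture what this stage produces, here is a minimal sketch of merging the three sources into one deduplicated dataset. It is an illustration only, not the flywheel's code: `EvalCase`, `build_dataset`, and the trace keys `trace_id` and `input` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    """One evaluation input plus a label recording where it came from."""
    case_id: str
    prompt: str
    source: str  # "trace", "handcrafted", or "simulated"


def build_dataset(traces, handcrafted, simulated=()):
    """Merge the three sources into one list, dropping duplicate prompts.

    Hand-crafted cases are added first so they win when a prompt is duplicated.
    The trace dict keys ("trace_id", "input") are assumed for illustration.
    """
    cases = list(handcrafted)
    cases += [
        EvalCase(case_id=t["trace_id"], prompt=t["input"], source="trace")
        for t in traces
    ]
    cases += list(simulated)

    seen = set()
    dataset = []
    for case in cases:
        key = case.prompt.strip().lower()  # normalize before comparing
        if key not in seen:
            seen.add(key)
            dataset.append(case)
    return dataset
```

Keeping the `source` label on every case lets you later compare scores on real traffic against scores on simulated edge cases.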

2. Run Inference

The agent executes the prepared dataset to generate fresh traces. During this stage, the framework captures all sub-agent calls, tool executions, and state changes. If you build graph-based workflows using ADK for Go 2.0, the inference engine maps the entire execution graph for precise step-level debugging.
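Step-level debugging amounts to walking the captured trace as a tree. The sketch below flattens a list of spans into ordered steps using the parent/child relationship that OTel spans carry. The dictionary keys (`span_id`, `parent_id`, `kind`, `name`) and the `flatten_trace` function are assumptions made for this example, not the framework's actual trace schema.

```python
from collections import defaultdict


def flatten_trace(spans):
    """Return (depth, kind, name) tuples in depth-first order.

    Spans are dicts with hypothetical keys: span_id, parent_id, kind, name.
    A parent_id of None marks a root span.
    """
    children = defaultdict(list)
    for span in spans:
        children[span["parent_id"]].append(span)

    steps = []

    def visit(span, depth):
        steps.append((depth, span["kind"], span["name"]))
        for child in children[span["span_id"]]:
            visit(child, depth + 1)

    for root in children[None]:
        visit(root, 0)
    return steps
```

Printing each step indented by its depth gives a quick outline of which sub-agent made which tool call.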

3. Grade Results

The flywheel relies on adaptive AutoRaters developed in collaboration with Google DeepMind. These model-based judges score the generated traces across two primary metrics: Task Success and Trajectory Quality. The AutoRaters support multi-turn evaluations and output both numerical scores and natural language explanations for their grading decisions.
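If you want to post-process grades yourself, a record like the one below is one way to hold them. This is a sketch under stated assumptions: the `Grade` type, the 0.0 to 1.0 score scale, and the `summarize` helper are hypothetical, since the guide does not specify the AutoRater output format.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Grade:
    """Hypothetical shape for one AutoRater verdict on one trace."""
    trace_id: str
    task_success: float        # assumed to be on a 0.0-1.0 scale
    trajectory_quality: float  # assumed to be on a 0.0-1.0 scale
    explanation: str           # the rater's natural language rationale


def summarize(grades, pass_threshold=0.5):
    """Average both metrics and list the traces below the success threshold.

    Raises statistics.StatisticsError if `grades` is empty.
    """
    failing = [g.trace_id for g in grades if g.task_success < pass_threshold]
    return {
        "mean_task_success": mean(g.task_success for g in grades),
        "mean_trajectory_quality": mean(g.trajectory_quality for g in grades),
        "failing_trace_ids": failing,
    }
```

The explanations of the failing traces are the natural input for the next stage.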

4. Analyze Failures

Instead of flagging individual errors, the analysis engine clusters failures. This semantic grouping isolates systemic regressions, such as the agent consistently misunderstanding a specific API schema, rather than highlighting transient network timeouts.
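The idea behind clustering can be shown with a deliberately crude stand-in. The sketch below groups failure explanations by word overlap (Jaccard similarity) using a greedy pass. The flywheel's semantic grouping is presumably model-based and more capable; this swapped-in technique, the function names, and the 0.5 threshold are assumptions for illustration only.

```python
import re


def tokens(text):
    """Lowercase word set used as a crude stand-in for a semantic embedding."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def jaccard(a, b):
    """Overlap between two token sets, from 0.0 (disjoint) to 1.0 (identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_failures(explanations, threshold=0.5):
    """Greedily add each explanation to the first cluster whose seed it resembles."""
    clusters = []  # each entry: (seed_tokens, [member explanations])
    for text in explanations:
        toks = tokens(text)
        for seed, members in clusters:
            if jaccard(seed, toks) >= threshold:
                members.append(text)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append((toks, [text]))
    # Largest clusters first: those are the likeliest systemic regressions.
    return sorted((m for _, m in clusters), key=len, reverse=True)
```

In this toy version, two explanations about a wrong field in the same API schema land in one cluster, while a one-off network timeout stays in a cluster of its own and sorts to the bottom.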

5. Optimize Configuration

Based on the failure clusters, the flywheel executes targeted prompt or logic tweaks. It updates the agent’s system instructions or context window parameters to address the identified regressions, completing the automated improvement cycle.
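A useful property for any automated rewrite of system instructions is idempotence: running the loop twice should not stack the same fix twice. The sketch below shows one way to get that. `AgentConfig` and `apply_fix` are hypothetical names, and appending a tagged note is only one possible form of tweak, not necessarily what the flywheel does.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AgentConfig:
    """Hypothetical, simplified view of an agent's editable configuration."""
    system_instructions: str
    revision: int = 0


def apply_fix(config, cluster_label, guidance):
    """Append guidance for one failure cluster and bump the revision.

    Returns the config unchanged if the same note is already present,
    so re-running the loop does not stack duplicate instructions.
    """
    note = f"[{cluster_label}] {guidance}"
    if note in config.system_instructions:
        return config
    return replace(
        config,
        system_instructions=f"{config.system_instructions}\n{note}",
        revision=config.revision + 1,
    )
```

Because the config is immutable and versioned, you can diff any revision against the previous one and roll back a tweak that makes scores worse.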

Configuration Options and Limitations

The default configuration routes all AutoRater processing through the Gemini Enterprise Agent Platform. This moves the model-heavy grading work off your local machine and your primary production cluster.

Automated evaluation has inherent tradeoffs. Running the User Simulator and AutoRaters consumes significant token volume, which can inflate costs if you run them on every pull request. The clustering analysis in the fourth stage also needs enough traces for recurring failure patterns to stand out from one-off errors. Low-traffic internal tools may generate too few failures for the semantic grouping to provide actionable insights.
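Before wiring the loop into every pull request, it is worth doing the arithmetic. The helper below is a back-of-the-envelope sketch: the function names are made up for this example, and every input is a number you would measure on your own runs, not a figure published by Google.

```python
def estimate_run_tokens(num_cases, simulator_tokens_per_case,
                        inference_tokens_per_case, rater_tokens_per_case):
    """Rough token total for one evaluation run.

    All per-case figures are your own measurements (hypothetical inputs).
    """
    per_case = (simulator_tokens_per_case
                + inference_tokens_per_case
                + rater_tokens_per_case)
    return num_cases * per_case


def runs_within_budget(token_budget, tokens_per_run):
    """How many full runs fit in a token budget, for example per week."""
    if tokens_per_run <= 0:
        raise ValueError("tokens_per_run must be positive")
    return token_budget // tokens_per_run
```

If the number of affordable runs is lower than the number of pull requests you merge in the same period, a nightly or sampled schedule is the safer default.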

For production deployments, configure the flywheel to sample a percentage of OTel traces rather than evaluating 100% of live traffic. You can adjust this sampling rate in the skill’s environment variables to balance optimization speed with API expenditure.
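Sampling itself is simple to reason about. The sketch below shows deterministic, hash-based sampling keyed on the trace ID. It is an assumption-laden illustration: the variable name `EVAL_TRACE_SAMPLE_RATE` is a placeholder (check the skill's documentation for the real setting), and the flywheel may sample differently.

```python
import hashlib
import os


def sample_rate_from_env(name="EVAL_TRACE_SAMPLE_RATE", default=0.1):
    """Read a 0.0-1.0 sampling rate; the variable name is a placeholder."""
    rate = float(os.environ.get(name, default))
    if not 0.0 <= rate <= 1.0:
        raise ValueError(f"{name} must be between 0.0 and 1.0, got {rate}")
    return rate


def should_evaluate(trace_id, rate):
    """Deterministically keep roughly `rate` of traces, keyed on the trace ID.

    Hashing the ID (instead of calling random()) means the same trace always
    gets the same decision, even across processes.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Read the first 8 bytes as an integer in [0, 2**64) and compare it
    # against the cut-off that corresponds to the requested rate.
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < rate * 2**64
```

A rate of 0.0 evaluates nothing and 1.0 evaluates every trace, so you can start low and raise the rate once you know what a run costs.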
