Google Study Reveals Need for 10+ Raters in AI Benchmarks
New Google Research shows that standard AI benchmarks require more than 10 raters per item to capture human nuance and ensure scientific reproducibility.
Google Research published a new evaluation framework demonstrating that the industry standard of using three to five human raters per benchmark item is insufficient for reproducible results. Led by Flip Korn and Chris Welty, the March 2026 study reveals that capturing actual human nuance requires more than 10 raters per item. For engineering teams building internal benchmarks, this changes how evaluation budgets are allocated.
The Low-Rater Problem
Current evaluation methodology relies heavily on the "forest" approach. This strategy prioritizes breadth: thousands of items are each rated by a small pool of three to five human raters. The goal is to establish an overall sense of model performance based on majority vote.
This low-rater method fails to capture natural human disagreement. When evaluating complex outputs, a conditional response differs significantly from a definitive approval. By forcing a majority consensus from a small sample, benchmarks lose the full variation of human judgment. The research identifies this practice as a primary driver of the reproducibility crisis in AI evaluation.
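To see why small panels are fragile on contested items, consider a toy model (ours, not the paper's) where each rater independently approves an output with probability 0.6, i.e. the population genuinely splits 60/40. A stdlib-only sketch computes how often a small majority vote lands on the minority side; either way, the vote collapses the 60/40 split into a single label:

```python
from math import comb

def minority_majority_prob(p, k):
    """Probability that a majority of k raters votes for the option
    held by only a minority (1 - p) of the population, assuming each
    rater votes independently with approval rate p."""
    return sum(comb(k, j) * (1 - p) ** j * p ** (k - j)
               for j in range((k // 2) + 1, k + 1))

# An item where the population genuinely splits 60/40:
for k in (3, 5, 11, 21):
    print(k, round(minority_majority_prob(0.6, k), 3))
```

With three raters, the "consensus" label contradicts the population majority roughly a third of the time, and no panel size recovers the disagreement itself once votes are collapsed to a single label.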
Forest vs. Tree Trade-offs
The research formalizes the trade-off between the number of items (N) and the number of raters per item (K). The correct configuration depends entirely on the specific measurement goal.
If the objective is measuring strict accuracy against a definitive ground truth, increasing the number of items remains mathematically sound. Adding more items establishes a clearer majority vote across the dataset.
If the objective is capturing nuance, increasing the number of raters per item is mandatory. The "tree" approach favors depth over breadth. The study shows that scaling past 10 raters drives the p-value toward zero. This statistical significance allows practitioners to reliably reject the null hypothesis when comparing two models. If you are evaluating AI output for subjective criteria like safety, helpfulness, or cultural alignment, depth matters more than breadth.
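The p-value behavior can be illustrated with a deliberately simplified two-proportion test. The mean ratings 0.52 and 0.48, the Bernoulli-rater assumption, and the omission of item-level variance are all our simplifications, not the paper's methodology:

```python
import math

def two_model_p_value(p_a, p_b, n_items, k_raters):
    """Approximate two-sided p-value for the difference between two
    models' mean ratings, treating every rating as an independent
    Bernoulli draw. Toy model: it only illustrates how the total
    rating budget n_items * k_raters sharpens the comparison."""
    n = n_items * k_raters  # total ratings per model
    var = (p_a * (1 - p_a) + p_b * (1 - p_b)) / n
    z = abs(p_a - p_b) / math.sqrt(var)
    return math.erfc(z / math.sqrt(2))  # 2 * (1 - Phi(z))

# Fixed 100 items, varying raters per item:
for k in (3, 5, 10, 20):
    print(k, round(two_model_p_value(0.52, 0.48, 100, k), 4))
```

Under these assumptions, a 3-rater panel leaves a 4-point gap between two models statistically indistinguishable, while 20 raters per item pushes the comparison well past the conventional 0.05 threshold.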
Testing Across High-Density Datasets
To validate the 10-rater threshold, the researchers tested their framework against datasets built with high human annotation density. The Toxicity dataset includes 107,620 comments labeled by 17,280 raters. The DICES dataset provides 350 chatbot conversations rated across 16 dimensions by 123 raters. The team also used D3code, a cross-cultural set of 4,554 items labeled by 4,309 globally distributed raters, and a Jobs dataset with 2,000 tweets evaluated by five raters each.
This rigorous methodology arrives just as the industry shifts toward harder evaluation targets. The March 13 release of the “Humanity’s Last Exam” benchmark tested models against 2,500 expert-level questions, where frontier systems scored between 40 and 50 percent. As evaluation tasks become more complex, the statistical validity of the human rating pool becomes the primary bottleneck for useful benchmarks.
Open-Source Evaluation Simulator
Alongside the paper, Google released an open-source simulator for machine learning practitioners. The tool calculates the optimal ratio of items to raters based on specific constraints. Teams input their measurement goals alongside their labeling budget to generate a mathematically sound evaluation strategy. If you build systems requiring custom grading logic, like multi-agent systems, this simulator defines exactly how many human reviewers you need to validate your automated metrics.
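The article does not document the simulator's actual interface, so the function below is a hypothetical sketch of the underlying budget arithmetic: for a fixed number of total ratings, every split between item count and raters per item is enumerable, and a nuance-focused team simply filters for sufficient depth.

```python
def allocation_options(budget, min_raters=1, max_raters=30):
    """Enumerate (items, raters_per_item) splits for a fixed labeling
    budget. Hypothetical sketch -- not the API of Google's released
    simulator."""
    opts = []
    for k in range(min_raters, max_raters + 1):
        n = budget // k
        if n > 0:
            opts.append((n, k))
    return opts

# With 3,000 total ratings, a nuance-focused team filters for k >= 10:
deep = [(n, k) for n, k in allocation_options(3000) if k >= 10]
print(deep[0])  # (300, 10): 300 items, 10 raters each
```

The same 3,000-rating budget also admits the conventional (1000, 3) split, which is exactly the breadth-first default the study argues against for subjective criteria.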
When designing your next evaluation pipeline, stop defaulting to three raters per prompt. Calculate the mathematical requirement for your specific goal using the simulator. If your task requires capturing subjective human judgment or complex reasoning, allocate your budget toward rating fewer items with 10 or more raters each.