Open Agent Leaderboard Evaluates Full Scaffolding and Task Costs
IBM and Hugging Face launched a benchmark that evaluates autonomous agents as complete systems, measuring both task success rates and the USD cost per run.
On May 18, 2026, IBM Research and Hugging Face launched the Open Agent Leaderboard to evaluate general-purpose agents as complete systems. The platform expands evaluation beyond isolated language models to test the entire orchestration layer. This includes the tools, planning logic, memory management, and error recovery protocols that drive autonomous execution.
System Evaluation and Cost Dynamics
The leaderboard introduces a dual-metric reporting structure that measures both quality through task success rates and efficiency through the USD cost per task. This approach exposes the financial reality of running autonomous loops in production environments.
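As a rough illustration of the dual-metric idea, the sketch below computes a success rate and an average USD cost per task from a list of run records. The Run record and the summarize function are hypothetical names chosen for this example, and the figures are made up. This is not the leaderboard's actual scoring code.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One agent attempt at one task (hypothetical record layout)."""
    task_id: str
    succeeded: bool
    cost_usd: float


def summarize(runs):
    """Return (success_rate, mean_cost_per_task_usd) for a batch of runs."""
    if not runs:
        raise ValueError("need at least one run")
    n = len(runs)
    # Quality metric: fraction of tasks the agent completed.
    success_rate = sum(1 for r in runs if r.succeeded) / n
    # Efficiency metric: total spend divided by tasks attempted,
    # so failed runs still count toward the bill.
    cost_per_task = sum(r.cost_usd for r in runs) / n
    return success_rate, cost_per_task


# Illustrative data only.
runs = [
    Run("a", True, 0.10),
    Run("b", False, 0.30),
    Run("c", True, 0.20),
    Run("d", True, 0.20),
]
rate, cost = summarize(runs)
print(f"success rate: {rate:.0%}, cost per task: ${cost:.2f}")
```

Reporting the two numbers side by side, rather than folding them into one score, keeps it visible when an agent buys a small accuracy gain with a large increase in spend.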
IBM Research found that failed runs cost 20 to 54 percent more than successful ones. Agents frequently burn through expensive iteration loops and API calls before finally abandoning an impossible or misunderstood task. For teams that evaluate and test AI agents, tracking this cost of failure is critical for projecting infrastructure bills.
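To see how that failure premium feeds into a budget, here is a back-of-the-envelope sketch. It assumes the 20 to 54 percent figure means the average failed run costs that much more than the average successful run. The success rate and the $1.00 baseline are invented inputs, and expected_cost_per_task is a hypothetical helper.

```python
def expected_cost_per_task(success_rate, success_cost_usd, failure_premium):
    """Expected spend per attempted task.

    Assumes a failed run costs (1 + failure_premium) times a successful one.
    """
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be between 0 and 1")
    failure_cost_usd = success_cost_usd * (1.0 + failure_premium)
    return (success_rate * success_cost_usd
            + (1.0 - success_rate) * failure_cost_usd)


# Invented inputs: 60% success rate, $1.00 per successful run.
low = expected_cost_per_task(0.60, 1.00, 0.20)
high = expected_cost_per_task(0.60, 1.00, 0.54)
print(f"expected cost per task: ${low:.3f} to ${high:.3f}")
```

Under these assumptions, a projection that prices every run at the successful-run cost would underestimate spend, and the shortfall grows as the success rate drops.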
Exgentic Framework and Unified Protocol
The benchmarking infrastructure runs on Exgentic, a new framework from IBM designed to run and reproduce general agent evaluations. It replaces custom, fragmented testing scripts with a standardized pipeline.
IBM paired this with a Unified Protocol that normalizes how agents interface with different benchmarks. The protocol allows developers to drop an agent system into diverse environments without manual customization. The agent must navigate coding tasks, open-ended research, rule-bound conversations, and technical support scenarios using the same foundational logic.
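The general idea of a shared agent-to-benchmark interface can be sketched as follows. Every name here (Agent, Environment, run_episode, and the toy classes) is hypothetical and chosen for illustration. This is not the Exgentic API or the actual Unified Protocol, whose details the announcement summarized here does not spell out.

```python
from typing import Protocol, Tuple


class Agent(Protocol):
    """Anything that maps an observation to an action (hypothetical)."""
    def act(self, observation: str) -> str: ...


class Environment(Protocol):
    """Any benchmark exposing the same two calls (hypothetical)."""
    def reset(self) -> str: ...
    def step(self, action: str) -> Tuple[str, bool, bool]: ...


def run_episode(agent: Agent, env: Environment, max_steps: int = 10) -> bool:
    """Drive one task to completion; return True if the task succeeded."""
    observation = env.reset()
    for _ in range(max_steps):
        # step() returns (next_observation, done, success).
        observation, done, success = env.step(agent.act(observation))
        if done:
            return success
    return False  # ran out of steps without finishing


class EchoTask:
    """Toy environment: succeeds if the agent repeats the requested word."""
    def reset(self) -> str:
        return "say: hello"

    def step(self, action: str) -> Tuple[str, bool, bool]:
        return ("done", True, action == "hello")


class EchoAgent:
    """Toy agent: returns whatever follows the 'say: ' prefix."""
    def act(self, observation: str) -> str:
        return observation.split(": ", 1)[-1]


print(run_episode(EchoAgent(), EchoTask()))
```

Because run_episode depends only on the two interfaces, the same agent object could be passed to any environment that implements reset and step, which is the kind of drop-in reuse the paragraph above describes.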
Benchmark Results and the Agent Gap
The leaderboard aggregates six distinct task suites to test generality across broad action spaces. Early data reveals a significant performance gap driven by orchestration: different AI agent frameworks using identical underlying models produced vastly different success rates and task costs. The scaffolding is proving just as influential as the choice of model.
Open-weight models remain competitive on specific, constrained task combinations, but they currently trail closed-source frontier models by 18 to 29 percentage points on average across the aggregated general benchmarks.
Developers can contribute to the Hugging Face project by wrapping custom agents in the Exgentic protocol for public evaluation. For teams building multi-agent coordination patterns, testing an orchestration layer against this leaderboard shows how its memory management and error recovery logic affect both success rate and cost per task.