
Open Agent Leaderboard Evaluates Full Scaffolding and Task Costs

IBM and Hugging Face launched a benchmark that evaluates autonomous agents as complete systems, measuring both task success rates and the USD cost per run.

On May 18, 2026, IBM Research and Hugging Face launched the Open Agent Leaderboard to evaluate general-purpose agents as complete systems. The platform expands evaluation beyond isolated language models to test the entire orchestration layer. This includes the tools, planning logic, memory management, and error recovery protocols that drive autonomous execution.

System Evaluation and Cost Dynamics

The leaderboard introduces a dual-metric reporting structure that measures both quality through task success rates and efficiency through the USD cost per task. This approach exposes the financial reality of running autonomous loops in production environments.

IBM Research found that failed runs cost 20 to 54 percent more than successful ones. Agents frequently burn through expensive iteration loops and API calls before finally abandoning an impossible or misunderstood task. If you evaluate and test AI agents, tracking this cost of failure is critical for projecting infrastructure bills.
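Both metrics, along with the failure premium, can be derived from per-run logs. The following is a minimal sketch and not the leaderboard's own code: the `Run` record, its field names, and the sample figures are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Run:
    """One agent attempt at one task (hypothetical record layout)."""
    task_id: str
    success: bool
    cost_usd: float


def summarize(runs: list[Run]) -> dict[str, float]:
    """Return success rate, mean cost per task, and the failure cost premium."""
    if not runs:
        raise ValueError("no runs to summarize")
    ok = [r.cost_usd for r in runs if r.success]
    failed = [r.cost_usd for r in runs if not r.success]
    summary = {
        "success_rate": len(ok) / len(runs),
        "mean_cost_usd": mean(r.cost_usd for r in runs),
    }
    # The premium is only defined when both outcomes were observed.
    if ok and failed:
        summary["failure_premium"] = mean(failed) / mean(ok) - 1.0
    return summary


# Invented sample data, for illustration only.
runs = [
    Run("t1", True, 0.40),
    Run("t2", True, 0.60),
    Run("t3", False, 0.70),
    Run("t4", False, 0.80),
]
report = summarize(runs)
```

In this invented sample, half the runs succeed, the mean cost is about $0.63 per task, and failed runs cost roughly 50 percent more than successful ones. Reporting the premium separately from the average keeps the cost of failure visible instead of folding it into a single blended number.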

Exgentic Framework and Unified Protocol

The benchmarking infrastructure runs on Exgentic, a new practical framework from IBM designed to run and reproduce general agent evaluations. This replaces custom, fragmented testing scripts with a standardized pipeline.

IBM paired this with a Unified Protocol that normalizes how agents interface with different benchmarks. The protocol allows developers to drop an agent system into diverse environments without manual customization. The agent must navigate coding tasks, open-ended research, rule-bound conversations, and technical support scenarios using the same foundational logic.
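The idea of one agent running unchanged across different benchmarks can be pictured as a shared interface. The sketch below is an assumption about the general shape of such a protocol, not Exgentic's actual API: `Environment`, `Agent`, `run_episode`, and the toy classes are hypothetical names invented for this example.

```python
from typing import Protocol


class Environment(Protocol):
    """Hypothetical benchmark interface; not Exgentic's real API."""

    def reset(self) -> str: ...

    def step(self, action: str) -> tuple[str, bool, bool]: ...


class Agent(Protocol):
    """Hypothetical agent interface: observation in, action out."""

    def act(self, observation: str) -> str: ...


def run_episode(agent: Agent, env: Environment, max_steps: int = 20) -> bool:
    """Drive any agent through any environment using the same loop."""
    observation = env.reset()
    for _ in range(max_steps):
        observation, done, success = env.step(agent.act(observation))
        if done:
            return success
    return False  # exhausting the step budget counts as a failure


class EchoTask:
    """Toy environment: succeeds when the agent repeats the prompt."""

    def __init__(self, prompt: str) -> None:
        self.prompt = prompt

    def reset(self) -> str:
        return self.prompt

    def step(self, action: str) -> tuple[str, bool, bool]:
        # Returns (next observation, done flag, success flag).
        return "", True, action == self.prompt


class EchoAgent:
    """Toy agent that returns whatever it observes."""

    def act(self, observation: str) -> str:
        return observation


# The same agent class runs in two differently named toy tasks.
results = {
    name: run_episode(EchoAgent(), EchoTask(name))
    for name in ("coding", "support")
}
```

Because `run_episode` depends only on the two interfaces, adding a new benchmark means writing one environment adapter rather than modifying the agent.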

Benchmark Results and the Agent Gap

The leaderboard aggregates six distinct task suites to test generality across broad action spaces. Early data reveals a significant performance gap between agent frameworks built on identical underlying models. Because the model is held constant, the difference in success rates and task costs comes entirely from orchestration. The scaffolding is proving just as influential as the parameter count.
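One way to isolate the effect of scaffolding is to hold the model fixed and compare frameworks against each other. The sketch below assumes a flat table of results; the model names, framework names, and success rates are invented placeholders, not leaderboard data.

```python
from collections import defaultdict

# Hypothetical (model, framework, success_rate) rows; all values are invented.
rows = [
    ("model-a", "framework-x", 0.62),
    ("model-a", "framework-y", 0.41),
    ("model-b", "framework-x", 0.55),
    ("model-b", "framework-y", 0.48),
]


def scaffold_spread(rows):
    """Per model, the gap in percentage points between its best and worst framework."""
    by_model = defaultdict(list)
    for model, _framework, rate in rows:
        by_model[model].append(rate)
    return {
        model: round((max(rates) - min(rates)) * 100, 1)
        for model, rates in by_model.items()
    }


spread = scaffold_spread(rows)
```

A large spread for a single model indicates that the orchestration layer, not the model, accounts for most of the variation in that row of results.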

Open-weight models remain competitive on specific, constrained task combinations, but on average they currently trail closed-source frontier models by 18 to 29 percentage points across the aggregated general benchmarks.

Developers can contribute to the Hugging Face project by wrapping custom agents in the Exgentic protocol for public evaluation. If you build multi-agent coordination patterns, testing your orchestration against this leaderboard will show how your memory management and error recovery logic affect both success rates and cost per task.
