IBM’s VAKRA Benchmark Exposes Why AI Agents Fail at Complex Tasks
A new IBM Research analysis explores the VAKRA benchmark, revealing how top AI models struggle with multi-hop reasoning and live API chaining in enterprise tools.
On April 15, 2026, IBM Research published a comprehensive analysis of VAKRA, an executable benchmark designed to test multi-hop reasoning across enterprise toolsets. VAKRA forces agents to interact with a live, self-hosted environment of over 8,000 APIs across 62 subject domains. If you build multi-step agents, this benchmark exposes exactly where and why current models fail during complex execution.
Architecture of the Evaluation Environment
Standard static benchmarks grade a language model on its final text output. VAKRA uses trace-level verification to grade the entire execution path. The environment replays an agent’s trajectory against persistent databases to confirm the process was correct at every step.
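The difference between output grading and trace-level verification can be sketched in a few lines. This is a minimal illustration of the concept, not VAKRA's actual harness: the `TraceStep` structure and `verify_trajectory` function are hypothetical names, standing in for whatever format the benchmark uses to compare an agent's trajectory against a reference execution.

```python
# Minimal sketch of trace-level verification: grade every step of the
# trajectory, not just the final answer. TraceStep / verify_trajectory
# are illustrative names, not VAKRA's real API.
from dataclasses import dataclass


@dataclass
class TraceStep:
    tool: str       # tool/API endpoint invoked at this step
    args: dict      # parameters passed to the tool
    result: object  # value the environment returned


def verify_trajectory(agent_trace, reference_trace):
    """Return (passed, index of first divergent step, or -1 if none)."""
    for i, (got, want) in enumerate(zip(agent_trace, reference_trace)):
        # Process-level check: right tool, right parameters, at this step.
        if got.tool != want.tool or got.args != want.args:
            return False, i
    if len(agent_trace) != len(reference_trace):
        # Agent stopped early or added extra calls.
        return False, min(len(agent_trace), len(reference_trace))
    return True, -1
```

A static benchmark would accept any trajectory whose final text matches; a checker like this fails the run at the first wrong tool call, which is what makes per-step failure analysis possible.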
The benchmark requires reasoning chains spanning three to seven steps. Agents must extract information from unstructured document indices and reconcile it with structured API endpoints. This setup evaluates how LLM function calling holds up under real-world constraints rather than in isolated, single-turn prompts.
Core Capabilities Tested
IBM categorizes the evaluation into four distinct operational requirements. The benchmark tests API chaining by requiring nested tool use through Business Intelligence endpoints in the SLOT-BIRD and SEL-BIRD collections. For tool selection, agents must pinpoint highly specific endpoints from a massive pool using the REST-BIRD collection.
The environment evaluates multi-hop reasoning through dependent chains where an early output must be transformed to parameterize a later call. Finally, it tests multi-source integration and policy adherence. This capability combines structured API queries with retrieval-augmented generation pipelines while enforcing natural-language rules like refusing restricted actions.
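A dependent chain of this kind is easy to show in miniature. The endpoints below are hypothetical stand-ins, not VAKRA tasks; the point is the shape of the problem: each hop's output must be transformed into the parameter the next call expects, so one malformed intermediate value breaks everything downstream.

```python
# Illustrative three-hop dependent chain. All data and endpoint names are
# made up; they model the structure of multi-hop API chaining, where an
# early output parameterizes a later call.
def lookup_employee(name):
    directory = {"Jane Doe": {"employee_id": "E-1042", "dept": "finance"}}
    return directory[name]


def list_reports(employee_id):
    reports = {"E-1042": ["R-1", "R-2", "R-3"]}
    return reports[employee_id]


def summarize(report_ids):
    return {"count": len(report_ids)}


def run_chain(user_name):
    # Hop 1: free-text identifier -> structured record
    record = lookup_employee(user_name)
    # Hop 2: transform hop-1 output into the next call's parameter
    report_ids = list_reports(record["employee_id"])
    # Hop 3: aggregate the result
    return summarize(report_ids)
```

If hop 1 returned `"E-1042 "` with a stray space, hop 2 would raise a `KeyError`, which is the error-compounding pattern the analysis describes.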
Trajectory Failure Modes
The IBM analysis identifies a phenomenon called “Error Compounding Across Hybrid Hop Chains.” Because the environment uses live tool execution, a minor parsing mistake in the first step guarantees a total trajectory failure by the fourth step.
Models frequently fail at entity disambiguation. They struggle to map a user’s plain-text identifier to the rigid schema required by a database. They also fail at cross-source mapping. This occurs when an agent cannot align data retrieved from an unstructured text document with the specific parameters required for a structured API call.
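Entity disambiguation is the narrowest of these failures to illustrate. The sketch below uses an invented customer table and a naive normalization rule; it shows the mapping an agent must get right, mapping a user's free-text name onto the rigid key a database schema expects, and how easily a near-miss falls through.

```python
# Entity disambiguation sketch: free-text identifier -> canonical schema
# key. The customer table and matching rule are hypothetical.
import re

CUSTOMERS = {"CUST-00412": {"name": "Acme Corporation"}}


def resolve_customer(user_text):
    """Map free text like 'acme corp' to a canonical customer key."""
    normalized = re.sub(r"[^a-z0-9 ]", "", user_text.lower())
    for key, row in CUSTOMERS.items():
        canonical = re.sub(r"[^a-z0-9 ]", "", row["name"].lower())
        # Substring match in either direction; real systems need a far
        # stronger similarity test than this.
        if normalized in canonical or canonical in normalized:
            return key
    raise LookupError(f"no unambiguous match for {user_text!r}")
```

An agent that skips this resolution step and passes `"Acme Corp"` directly as a database key exhibits exactly the disambiguation failure described above.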
Other documented failures involve schema and parameter alignment errors, where the model formats a JSON payload incorrectly for the next call in the sequence. Policy interpretation also degrades over deep reasoning chains, causing agents to ignore explicit natural-language constraints from the original system prompt.
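One mitigation for policy drift is to turn the natural-language rule into an explicit guard that runs before every dispatch, so the constraint survives even when the model stops attending to the system prompt. This is a generic pattern sketch with hypothetical tool names, not something the IBM analysis prescribes.

```python
# Sketch of enforcing a policy as a pre-call guard rather than relying on
# the system prompt alone. Action names are illustrative.
RESTRICTED_ACTIONS = {"delete_record", "export_pii"}


def guarded_call(tool_name, args, execute):
    """Refuse restricted actions before dispatching to the tool."""
    if tool_name in RESTRICTED_ACTIONS:
        return {"status": "refused",
                "reason": f"{tool_name} is restricted by policy"}
    return {"status": "ok", "result": execute(tool_name, args)}
```

Because the check lives in code, it fires at hop seven just as reliably as at hop one.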
Availability and Integration
The source code for the executable environments and task specifications is available on GitHub in the ibm-research/VAKRA repository. The evaluation harness and datasets are hosted on Hugging Face. IBM also launched a live leaderboard on Hugging Face Spaces for researchers to submit agent trajectories. As you evaluate and test AI agents in your own applications, trace-level verification offers a more accurate metric for production readiness than static completion scoring.
When building agentic workflows, you should design your architecture around the inevitability of compounded errors. Implement rigid schema validation between tool calls and provide explicit mapping instructions when passing unstructured context into structured database queries.
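Rigid schema validation between tool calls can be as simple as rejecting a malformed intermediate payload before it reaches the next tool. The following is a minimal hand-rolled sketch (field names and the schema format are illustrative; a production system would more likely use a library such as a JSON Schema validator or typed models):

```python
# Minimal inter-call schema validation: fail fast on a bad intermediate
# payload instead of letting the error compound downstream.
def validate_payload(payload, schema):
    """schema maps field name -> expected type; raises on any mismatch."""
    errors = []
    for field, expected in schema.items():
        if field not in payload:
            errors.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected):
            errors.append(f"{field}: expected {expected.__name__}, "
                          f"got {type(payload[field]).__name__}")
    extra = set(payload) - set(schema)
    if extra:
        errors.append(f"unexpected fields: {sorted(extra)}")
    if errors:
        raise ValueError("; ".join(errors))
    return payload


# Hypothetical schema for the next tool call in a chain.
NEXT_CALL_SCHEMA = {"customer_id": int, "region": str}
```

Raising at the boundary converts a silent fourth-step trajectory failure into a loud first-step error the agent (or a retry loop) can actually act on.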