NVIDIA's Agentic Retrieval Pipeline Tops ViDoRe v3 Benchmark
NVIDIA’s NeMo Retriever shows how ReAct-style agentic retrieval can boost benchmark scores while exposing major latency and cost trade-offs.
NVIDIA’s NeMo Retriever agentic retrieval pipeline reached #1 on the ViDoRe v3 pipeline leaderboard on March 13, 2026, posting 69.22 nDCG@10. A submission published the same day also put NVIDIA at #2 on the BRIGHT leaderboard with 50.9 nDCG@10. For developers building RAG systems, the result matters because it shows measurable gains from treating retrieval as a multi-step reasoning loop instead of a single dense lookup.
The Benchmark Results
The event was a benchmark and publication release, not a new standalone model launch. NVIDIA published “Beyond Semantic Similarity: Introducing NVIDIA NeMo Retriever’s Generalizable Agentic Retrieval Pipeline” on Hugging Face and tied it to public leaderboard results and a BRIGHT submission page.
The two headline numbers are straightforward:
| Benchmark | System | Score | Placement | Date |
|---|---|---|---|---|
| ViDoRe v3 pipeline leaderboard | NeMo Agentic Retrieval, Opus 4.5 + nemotron-colembed-vl-8b-v2 | 69.22 nDCG@10 | #1 | 2026-03-13 |
| BRIGHT leaderboard | NeMo Agentic Retrieval, Opus 4.5 + llama-nv-embed-reasoning-3b | 50.9 nDCG@10 | #2 | 2026-03-13 |
BRIGHT’s official leaderboard still shows INF-X-Retriever ahead at 63.4 nDCG@10, so NVIDIA’s result there is a strong second-place entry rather than a top finish.
That distinction matters. The March 13 story is best read as a pipeline-level retrieval result in a new leaderboard environment, especially on ViDoRe v3, which only introduced its public pipeline framework on February 27, 2026.
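The metric behind all of these scores is nDCG@10: discounted cumulative gain over the top 10 results, normalized by the ideal ordering. As a quick reference, a binary-relevance version can be computed as follows (leaderboards may use graded relevance judgments; this sketch only conveys the shape of the metric):

```python
import math

def dcg_at_k(relevances, k=10):
    # Discounted cumulative gain: relevance discounted by log2 of rank.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    # Normalize by the ideal (sorted) ordering so scores fall in [0, 1].
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# A perfect ranking scores 1.0; pushing relevant documents down lowers it.
print(ndcg_at_k([1, 1, 0, 0]))  # 1.0
print(ndcg_at_k([0, 0, 1, 1]))  # lower, because both hits rank below misses
```

Leaderboard scores like 69.22 are this quantity averaged across queries and multiplied by 100.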
Agentic Pipeline vs. Dense Retrieval
NVIDIA’s reported gain on ViDoRe v3 came from adding an agentic controller on top of an already strong embedding model.
For ViDoRe v3, NVIDIA compares three systems using the same base retrieval family:
| ViDoRe v3 system | Score |
|---|---|
| Dense retrieval with nemotron-colembed-vl-8b-v2 | 64.36 |
| INF-X-Retriever + same embedding model | 62.31 |
| NeMo Agentic Retrieval, Opus 4.5 + same embedding model | 69.22 |
That is a 4.86-point lift over NVIDIA’s own dense baseline on the same embedding backend.
The BRIGHT uplift is larger in absolute terms. NVIDIA’s GitHub submission reports the underlying dense retriever, llama-nv-embed-reasoning-3b, at 38.3 average nDCG@10, while the full agentic pipeline reaches 50.9.
| BRIGHT system | Score |
|---|---|
| Dense llama-nv-embed-reasoning-3b | 38.3 |
| NeMo Agentic Retrieval pipeline | 50.9 |
| Gain | +12.6 |
This is where the architectural point becomes clear. NVIDIA did not beat dense retrieval by swapping in a single better embedding model. It improved results by turning retrieval into a search-and-reason loop.
How the pipeline works
NVIDIA describes the system as a ReAct-style retrieval pipeline. The agent iterates through a tool loop with actions such as think, retrieve(query, top_k), and final_results.
Operationally, that means the controller can:
- decompose a complex question,
- try one retrieval query,
- inspect returned evidence,
- rewrite the next query,
- repeat for several steps,
- then output the final ranked set.
If the process hits a maximum step count or context constraint, NVIDIA falls back to Reciprocal Rank Fusion (RRF) across all retrieval attempts.
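The loop and fallback described above can be sketched in a few dozen lines. Everything here is illustrative: `llm_decide` and `search_index` stand in for the controller LLM and the embedding index, and the scripted controller exists only to make the example runnable. None of this is NVIDIA’s actual implementation.

```python
def reciprocal_rank_fusion(rankings, k=60):
    # Merge every retrieval attempt; documents ranked high in several
    # attempts accumulate the largest fused scores.
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def agentic_retrieve(question, llm_decide, search_index, max_steps=12):
    attempts = []   # every retrieval result list, kept for the RRF fallback
    evidence = []   # trajectory shown back to the controller each step
    for _ in range(max_steps):
        action = llm_decide(question, evidence)
        if action["type"] == "think":
            evidence.append(("thought", action["text"]))
        elif action["type"] == "retrieve":
            results = search_index(action["query"], top_k=action.get("top_k", 10))
            attempts.append(results)
            evidence.append(("results", results))
        else:  # "final_results": the controller commits to a ranking
            return action["ranking"]
    # Step budget exhausted: fuse all attempts with RRF instead.
    return reciprocal_rank_fusion(attempts)

# Scripted controller for illustration: two retrievals, then an answer.
script = iter([
    {"type": "retrieve", "query": "original question"},
    {"type": "retrieve", "query": "reformulated question"},
    {"type": "final_results", "ranking": ["doc_a", "doc_b"]},
])
corpus = {"original question": ["doc_a", "doc_c"],
          "reformulated question": ["doc_b", "doc_a"]}
final = agentic_retrieve("q", lambda q, ev: next(script),
                         lambda q, top_k=10: corpus[q])
```

The constant k=60 is the conventional RRF default; it damps the advantage of rank-1 hits so that agreement across attempts dominates any single attempt’s ordering.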
For retrieval engineers, this is the important shift. Traditional dense RAG assumes a query embedding is a sufficient proxy for information need. NVIDIA’s March 13 result suggests that on reasoning-heavy and visually rich retrieval tasks, the control policy around retrieval can move the benchmark more than another incremental embedding improvement.
The engineering change that likely helped in practice
One practical detail in NVIDIA’s write-up deserves more attention than the benchmark headline. The team replaced an MCP-server-based tool connection with an in-process, thread-safe singleton retriever.
That change removes several sources of overhead:
- network serialization,
- request round trips,
- tool server lifecycle management,
- repeated model and corpus loading.
The result is better throughput and GPU utilization because the retriever model and its corpus embeddings are loaded once and shared across threads.
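A minimal sketch of that pattern in Python, assuming the retriever runs in the same process as the agent. The heavy loads here are placeholders; NVIDIA’s actual implementation is not published in the write-up.

```python
import threading

class Retriever:
    """In-process retriever loaded once and shared across agent threads."""

    _instance = None
    _lock = threading.Lock()

    @classmethod
    def get(cls):
        # Double-checked locking: the expensive load runs exactly once,
        # even when many agent threads request the retriever concurrently.
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = cls()
        return cls._instance

    def __init__(self):
        # In a real system these would load the embedding model and corpus
        # vectors into GPU memory; here they are stand-in strings.
        self.model = "loaded-model"
        self.corpus_embeddings = "loaded-corpus-embeddings"

    def retrieve(self, query, top_k=10):
        # Pure in-process call: no serialization, no network round trip,
        # no per-call model loading.
        return [f"result-{i}-for-{query}" for i in range(top_k)]
```

Every agent thread calls `Retriever.get().retrieve(...)` and shares the same loaded model and embeddings, which is exactly the overhead profile the MCP-server setup could not match.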
If you are building AI agents around retrieval tools, this is directly applicable. Tool invocation overhead becomes a first-order system concern once your agent starts making 9 to 12 retrieval calls per query. A clean in-process retrieval path can matter as much as prompt design.
The model pairings behind the March 13 results
NVIDIA used different backends for the two benchmarks.
| Benchmark | LLM controller | Retrieval backend |
|---|---|---|
| ViDoRe v3 | Claude Opus 4.5 | nemotron-colembed-vl-8b-v2 |
| BRIGHT | Claude Opus 4.5 | llama-nv-embed-reasoning-3b |
NVIDIA also published ablations with gpt-oss-120b as the controller.
| Benchmark | Opus 4.5 | gpt-oss-120b |
|---|---|---|
| ViDoRe v3 | 69.22 | 66.38 |
| BRIGHT | 50.79 to 50.90 | 41.27 |
The controller gap is modest on ViDoRe v3 and large on BRIGHT. That lines up with benchmark design. BRIGHT focuses on reasoning-intensive retrieval, so the quality of the agent’s search policy and query reformulation appears to matter more.
If your workload involves multi-hop retrieval, ambiguous queries, or evidence spread across multiple documents, the choice of controller model may have a larger impact than it does in standard single-hop semantic search.
Cost and latency are the limiting factors
NVIDIA’s own numbers make the production tradeoff explicit. This pipeline is accurate and expensive.
For ViDoRe v3, NVIDIA reports:
- 136.3 seconds per query
- 1,837M total input tokens
- 15M total output tokens
- 9.2 retrieval calls per query
- approximately 760k input tokens per query
- approximately 6.3k output tokens per query
For BRIGHT, the reported Opus-based setup averages:
- 148.2 seconds per query
- 1,251M input tokens
- 11M output tokens
- 11.8 retrieval calls per query
NVIDIA says these were measured sequentially on a single A100 GPU with one concurrent Claude API call.
That means the March 13 announcement is a retrieval quality result, not a deployment template. If your product has an interactive latency budget measured in seconds, these numbers are outside the acceptable range for most user-facing paths.
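To put the token counts in dollar terms, here is a back-of-envelope estimate from the reported ViDoRe v3 per-query averages. The prices below are placeholders, not actual Claude Opus rates; substitute your provider’s current pricing.

```python
# Reported ViDoRe v3 averages (from NVIDIA's write-up).
input_tokens_per_query = 760_000
output_tokens_per_query = 6_300

# ASSUMED prices, purely for illustration ($ per 1M tokens).
price_per_m_input = 15.0
price_per_m_output = 75.0

cost = (input_tokens_per_query / 1e6) * price_per_m_input \
     + (output_tokens_per_query / 1e6) * price_per_m_output
print(f"${cost:.2f} per query")
```

Under these assumed rates the input side alone is over $11 per query, which is why a path like this belongs behind a gate rather than in a default interactive flow.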
Benchmark Context
This result landed just two weeks after Hugging Face introduced the ViDoRe v3 pipeline leaderboard, which was designed to evaluate full retrieval pipelines rather than isolated embedding models.
That changes what leaderboard wins mean. Older retrieval leaderboards often rewarded the best one-shot retriever. ViDoRe v3’s pipeline framing makes room for:
- dense retrieval,
- sparse retrieval,
- hybrid retrieval,
- reranking,
- and agentic orchestration.
NVIDIA’s March 13 submission is one of the first prominent examples of a vendor optimizing for that broader category. It is effectively a statement that retrieval evaluation is shifting from “which embedding is best” to “which retrieval system is best.”
For teams that benchmark their own stacks, this is a useful prompt to revisit evaluation design. If you only compare embedding models, you may miss larger gains available from query planning, decomposition, and iterative retrieval.
The BRIGHT result also reflects custom retriever training
NVIDIA’s BRIGHT submission includes unusually detailed training notes for the retrieval backend. The reported llama-nv-embed-reasoning-3b model was trained on synthetic reasoning-oriented query-document pairs, with:
- top-4 similar documents retrieved by another embedder,
- reasoning-intensive queries generated by Qwen3-235B-A22B,
- positive-document annotation by Qwen3-Next-80B-A3B-Instruct,
- hard negatives filtered with Sim_neg < Sim_pos,
- additional data from ReasonEmbed, ReasonAug, and ReasonRank.
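The hard-negative filter in that list is worth spelling out. A minimal sketch of the Sim_neg < Sim_pos rule, with a toy cosine similarity (the actual embedders and thresholding details are NVIDIA’s, not shown here):

```python
def cosine_sim(a, b):
    # Toy cosine similarity over plain Python lists.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

def filter_hard_negatives(query_vec, positive_vec, candidate_vecs):
    sim_pos = cosine_sim(query_vec, positive_vec)
    # Keep a candidate negative only if it scores BELOW the annotated
    # positive; candidates scoring above it are likely relevant documents
    # mislabeled as negatives, which would poison contrastive training.
    return [c for c in candidate_vecs
            if cosine_sim(query_vec, c) < sim_pos]
```

The point of the filter is to keep negatives hard (similar to the query) without keeping false negatives (more similar than the labeled positive).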
That means the BRIGHT score is the product of two layers:
- a reasoning-trained retriever, and
- an LLM retrieval controller.
If you read the March 13 result as pure prompting, you miss the larger engineering picture. The pipeline quality depends on both the agent policy and a retriever trained for reasoning-heavy search.
Practical Takeaways
Three practical implications stand out.
First, iterative retrieval is now benchmark-credible. On March 13, NVIDIA showed that a ReAct-style retrieval loop can beat dense retrieval by meaningful margins on public leaderboards.
Second, the system cost is far above normal production budgets. A path that averages 136 to 148 seconds per query and burns hundreds of thousands of input tokens per request belongs in offline research, analyst workflows, or gated escalation paths.
Third, system design around tools matters. The switch from MCP-style tool serving to an in-process retriever is a concrete signal that retrieval agents need low-overhead tool infrastructure to scale.
If you build retrieval-heavy applications, test an agentic retrieval fallback against your dense baseline for the subset of queries that fail one-shot search. Start with multi-hop, ambiguous, or visually rich documents, measure lift against latency and token cost, and keep the agentic path behind a complexity gate rather than making it your default retrieval mode.
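One way to sketch that complexity gate: route a query to the agentic path only when cheap one-shot retrieval looks unreliable. The threshold and the lexical multi-hop signals below are illustrative placeholders, not tuned values.

```python
def route_query(query, dense_results, score_threshold=0.45):
    # Confidence signal: top similarity score from the one-shot dense pass.
    top_score = dense_results[0]["score"] if dense_results else 0.0
    # Crude multi-hop signal: comparative phrasing in the query itself.
    multi_hop_hint = any(word in query.lower()
                         for word in ("compare", "versus", "between", "both"))
    # Low retrieval confidence or multi-hop phrasing -> expensive agentic path.
    if top_score < score_threshold or multi_hop_hint:
        return "agentic"
    return "dense"
```

In production you would replace the lexical hints with a learned classifier or the dense score distribution, but the shape is the same: the agentic loop is a fallback for queries the cheap path is likely to fail, not the default route.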