NVIDIA's Agentic Retrieval Pipeline Tops ViDoRe v3 Benchmark
NVIDIA's NeMo Retriever shows how ReACT-style agentic retrieval can boost benchmark scores—while exposing major latency and cost trade-offs.
NVIDIA’s NeMo Retriever agentic retrieval pipeline reached #1 on the ViDoRe v3 pipeline leaderboard on March 13, 2026, posting 69.22 nDCG@10. A same-day submission put NVIDIA at #2 on the BRIGHT leaderboard with 50.9 nDCG@10. For developers building RAG systems, the result shows measurable gains from treating retrieval as a multi-step reasoning loop instead of a single dense lookup.
The Benchmark Results
NVIDIA published “Beyond Semantic Similarity: Introducing NVIDIA NeMo Retriever’s Generalizable Agentic Retrieval Pipeline” on Hugging Face and tied it to public leaderboard results. BRIGHT’s official leaderboard still shows INF-X-Retriever ahead at 63.4 nDCG@10, so NVIDIA’s BRIGHT result is a strong second-place entry. The March 13 story is best read as a pipeline-level retrieval result in a new leaderboard environment: ViDoRe v3 only introduced its public pipeline framework on February 27, 2026.
Agentic Pipeline vs. Dense Retrieval
NVIDIA’s gain on ViDoRe v3 came from adding an agentic controller on top of an already strong embedding model. Using the same base retrieval family:
| ViDoRe v3 system | nDCG@10 |
|---|---|
| Dense retrieval with nemotron-colembed-vl-8b-v2 | 64.36 |
| NeMo Agentic Retrieval, Opus 4.5 + same embedding model | 69.22 |
The agentic pipeline delivers a 4.86-point lift over NVIDIA’s own dense baseline. On BRIGHT, the uplift is larger: the underlying dense retriever llama-nv-embed-reasoning-3b scores 38.3 average nDCG@10, while the full agentic pipeline reaches 50.9, a 12.6-point gain. NVIDIA improved results by turning retrieval into a search-and-reason loop, not by swapping in a single better embedding model.
How the Pipeline Works
NVIDIA describes the system as a ReACT-style retrieval pipeline. The agent iterates through a tool loop with actions such as think, retrieve(query, top_k), and final_results. The controller can decompose a complex question, try one retrieval query, inspect returned evidence, rewrite the next query, repeat for several steps, then output the final ranked set. If the process hits a maximum step count or context constraint, NVIDIA falls back to Reciprocal Rank Fusion (RRF) across all retrieval attempts.
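A minimal sketch of that loop, assuming hypothetical `controller` and `retriever` interfaces: NVIDIA has not published this code, so the action names follow the description above while everything else is illustrative.

```python
from collections import defaultdict

MAX_STEPS = 12  # assumption: step budget in line with the ~9-12 retrieval calls per query this pipeline averages

def rrf_fuse(result_lists, k=60):
    """Reciprocal Rank Fusion across all retrieval attempts (the fallback path)."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def agentic_retrieve(question, controller, retriever):
    """ReACT-style loop: think, retrieve, inspect, reformulate, finalize.

    `controller.next_action` (an LLM wrapper returning an action dict) and
    `retriever.search` (returning ranked results with .doc_id/.snippet)
    are hypothetical interfaces, not NVIDIA's API.
    """
    attempts = []   # every ranked list seen so far, kept for the RRF fallback
    evidence = []   # observations fed back to the controller each step
    for _ in range(MAX_STEPS):
        step = controller.next_action(question, evidence)
        if step["action"] == "think":
            evidence.append(step["thought"])              # free-form reasoning
        elif step["action"] == "retrieve":
            results = retriever.search(step["query"], top_k=step.get("top_k", 10))
            attempts.append([r.doc_id for r in results])
            evidence.extend(r.snippet for r in results)   # inspect returned evidence
        elif step["action"] == "final_results":
            return step["doc_ids"]                        # agent commits to a ranking
    # Step budget or context limit hit: fall back to RRF over all attempts.
    return rrf_fuse(attempts)
```

The RRF fallback matters in practice: even when the agent never commits to final_results, the fused ranking preserves signal from every query reformulation along the way.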
Traditional dense RAG assumes a single query embedding is a sufficient proxy for the user’s information need. On reasoning-heavy and visually rich retrieval tasks, the control policy around retrieval can move the benchmark more than another incremental embedding improvement.
Engineering Change: In-Process Retriever
The team replaced an MCP-server-based tool connection with an in-process, thread-safe singleton retriever. The change removes network serialization, request round trips, tool server lifecycle management, and repeated model and corpus loading. Throughput and GPU utilization improve because the retriever model and its corpus embeddings are loaded once and shared across threads.
If you build AI agents around retrieval tools, this is directly applicable. Tool invocation overhead becomes a first-order concern once your agent makes 9 to 12 retrieval calls per query.
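A minimal sketch of the pattern, assuming hypothetical `load_model` and `load_corpus_embeddings` loaders for your own stack; this is not NVIDIA's implementation, just the shape of the change.

```python
import threading

class InProcessRetriever:
    """Thread-safe singleton: model and corpus embeddings load once per process.

    `load_model` and `load_corpus_embeddings` are hypothetical stand-ins for
    whatever brings your weights and embedded corpus into memory.
    """
    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        # Double-checked locking so concurrent agent threads never trigger
        # a second model or corpus load.
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    inst = super().__new__(cls)
                    inst.model = load_model()                    # hypothetical loader
                    inst.embeddings = load_corpus_embeddings()   # loaded once, shared
                    cls._instance = inst
        return cls._instance

    def search(self, query, top_k=10):
        # No network hop, no request serialization, no tool-server lifecycle:
        # the call stays in-process, which is what removes the MCP round trip.
        q = self.model.encode(query)
        return self.embeddings.nearest(q, top_k)
```

With 9 to 12 calls per query, even a few hundred milliseconds of per-call transport overhead compounds into seconds of pure plumbing.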
Cost and Latency
NVIDIA’s own numbers make the production tradeoff explicit. For ViDoRe v3, the pipeline averages 136.3 seconds per query, 9.2 retrieval calls per query, and approximately 760k input tokens per query. For BRIGHT, the Opus-based setup averages 148.2 seconds per query and 11.8 retrieval calls per query. These were measured sequentially on a single A100 GPU with one concurrent Claude API call.
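A quick back-of-envelope makes the cost concrete; the per-token price below is an assumption, so substitute your provider's current rate.

```python
# Back-of-envelope per-query cost, using the ViDoRe v3 numbers above.
# PRICE_PER_M_INPUT is an assumption -- substitute your provider's actual rate.
PRICE_PER_M_INPUT = 5.00          # hypothetical USD per 1M input tokens
TOKENS_PER_QUERY = 760_000        # ~760k input tokens per query (reported)

cost_per_query = TOKENS_PER_QUERY / 1_000_000 * PRICE_PER_M_INPUT
print(f"${cost_per_query:.2f} per query")   # $3.80 at the assumed rate
# At 136.3 s per query, one sequential worker handles roughly 26 queries/hour.
```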
The March 13 announcement is a retrieval quality result, not a deployment template. If your product has an interactive latency budget measured in seconds, these numbers fall outside the acceptable range for most user-facing paths.
Model Pairings
NVIDIA used different backends for the two benchmarks: Claude Opus 4.5 with nemotron-colembed-vl-8b-v2 for ViDoRe v3, and Claude Opus 4.5 with llama-nv-embed-reasoning-3b for BRIGHT. Ablations that swap in gpt-oss-120b as the controller show a modest gap on ViDoRe v3 (66.38 vs 69.22 for Opus) and a large one on BRIGHT (41.27 vs 50.79–50.90 for Opus). BRIGHT focuses on reasoning-intensive retrieval, so the quality of the agent’s search policy and query reformulation matters more there.
If your workload involves multi-hop retrieval, ambiguous queries, or evidence spread across multiple documents, the choice of controller model may have a larger impact than in standard single-hop semantic search.
For teams that benchmark their own stacks, revisit evaluation design. If you only compare embedding models, you may miss larger gains from query planning, decomposition, and iterative retrieval. Test an agentic retrieval fallback against your dense baseline for the subset of queries that fail one-shot search. Start with multi-hop, ambiguous, or visually rich documents, measure lift against latency and token cost, and keep the agentic path behind a complexity gate rather than making it your default retrieval mode.
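A minimal sketch of such a gate, with hypothetical `dense` and `agentic` retriever interfaces and a confidence threshold you would tune on your own eval set:

```python
CONFIDENCE_FLOOR = 0.35  # assumption: tune against your own failure cases

def retrieve_with_gate(question, dense, agentic):
    """Dense-first retrieval with an agentic fallback behind a complexity gate.

    `dense.search` and `agentic.run` are hypothetical interfaces; the gate
    keeps the slow, token-hungry agentic loop off the default path.
    """
    results = dense.search(question, top_k=10)
    # Escalate only when one-shot retrieval looks weak, e.g. the top
    # similarity score falls below a tuned floor.
    if not results or results[0].score < CONFIDENCE_FLOOR:
        return agentic.run(question)   # minutes-scale path, used sparingly
    return results
```

What counts as failing one-shot search is workload-specific: a top-score floor is the simplest gate, and a small classifier over query features (hop count, ambiguity, document modality) may route better.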