NVIDIA's Agentic Retrieval Pipeline Tops ViDoRe v3 Benchmark
NVIDIA’s NeMo Retriever shows how ReAct-style agentic retrieval can boost benchmark scores while exposing major latency and cost trade-offs.
NVIDIA’s NeMo Retriever agentic retrieval pipeline reached #1 on the ViDoRe v3 pipeline leaderboard on March 13, 2026, posting 69.22 nDCG@10. A submission published the same day also put NVIDIA at #2 on the BRIGHT leaderboard with 50.9 nDCG@10. For developers building RAG systems, the result matters because it shows measurable gains from treating retrieval as a multi-step reasoning loop instead of a single dense lookup.
The Benchmark Results
The event was a benchmark and publication release, not a new standalone model launch. NVIDIA published “Beyond Semantic Similarity: Introducing NVIDIA NeMo Retriever’s Generalizable Agentic Retrieval Pipeline” on Hugging Face and tied it to public leaderboard results and a BRIGHT submission page.
The two headline numbers are straightforward:
| Benchmark | System | Score | Placement | Date |
|---|---|---|---|---|
| ViDoRe v3 pipeline leaderboard | NeMo Agentic Retrieval, Opus 4.5 + nemotron-colembed-vl-8b-v2 | 69.22 nDCG@10 | #1 | 2026-03-13 |
| BRIGHT leaderboard | NeMo Agentic Retrieval, Opus 4.5 + llama-nv-embed-reasoning-3b | 50.9 nDCG@10 | #2 | 2026-03-13 |
BRIGHT’s official leaderboard still shows INF-X-Retriever ahead at 63.4 nDCG@10, so NVIDIA’s result there is a strong second-place entry rather than a top finish.
That distinction matters. The March 13 story is best read as a pipeline-level retrieval result in a new leaderboard environment, especially on ViDoRe v3, which only introduced its public pipeline framework on February 27, 2026.
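The metric behind all of these scores is nDCG@10: discounted cumulative gain over the top 10 results, normalized by the ideal ordering. As a quick reference, a binary-relevance version can be computed as follows (leaderboards may use graded relevance judgments; this sketch only conveys the shape of the metric):

```python
import math

def dcg_at_k(relevances, k=10):
    # Discounted cumulative gain: relevance discounted by log2 of rank.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    # Normalize by the ideal (sorted) ordering so scores fall in [0, 1].
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# A perfect ranking scores 1.0; pushing relevant documents down lowers it.
print(ndcg_at_k([1, 1, 0, 0]))  # 1.0
print(ndcg_at_k([0, 0, 1, 1]))  # lower, because both hits rank below misses
```

Leaderboard scores like 69.22 are this quantity averaged across queries and multiplied by 100.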
Agentic Pipeline vs. Dense Retrieval
NVIDIA’s reported gain on ViDoRe v3 came from adding an agentic controller on top of an already strong embedding model.
For ViDoRe v3, NVIDIA compares three systems using the same base retrieval family:
| ViDoRe v3 system | Score |
|---|---|
| Dense retrieval with nemotron-colembed-vl-8b-v2 | 64.36 |
| INF-X-Retriever + same embedding model | 62.31 |
| NeMo Agentic Retrieval, Opus 4.5 + same embedding model | 69.22 |
That is a 4.86-point lift over NVIDIA’s own dense baseline on the same embedding backend.
The BRIGHT uplift is larger in absolute terms. NVIDIA’s GitHub submission reports the underlying dense retriever, llama-nv-embed-reasoning-3b, at 38.3 average nDCG@10, while the full agentic pipeline reaches 50.9.
| BRIGHT system | Score |
|---|---|
| Dense llama-nv-embed-reasoning-3b | 38.3 |
| NeMo Agentic Retrieval pipeline | 50.9 |
| Gain | +12.6 |
This is where the architectural point becomes clear. NVIDIA did not beat dense retrieval by swapping in a single better embedding model. It improved results by turning retrieval into a search-and-reason loop.
How the pipeline works
NVIDIA describes the system as a ReAct-style retrieval pipeline. The agent iterates through a tool loop with actions such as think, retrieve(query, top_k), and final_results.
Operationally, that means the controller can:
- decompose a complex question,
- try one retrieval query,
- inspect returned evidence,
- rewrite the next query,
- repeat for several steps,
- then output the final ranked set.
If the process hits a maximum step count or context constraint, NVIDIA falls back to Reciprocal Rank Fusion (RRF) across all retrieval attempts.
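The loop and fallback described above can be sketched in a few dozen lines. Everything here is illustrative: `llm_decide` and `search_index` stand in for the controller LLM and the embedding index, and the scripted controller exists only to make the example runnable. None of this is NVIDIA’s actual implementation.

```python
def reciprocal_rank_fusion(rankings, k=60):
    # Merge every retrieval attempt; documents ranked high in several
    # attempts accumulate the largest fused scores.
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def agentic_retrieve(question, llm_decide, search_index, max_steps=12):
    attempts = []   # every retrieval result list, kept for the RRF fallback
    evidence = []   # trajectory shown back to the controller each step
    for _ in range(max_steps):
        action = llm_decide(question, evidence)
        if action["type"] == "think":
            evidence.append(("thought", action["text"]))
        elif action["type"] == "retrieve":
            results = search_index(action["query"], top_k=action.get("top_k", 10))
            attempts.append(results)
            evidence.append(("results", results))
        else:  # "final_results": the controller commits to a ranking
            return action["ranking"]
    # Step budget exhausted: fuse all attempts with RRF instead.
    return reciprocal_rank_fusion(attempts)

# Scripted controller for illustration: two retrievals, then an answer.
script = iter([
    {"type": "retrieve", "query": "original question"},
    {"type": "retrieve", "query": "reformulated question"},
    {"type": "final_results", "ranking": ["doc_a", "doc_b"]},
])
corpus = {"original question": ["doc_a", "doc_c"],
          "reformulated question": ["doc_b", "doc_a"]}
final = agentic_retrieve("q", lambda q, ev: next(script),
                         lambda q, top_k=10: corpus[q])
```

The constant k=60 is the conventional RRF default; it damps the advantage of rank-1 hits so that agreement across attempts dominates any single attempt’s ordering.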
For retrieval engineers, this is the important shift. Traditional dense RAG assumes a query embedding is a sufficient proxy for information need. NVIDIA’s March 13 result suggests that on reasoning-heavy and visually rich retrieval tasks, the control policy around retrieval can move the benchmark more than another incremental embedding improvement.
The engineering change that likely helped in practice
One practical detail in NVIDIA’s write-up deserves more attention than the benchmark headline. The team replaced an MCP-server-based tool connection with an in-process, thread-safe singleton retriever.
That change removes several sources of overhead:
- network serialization,
- request round trips,
- tool server lifecycle management,
- repeated model and corpus loading.
The result is better throughput and GPU utilization because the retriever model and its corpus embeddings are loaded once and shared across threads.
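A minimal sketch of that pattern in Python, assuming the retriever runs in the same process as the agent. The heavy loads here are placeholders; NVIDIA’s actual implementation is not published in the write-up.

```python
import threading

class Retriever:
    """In-process retriever loaded once and shared across agent threads."""

    _instance = None
    _lock = threading.Lock()

    @classmethod
    def get(cls):
        # Double-checked locking: the expensive load runs exactly once,
        # even when many agent threads request the retriever concurrently.
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = cls()
        return cls._instance

    def __init__(self):
        # In a real system these would load the embedding model and corpus
        # vectors into GPU memory; here they are stand-in strings.
        self.model = "loaded-model"
        self.corpus_embeddings = "loaded-corpus-embeddings"

    def retrieve(self, query, top_k=10):
        # Pure in-process call: no serialization, no network round trip,
        # no per-call model loading.
        return [f"result-{i}-for-{query}" for i in range(top_k)]
```

Every agent thread calls `Retriever.get().retrieve(...)` and shares the same loaded model and embeddings, which is exactly the overhead profile the MCP-server setup could not match.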
If you are building AI agents around retrieval tools, this is directly applicable. Tool invocation overhead becomes a first-order system concern once your agent starts making 9 to 12 retrieval calls per query. A clean in-process retrieval path can matter as much as prompt design.
The model pairings behind the March 13 results
NVIDIA used different backends for the two benchmarks.
| Benchmark | LLM controller | Retrieval backend |
|---|---|---|
| ViDoRe v3 | Claude Opus 4.5 | nemotron-colembed-vl-8b-v2 |
| BRIGHT | Claude Opus 4.5 | llama-nv-embed-reasoning-3b |
NVIDIA also published ablations with gpt-oss-120b as the controller.
| Benchmark | Opus 4.5 | gpt-oss-120b |
|---|---|---|
| ViDoRe v3 | 69.22 | 66.38 |
| BRIGHT | 50.79 to 50.90 | 41.27 |
The controller gap is modest on ViDoRe v3 and large on BRIGHT. That lines up with benchmark design. BRIGHT focuses on reasoning-intensive retrieval, so the quality of the agent’s search policy and query reformulation appears to matter more.
If your workload involves multi-hop retrieval, ambiguous queries, or evidence spread across multiple documents, the choice of controller model may have a larger impact than it does in standard single-hop semantic search.
Cost and latency are the limiting factors
NVIDIA’s own numbers make the production tradeoff explicit. This pipeline is accurate and expensive.
For ViDoRe v3, NVIDIA reports:
- 136.3 seconds per query
- 1,837M total input tokens
- 15M total output tokens
- 9.2 retrieval calls per query
- approximately 760k input tokens per query
- approximately 6.3k output tokens per query
For BRIGHT, the reported Opus-based setup averages:
- 148.2 seconds per query
- 1,251M input tokens
- 11M output tokens
- 11.8 retrieval calls per query
NVIDIA says these were measured sequentially on a single A100 GPU with one concurrent Claude API call.
That means the March 13 announcement is a retrieval quality result, not a deployment template. If your product has an interactive latency budget measured in seconds, these numbers are outside the acceptable range for most user-facing paths.
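To put the token counts in dollar terms, here is a back-of-envelope estimate from the reported ViDoRe v3 per-query averages. The prices below are placeholders, not actual Claude Opus rates; substitute your provider’s current pricing.

```python
# Reported ViDoRe v3 averages (from NVIDIA's write-up).
input_tokens_per_query = 760_000
output_tokens_per_query = 6_300

# ASSUMED prices, purely for illustration ($ per 1M tokens).
price_per_m_input = 15.0
price_per_m_output = 75.0

cost = (input_tokens_per_query / 1e6) * price_per_m_input \
     + (output_tokens_per_query / 1e6) * price_per_m_output
print(f"${cost:.2f} per query")
```

Under these assumed rates the input side alone is over $11 per query, which is why a path like this belongs behind a gate rather than in a default interactive flow.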
Benchmark Context
This result landed just two weeks after Hugging Face introduced the ViDoRe v3 pipeline leaderboard, which was designed to evaluate full retrieval pipelines rather than isolated embedding models.
That changes what leaderboard wins mean. Older retrieval leaderboards often rewarded the best one-shot retriever. ViDoRe v3’s pipeline framing makes room for:
- dense retrieval,
- sparse retrieval,
- hybrid retrieval,
- reranking,
- and agentic orchestration.
NVIDIA’s March 13 submission is one of the first prominent examples of a vendor optimizing for that broader category. It is effectively a statement that retrieval evaluation is shifting from “which embedding is best” to “which retrieval system is best.”
For teams that benchmark their own stacks, this is a useful prompt to revisit evaluation design. If you only compare embedding models, you may miss larger gains available from query planning, decomposition, and iterative retrieval.
The BRIGHT result also reflects custom retriever training
NVIDIA’s BRIGHT submission includes unusually detailed training notes for the retrieval backend. The reported llama-nv-embed-reasoning-3b model was trained on synthetic reasoning-oriented query-document pairs, with:
- top-4 similar documents retrieved by another embedder,
- reasoning-intensive queries generated by Qwen3-235B-A22B,
- positive-document annotation by Qwen3-Next-80B-A3B-Instruct,
- hard negatives filtered with Sim_neg < Sim_pos,
- additional data from ReasonEmbed, ReasonAug, and ReasonRank.
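The hard-negative filter in that list is worth spelling out. A minimal sketch of the Sim_neg < Sim_pos rule, with a toy cosine similarity (the actual embedders and thresholding details are NVIDIA’s, not shown here):

```python
def cosine_sim(a, b):
    # Toy cosine similarity over plain Python lists.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

def filter_hard_negatives(query_vec, positive_vec, candidate_vecs):
    sim_pos = cosine_sim(query_vec, positive_vec)
    # Keep a candidate negative only if it scores BELOW the annotated
    # positive; candidates scoring above it are likely relevant documents
    # mislabeled as negatives, which would poison contrastive training.
    return [c for c in candidate_vecs
            if cosine_sim(query_vec, c) < sim_pos]
```

The point of the filter is to keep negatives hard (similar to the query) without keeping false negatives (more similar than the labeled positive).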
That means the BRIGHT score is the product of two layers:
- a reasoning-trained retriever, and
- an LLM retrieval controller.
If you read the March 13 result as pure prompting, you miss the larger engineering picture. The pipeline quality depends on both the agent policy and a retriever trained for reasoning-heavy search.
Practical Takeaways
Three practical implications stand out.
First, iterative retrieval is now benchmark-credible. On March 13, NVIDIA showed that a ReAct-style retrieval loop can beat dense retrieval by meaningful margins on public leaderboards.
Second, the system cost is far above normal production budgets. A path that averages 136 to 148 seconds per query and burns hundreds of thousands of input tokens per request belongs in offline research, analyst workflows, or gated escalation paths.
Third, system design around tools matters. The switch from MCP-style tool serving to an in-process retriever is a concrete signal that retrieval agents need low-overhead tool infrastructure to scale.
If you build retrieval-heavy applications, test an agentic retrieval fallback against your dense baseline for the subset of queries that fail one-shot search. Start with multi-hop, ambiguous, or visually rich documents, measure lift against latency and token cost, and keep the agentic path behind a complexity gate rather than making it your default retrieval mode.
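One way to sketch that complexity gate: route a query to the agentic path only when cheap one-shot retrieval looks unreliable. The threshold and the lexical multi-hop signals below are illustrative placeholders, not tuned values.

```python
def route_query(query, dense_results, score_threshold=0.45):
    # Confidence signal: top similarity score from the one-shot dense pass.
    top_score = dense_results[0]["score"] if dense_results else 0.0
    # Crude multi-hop signal: comparative phrasing in the query itself.
    multi_hop_hint = any(word in query.lower()
                         for word in ("compare", "versus", "between", "both"))
    # Low retrieval confidence or multi-hop phrasing -> expensive agentic path.
    if top_score < score_threshold or multi_hop_hint:
        return "agentic"
    return "dense"
```

In production you would replace the lexical hints with a learned classifier or the dense score distribution, but the shape is the same: the agentic loop is a fallback for queries the cheap path is likely to fail, not the default route.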