
How to Build a Domain-Specific Embedding Model

Learn NVIDIA's recipe for fine-tuning a domain-specific embedding model in hours using synthetic data, hard negatives, BEIR, and NIM.

NVIDIA’s March 20 release gives you a practical path to turn nvidia/llama-nemotron-embed-1b-v2 into a domain-specific retriever in a few hours on one 80 GB GPU. You can use the published Nemotron recipe to generate synthetic retrieval data, mine hard negatives, fine-tune the embedding model, evaluate with BEIR, and deploy behind an OpenAI-compatible embeddings endpoint.

When this approach makes sense

A domain-specific embedding model is useful when your retrieval quality is limited by terminology, document structure, or query style that generic embedding models do not capture well. This applies to product docs, internal knowledge bases, support content, regulated text, and any corpus where query-document relevance depends on domain language.

The base model, nvidia/llama-nemotron-embed-1b-v2, is already a strong multilingual retriever. It supports 26 languages, accepts up to 8192 tokens, and can emit embeddings with dimensions 384, 512, 768, 1024, or 2048. If you need a refresher on how embedding models affect retrieval quality, see embeddings and the role they play in RAG systems.

This workflow is a good fit when you already have a corpus and want better retrieval without doing continued pretraining. That tradeoff is similar to the distinction covered in fine-tuning vs RAG and continued pretraining vs RAG.

What the workflow includes

The NVIDIA recipe is organized into six stages:

| Stage | Command | Purpose |
| --- | --- | --- |
| Synthetic data generation | nemotron embed sdg | Generate synthetic question-answer pairs from your documents |
| Data preparation | nemotron embed prep | Format the dataset and mine hard negatives |
| Fine-tuning | nemotron embed finetune | Train the embedding model on your domain pairs |
| Evaluation | nemotron embed eval | Measure retrieval quality with BEIR metrics |
| Export | nemotron embed export | Export to ONNX, optionally prepare TensorRT |
| Deployment | nemotron embed deploy | Serve through NVIDIA NIM |

You can find the full recipe in the Nemotron embedding workflow, and the end-to-end walkthrough in the Hugging Face release post.

Hardware and prerequisites

You need these prerequisites before you start:

| Requirement | Value |
| --- | --- |
| GPU | NVIDIA Ampere or newer |
| GPU memory | At least 80 GB |
| Tested configurations | 1x A100 80GB, 1x H100 80GB |
| Credentials | NVIDIA API key |

The published timing assumes this class of hardware. For a corpus of around 500 documents, the full flow takes about 2 to 3 hours. The larger under-a-day estimate covers the complete pipeline including generation, training, evaluation, export, and deployment.

Prepare your corpus

The input to this recipe is your domain document set. The synthetic data generation stage creates retrieval training pairs from those documents, so corpus quality matters more than prompt cleverness.

Keep documents clean and chunked in a way that preserves answerable context. Retrieval training works best when each passage can support realistic questions. If your chunks are too small, you lose the evidence needed for multi-hop or specific questions. If they are too large, positives become diffuse and negatives become less informative.
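As an illustration of that chunking tradeoff, here is a minimal fixed-size word-window splitter with overlap. The function name and the default sizes are illustrative, not part of the recipe; tune chunk size to what a single passage in your corpus can actually answer.

```python
def chunk_text(text, max_words=200, overlap=40):
    """Split text into overlapping word-window chunks.

    The overlap keeps evidence that straddles a chunk boundary
    available in at least one chunk, so answerable context
    is less likely to be cut in half.
    """
    words = text.split()
    if len(words) <= max_words:
        return [" ".join(words)]
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks
```

In practice you would split on structural boundaries (headings, paragraphs) first and only fall back to fixed windows for long runs of text.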

The released example dataset, nvidia/Retrieval-Synthetic-NVDocs-v1, contains 15.1k rows and is useful as a reference for expected structure and scale. You can inspect it on Hugging Face Datasets.

Generate synthetic retrieval data

The synthetic data generation, or SDG, stage uses NeMo Data Designer to create question-answer pairs from your documents. The pipeline produces both simple and multi-hop questions, with:

| Setting | Value |
| --- | --- |
| Complexity levels | 2 to 5 |
| Hop counts | 1 to 3 |
| Quality filter | Score-based filtering |
| Default quality threshold | 7.0 |

Only examples above the threshold are kept for training. That filter matters because low-quality synthetic queries create noisy positives and make hard negative mining less reliable.
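The filtering step itself is conceptually simple. A sketch, assuming each synthetic pair carries a judge-assigned quality score (the field name and function are illustrative, not the recipe's actual schema):

```python
QUALITY_THRESHOLD = 7.0  # the recipe's default threshold

def filter_by_quality(examples, threshold=QUALITY_THRESHOLD):
    """Keep only synthetic pairs whose quality score meets the threshold.

    `examples` is assumed to be a list of dicts with a "score" field
    assigned by an LLM quality judge.
    """
    return [ex for ex in examples if ex["score"] >= threshold]

pairs = [
    {"query": "How do I rotate an API key?", "score": 8.5},
    {"query": "What?", "score": 3.0},
    {"query": "Which ports does the gateway use?", "score": 7.0},
]
kept = filter_by_quality(pairs)  # drops the low-quality second pair
```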

The tutorial uses nvidia/nemotron-3-nano-30b-a3b for synthetic Q&A generation. If your domain has specialized vocabulary, this stage is where most of the adaptation signal comes from. You are effectively teaching the retriever how users in your domain are likely to ask for content.

NVIDIA reports this stage takes about 1 hour in the reference setup.

Prepare training data and mine hard negatives

The prep stage is where the recipe becomes much more useful than a basic positive-pair fine-tune. It embeds queries and passages, masks labeled positives, then mines hard negatives from the nearest remaining passages.

The default hard negative settings are:

| Parameter | Value |
| --- | --- |
| Hard negatives to mine | 5 |
| Margin filter | 95% of minimum positive score |
| Default negatives used in training | 4 |

The 95% margin filter reduces likely false negatives by excluding passages that score too close to known positives. That is important in dense retrieval, especially in technical corpora where multiple passages may legitimately answer the same question.
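The mining logic described above can be sketched as follows. This is a conceptual illustration under stated assumptions, not the recipe's actual implementation: it takes precomputed query-passage similarities for one query and excludes any candidate scoring above 95% of the weakest labeled positive before taking the top k.

```python
def mine_hard_negatives(sims, positive_ids, k=5, margin=0.95):
    """Pick the k highest-scoring non-positive passages as hard negatives.

    Any passage whose similarity exceeds margin * min(positive scores)
    is skipped, since it may be an unlabeled true positive.
    `sims` is a list of query-passage similarities for one query.
    """
    pos = set(positive_ids)
    ceiling = margin * min(sims[i] for i in positive_ids)
    candidates = [
        (i, s) for i, s in enumerate(sims)
        if i not in pos and s < ceiling
    ]
    candidates.sort(key=lambda t: t[1], reverse=True)
    return [i for i, _ in candidates[:k]]
```

A passage scoring 0.88 against a positive scoring 0.90 would be excluded (0.88 > 0.95 × 0.90 = 0.855), which is exactly the "likely false negative" case the margin filter targets.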

NVIDIA reports the prep stage takes about 5 minutes.

Fine-tune the embedding model

The fine-tuning stage uses NeMo Automodel and trains the base retriever with one positive and four hard negatives per query. The default hyperparameters are:

| Hyperparameter | Default |
| --- | --- |
| Base model | nvidia/llama-nemotron-embed-1b-v2 |
| Epochs | 3 |
| Learning rate | 1e-5 |
| Warmup steps | 5 |
| Global batch size | 128 |
| Passages per query | 5 |
| Passage mix | 1 positive + 4 hard negatives |

These defaults are tuned for a fast proof of concept. For many teams, the right first run is the default configuration on a representative subset of the corpus, followed by a relevance review of the misses.
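Conceptually, the 1-positive-plus-4-negatives mix corresponds to a contrastive objective: the model learns to rank the positive above the hard negatives. A minimal InfoNCE-style sketch of that loss for one query (the temperature value and function are illustrative, not the recipe's actual training code):

```python
import math

def contrastive_loss(sim_scores, temperature=0.05):
    """InfoNCE-style loss over one query's 5 passages.

    Index 0 is the positive; indices 1-4 are hard negatives.
    The loss is the negative log-probability of the positive
    under a softmax over temperature-scaled similarities.
    """
    logits = [s / temperature for s in sim_scores]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]
```

When the positive clearly outscores the negatives the loss approaches zero; when all five passages score the same it equals log(5), which is why informative hard negatives matter so much for training signal.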

The recipe also auto-scales for small datasets. If you have fewer than 2,000 training examples, it reduces batch size to 16 to 64, adjusts checkpoint frequency, and scales validation frequency. That makes small pilot runs practical with as few as 50 to 100 documents.

Training time is about 1 hour in the reference workflow.

Evaluate with retrieval metrics that matter

The evaluation stage uses BEIR, which is the correct choice for retrieval benchmarking because it measures ranking quality directly. The reported metrics are:

| Metric | What it captures |
| --- | --- |
| nDCG@k | Ranking quality with position-aware relevance |
| Recall@k | Whether relevant documents are retrieved in the top k |
| Precision@k | Relevance concentration in the top k |
| MAP@k | Average precision across the ranking |
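As a concrete reference for what the first two metrics compute, here is a minimal binary-relevance sketch of Recall@k and nDCG@k. BEIR's own implementations handle graded relevance and more edge cases; this is only meant to make the numbers interpretable.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of relevant documents that appear in the top k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance nDCG: position-discounted gain,
    normalized by the ideal ranking's DCG."""
    rel = set(relevant_ids)
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, doc in enumerate(ranked_ids[:k]) if doc in rel
    )
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(rel), k)))
    return dcg / ideal
```

The log2 discount is what makes nDCG position-aware: a relevant document at rank 1 contributes twice as much as one at rank 3.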

For the released synthetic benchmark, the published gains are:

| Metric | Base | Fine-tuned | Relative gain |
| --- | --- | --- | --- |
| nDCG@10 | 0.55506 | 0.61559 | +10.9% |
| Recall@10 | 0.62979 | 0.69296 | +10.0% |

A successful fine-tune typically yields around a 15% improvement in nDCG@10 and Recall@10 in under a day. In the cited validation case, Recall@60 improved from 0.751 to 0.951, a 26.7% relative gain, on a single A100 80GB.

This is where you decide whether the model is ready for production. If your application uses retrieval as part of an agent pipeline, track offline relevance before wiring it into downstream orchestration. That matches the same discipline used in evaluating agents, even though the unit under test here is retrieval rather than action planning.

Export for production inference

After evaluation, export the model for serving. The supported deployment path includes:

| Export option | Value |
| --- | --- |
| ONNX opset | 17 |
| TensorRT | Optional |
| Batch optimization profiles | 1 to 64 |
| Sequence length profiles | 3 to 256 |
| Quantization | Optional FP8 via quant_cfg=fp8 |

The export step takes about 5 minutes in the reference workflow.

The main tradeoff here is straightforward. ONNX gives you a portable deployment target quickly. TensorRT adds extra optimization work but is the path to tighter inference performance on NVIDIA hardware. If your serving stack is already standardized on GPU inference infrastructure, that extra step usually makes sense.

Deploy with NVIDIA NIM

The deployment target is NVIDIA NIM. Once deployed, the service exposes an OpenAI-compatible /v1/embeddings endpoint, which makes integration with existing embedding clients much easier.

That matters if your application already expects an embeddings API and a vector database downstream. You can swap in the fine-tuned model without redesigning your application interface. If you are still deciding on storage and indexing, this fits naturally with the database-side considerations in choosing a vector database.
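Because the endpoint follows the OpenAI embeddings schema, a client needs nothing NIM-specific. A minimal sketch using only the standard library; the base URL and served model name are placeholders for your own deployment, not values from the recipe.

```python
import json
import urllib.request

def build_payload(texts, model):
    """OpenAI-style embeddings request body: a model name plus
    a list of input strings."""
    return {"model": model, "input": texts}

def embed(texts, model, base_url="http://localhost:8000/v1"):
    """POST to an OpenAI-compatible /v1/embeddings endpoint and
    return one vector per input text."""
    req = urllib.request.Request(
        f"{base_url}/embeddings",
        data=json.dumps(build_payload(texts, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # OpenAI schema: body["data"][i]["embedding"] is the i-th vector
    return [item["embedding"] for item in body["data"]]
```

Swapping the fine-tuned model in then amounts to changing the model name and reindexing your vector store, with no client-side code changes.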

The workflow includes an accuracy verification step that compares deployed NIM results against BEIR evaluation. The published tolerances are:

| Check | Tolerance |
| --- | --- |
| @1 | 0.03 |
| @5+ | 0.01 |

That verification step is worth keeping in your release process. Embedding export and serving changes can affect ranking behavior enough to matter.

Practical tradeoffs and limits

This recipe is fast, but it has clear constraints.

The biggest one is hardware. You need an Ampere-or-newer GPU with 80 GB memory, and the published setup was tested on A100 80GB and H100 80GB. That puts it out of reach for lightweight local experimentation.

The second constraint is dependency on synthetic data quality. If your corpus is messy, redundant, or weakly structured, synthetic questions can mirror that quality. The thresholding and negative mining help, but they do not replace corpus cleanup.

Third, this workflow optimizes for retrieval, not generation. If your application is failing because the answer model cannot synthesize or reason over retrieved context, improving embeddings only fixes the first half of the pipeline.

The fastest way to run a first experiment

Start with a focused corpus slice, around 50 to 100 documents, and let the recipe auto-scale for the smaller dataset. Keep the published defaults for epochs, learning rate, batch size behavior, and hard negative mining. Then compare nDCG@10 and Recall@10 against the base llama-nemotron-embed-1b-v2 model before you export anything.

If the gains hold on your real retrieval queries, move straight to ONNX export and a NIM deployment behind /v1/embeddings, then reindex your vector store with the new embeddings.
