How to Build a Domain-Specific Embedding Model
Learn NVIDIA's recipe for fine-tuning a domain-specific embedding model in hours using synthetic data, hard negatives, BEIR, and NIM.
NVIDIA’s March 20 release gives you a practical path to turn nvidia/llama-nemotron-embed-1b-v2 into a domain-specific retriever in a few hours on one 80 GB GPU. You can use the published Nemotron recipe to generate synthetic retrieval data, mine hard negatives, fine-tune the embedding model, evaluate with BEIR, and deploy behind an OpenAI-compatible embeddings endpoint.
When this approach makes sense
A domain-specific embedding model is useful when your retrieval quality is limited by terminology, document structure, or query style that generic embedding models do not capture well. This applies to product docs, internal knowledge bases, support content, regulated text, and any corpus where query-document relevance depends on domain language.
The base model, nvidia/llama-nemotron-embed-1b-v2, is already a strong multilingual retriever. It supports 26 languages, accepts up to 8192 tokens, and can emit embeddings with dimensions 384, 512, 768, 1024, or 2048. If you need a refresher on how embedding models affect retrieval quality, see embeddings and the role they play in RAG systems.
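The multiple output dimensions suggest a Matryoshka-style scheme, where a smaller embedding is the truncated, re-normalized prefix of the full vector. Here is a minimal sketch of that convention; note that this is an assumption about how the reduced dimensions behave, not a confirmed detail of this model:

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Matryoshka-style reduction: keep the first `dim` components,
    then L2-renormalize so cosine similarity stays meaningful."""
    reduced = vec[:dim]
    return reduced / np.linalg.norm(reduced)

full = np.random.randn(2048)           # stand-in for a 2048-dim embedding
small = truncate_embedding(full, 384)  # 384-dim variant for cheaper indexes
```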
This workflow is a good fit when you already have a corpus and want better retrieval without doing continued pretraining. That tradeoff is similar to the distinction covered in fine-tuning vs RAG and continued pretraining vs RAG.
What the workflow includes
The NVIDIA recipe is organized into six stages:
| Stage | Command | Purpose |
|---|---|---|
| Synthetic data generation | nemotron embed sdg | Generate synthetic question-answer pairs from your documents |
| Data preparation | nemotron embed prep | Format the dataset and mine hard negatives |
| Fine-tuning | nemotron embed finetune | Train the embedding model on your domain pairs |
| Evaluation | nemotron embed eval | Measure retrieval quality with BEIR metrics |
| Export | nemotron embed export | Export to ONNX, optionally prepare TensorRT |
| Deployment | nemotron embed deploy | Serve through NVIDIA NIM |
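Only the subcommand names above are published here, so treat this as a hedged sketch of chaining the stages from Python; real invocations will need flags for paths, models, and credentials that the table does not cover:

```python
import subprocess

# The six published stages, in order. Flags and arguments are omitted
# because only the subcommand names appear in the recipe summary above.
STAGES = ["sdg", "prep", "finetune", "eval", "export", "deploy"]

for stage in STAGES:
    # check=True stops the pipeline at the first failing stage
    subprocess.run(["nemotron", "embed", stage], check=True)
```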
You can find the full recipe in the Nemotron embedding workflow, and the end-to-end walkthrough in the Hugging Face release post.
Hardware and prerequisites
You need these prerequisites before you start:
| Requirement | Value |
|---|---|
| GPU | NVIDIA Ampere or newer |
| GPU memory | At least 80 GB |
| Tested configurations | 1x A100 80GB, 1x H100 80GB |
| Credentials | NVIDIA API key |
The published timing assumes this class of hardware. For a corpus of around 500 documents, the full flow takes about 2 to 3 hours. The broader under-a-day estimate covers the complete pipeline: generation, training, evaluation, export, and deployment.
Prepare your corpus
The input to this recipe is your domain document set. The synthetic data generation stage creates retrieval training pairs from those documents, so corpus quality matters more than prompt cleverness.
Keep documents clean and chunked in a way that preserves answerable context. Retrieval training works best when each passage can support realistic questions. If your chunks are too small, you lose the evidence needed for multi-hop or specific questions. If they are too large, positives become diffuse and negatives become less informative.
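As a rough illustration of "chunked in a way that preserves answerable context," here is a hedged paragraph-aware chunker with overlap; the sizes are placeholders you would tune per corpus, not values from the recipe:

```python
def chunk_document(text: str, max_chars: int = 2000, overlap: int = 200) -> list[str]:
    """Split on paragraph boundaries, packing paragraphs into chunks of
    roughly max_chars with a small overlap so evidence is not cut mid-thought."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) > max_chars:
            chunks.append(current)
            current = current[-overlap:]  # carry a small tail for context
        current = (current + "\n\n" + para).strip()
    if current:
        chunks.append(current)
    return chunks
```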
The released example dataset, nvidia/Retrieval-Synthetic-NVDocs-v1, contains 15.1k rows and is useful as a reference for expected structure and scale. You can inspect it on Hugging Face Datasets.
Generate synthetic retrieval data
The synthetic data generation, or SDG, stage uses NeMo Data Designer to create question-answer pairs from your documents. The pipeline produces both simple and multi-hop questions, with:
| Setting | Value |
|---|---|
| Complexity levels | 2 to 5 |
| Hop counts | 1 to 3 |
| Quality filter | Score-based filtering |
| Default quality threshold | 7.0 |
Only examples above the threshold are kept for training. That filter matters because low-quality synthetic queries create noisy positives and make hard negative mining less reliable.
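The thresholding logic is simple in principle. A hedged sketch, assuming each synthetic pair carries an LLM-judged quality score under a hypothetical `quality_score` field:

```python
QUALITY_THRESHOLD = 7.0  # the recipe's published default

def filter_pairs(pairs: list[dict]) -> list[dict]:
    """Keep only synthetic query-passage pairs whose judge score clears
    the threshold; low-scoring pairs would become training noise."""
    return [p for p in pairs if p.get("quality_score", 0.0) >= QUALITY_THRESHOLD]
```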
The tutorial uses nvidia/nemotron-3-nano-30b-a3b for synthetic Q&A generation. If your domain has specialized vocabulary, this stage is where most of the adaptation signal comes from. You are effectively teaching the retriever how users in your domain are likely to ask for content.
NVIDIA reports this stage takes about 1 hour in the reference setup.
Prepare training data and mine hard negatives
The prep stage is where the recipe becomes much more useful than a basic positive-pair fine-tune. It embeds queries and passages, masks labeled positives, then mines hard negatives from the nearest remaining passages.
The default hard negative settings are:
| Parameter | Value |
|---|---|
| Hard negatives to mine | 5 |
| Margin filter | 95% of minimum positive score |
| Default negatives used in training | 4 |
The 95% margin filter reduces likely false negatives by excluding passages that score too close to known positives. That is important in dense retrieval, especially in technical corpora where multiple passages may legitimately answer the same question.
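To make the margin filter concrete, here is a hedged sketch of mining for a single query, assuming you already have L2-normalized query and passage embeddings; the recipe's actual implementation lives inside the prep stage and may differ:

```python
import numpy as np

def mine_hard_negatives(q: np.ndarray, passages: np.ndarray,
                        positive_ids: set[int], k: int = 5,
                        margin: float = 0.95) -> list[int]:
    """Return up to k hard negative indices for one query.
    Passages scoring above margin * min(positive scores) are excluded
    as likely false negatives."""
    scores = passages @ q                      # cosine scores (unit vectors)
    min_pos = min(scores[i] for i in positive_ids)
    cutoff = margin * min_pos
    candidates = [
        (scores[i], i) for i in range(len(passages))
        if i not in positive_ids and scores[i] < cutoff
    ]
    candidates.sort(reverse=True)              # hardest (highest-scoring) first
    return [i for _, i in candidates[:k]]
```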
NVIDIA reports the prep stage takes about 5 minutes.
Fine-tune the embedding model
The fine-tuning stage uses NeMo Automodel and trains the base retriever with one positive and four hard negatives per query. The default hyperparameters are:
| Hyperparameter | Default |
|---|---|
| Base model | nvidia/llama-nemotron-embed-1b-v2 |
| Epochs | 3 |
| Learning rate | 1e-5 |
| Warmup steps | 5 |
| Global batch size | 128 |
| Passages per query | 5 |
| Passage mix | 1 positive + 4 hard negatives |
These defaults are tuned for a fast proof of concept. For many teams, the right first run is the default configuration on a representative subset of the corpus, followed by a relevance review of the misses.
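To make the objective concrete, here is a hedged PyTorch sketch of a standard InfoNCE-style contrastive loss over one positive and four hard negatives per query. NeMo Automodel's actual training loop may differ in details such as temperature and in-batch negatives:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q_emb: torch.Tensor, passage_emb: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """q_emb: (B, D) query embeddings.
    passage_emb: (B, 5, D) per query -- index 0 is the positive,
    indices 1..4 are the mined hard negatives."""
    q = F.normalize(q_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    # (B, 5): similarity of each query to its own 5 candidate passages
    logits = torch.einsum("bd,bkd->bk", q, p) / temperature
    # The correct "class" is always index 0, where the positive sits.
    targets = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, targets)
```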
The recipe also auto-scales for small datasets. If you have fewer than 2,000 training examples, it reduces the batch size to between 16 and 64 and adjusts checkpoint and validation frequency. That makes small pilot runs practical with as few as 50 to 100 documents.
Training time is about 1 hour in the reference workflow.
Evaluate with retrieval metrics that matter
The evaluation stage uses BEIR, which is the correct choice for retrieval benchmarking because it measures ranking quality directly. The reported metrics are:
| Metric | What it captures |
|---|---|
| nDCG@k | Ranking quality with position-aware relevance |
| Recall@k | Whether relevant documents are retrieved in the top k |
| Precision@k | Relevance concentration in the top k |
| MAP@k | Average precision across the ranking |
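If you want to sanity-check reported numbers against your own query set, the two headline metrics are easy to compute directly. A minimal sketch for binary relevance judgments:

```python
import math

def ndcg_at_k(ranked_ids: list[str], relevant: set[str], k: int = 10) -> float:
    """Binary-relevance nDCG@k: position-discounted gain over the ideal ranking."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(ranked_ids[:k]) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0

def recall_at_k(ranked_ids: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    hits = sum(1 for doc in ranked_ids[:k] if doc in relevant)
    return hits / len(relevant) if relevant else 0.0
```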
For the released synthetic benchmark, the published gains are:
| Metric | Base | Fine-tuned | Relative gain |
|---|---|---|---|
| NDCG@10 | 0.55506 | 0.61559 | +10.9% |
| Recall@10 | 0.62979 | 0.69296 | +10.0% |
A successful fine-tune typically yields roughly 15% improvements in nDCG@10 and Recall@10 in under a day. In the cited validation case, Recall@60 improved from 0.751 to 0.951, a 26.7% relative gain, on a single A100 80GB.
This is where you decide whether the model is ready for production. If your application uses retrieval as part of an agent pipeline, track offline relevance before wiring it into downstream orchestration. That matches the same discipline used in evaluating agents, even though the unit under test here is retrieval rather than action planning.
Export for production inference
After evaluation, export the model for serving. The supported deployment path includes:
| Export option | Value |
|---|---|
| ONNX opset | 17 |
| TensorRT | Optional |
| Batch optimization profiles | 1 to 64 |
| Sequence length profiles | 3 to 256 |
| Quantization | Optional FP8 via quant_cfg=fp8 |
The export step takes about 5 minutes in the reference workflow.
The main tradeoff here is straightforward. ONNX gives you a portable deployment target quickly. TensorRT adds extra optimization work but is the path to tighter inference performance on NVIDIA hardware. If your serving stack is already standardized on GPU inference infrastructure, that extra step usually makes sense.
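For a quick smoke test of the exported graph, here is a hedged ONNX Runtime sketch. The file path, input names, and mean pooling are common conventions for transformer exports, not confirmed details of this particular export:

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Path and input names below are assumptions for illustration.
session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
tokenizer = AutoTokenizer.from_pretrained("nvidia/llama-nemotron-embed-1b-v2")

enc = tokenizer(["how do I rotate an API key?"], padding=True,
                truncation=True, return_tensors="np")
outputs = session.run(None, {"input_ids": enc["input_ids"],
                             "attention_mask": enc["attention_mask"]})
hidden = outputs[0]  # assumed shape: (batch, seq_len, dim)

# Mean-pool over non-padding tokens, then L2-normalize.
mask = enc["attention_mask"][..., None]
emb = (hidden * mask).sum(axis=1) / mask.sum(axis=1)
emb = emb / np.linalg.norm(emb, axis=-1, keepdims=True)
```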
Deploy with NVIDIA NIM
The deployment target is NVIDIA NIM. Once deployed, the service exposes an OpenAI-compatible /v1/embeddings endpoint, which makes integration with existing embedding clients much easier.
That matters if your application already expects an embeddings API and a vector database downstream. You can swap in the fine-tuned model without redesigning your application interface. If you are still deciding on storage and indexing, this fits naturally with the database-side considerations in choosing a vector database.
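Swapping in the fine-tuned model can then be as small as pointing an OpenAI client at the NIM base URL. The URL and model name below are illustrative placeholders for your deployment:

```python
from openai import OpenAI

# base_url and model name are placeholders; substitute your NIM endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used-locally")

response = client.embeddings.create(
    model="my-finetuned-nemotron-embed",
    input=["how do I rotate an API key?"],
)
vector = response.data[0].embedding
print(len(vector))
```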
The workflow includes an accuracy verification step that compares deployed NIM results against BEIR evaluation. The published tolerances are:
| Check | Tolerance |
|---|---|
| @1 | 0.03 |
| @5+ | 0.01 |
That verification step is worth keeping in your release process. Embedding export and serving changes can affect ranking behavior enough to matter.
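Automating that check in a release pipeline is cheap. A hedged sketch comparing the deployed endpoint's metrics against the offline BEIR run, using the published tolerances:

```python
# Published tolerances: 0.03 for @1 metrics, 0.01 for @5 and beyond.
TOLERANCES = {1: 0.03, 5: 0.01, 10: 0.01}

def verify(offline: dict[int, float], deployed: dict[int, float]) -> None:
    """offline/deployed map a cutoff k to a metric value (e.g. nDCG@k)."""
    for k, tol in TOLERANCES.items():
        drift = abs(offline[k] - deployed[k])
        assert drift <= tol, f"@{k} drifted by {drift:.4f} (tolerance {tol})"
```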
Practical tradeoffs and limits
This recipe is fast, but it has clear constraints.
The biggest one is hardware. You need an Ampere-or-newer GPU with 80 GB memory, and the published setup was tested on A100 80GB and H100 80GB. That puts it out of reach for lightweight local experimentation.
The second constraint is dependency on synthetic data quality. If your corpus is messy, redundant, or weakly structured, synthetic questions can mirror that quality. The thresholding and negative mining help, but they do not replace corpus cleanup.
Third, this workflow optimizes for retrieval, not generation. If your application is failing because the answer model cannot synthesize or reason over retrieved context, improving embeddings only fixes the first half of the pipeline.
The fastest way to run a first experiment
Start with a focused corpus slice, around 50 to 100 documents, and let the recipe auto-scale for the smaller dataset. Keep the published defaults for epochs, learning rate, batch size behavior, and hard negative mining. Then compare nDCG@10 and Recall@10 against the base llama-nemotron-embed-1b-v2 model before you export anything.
If the gains hold on your real retrieval queries, move straight to ONNX export and a NIM deployment behind /v1/embeddings, then reindex your vector store with the new embeddings.