How to Build a Domain-Specific Embedding Model
Learn NVIDIA's recipe for fine-tuning a domain-specific embedding model in hours using synthetic data, hard negatives, BEIR, and NIM.
NVIDIA’s March 20 release gives you a practical path to turn nvidia/llama-nemotron-embed-1b-v2 into a domain-specific retriever in a few hours on one 80 GB GPU. You can use the published Nemotron recipe to generate synthetic retrieval data, mine hard negatives, fine-tune the embedding model, evaluate with BEIR, and deploy behind an OpenAI-compatible embeddings endpoint.
When this approach makes sense
A domain-specific embedding model is useful when your retrieval quality is limited by terminology, document structure, or query style that generic embedding models do not capture well. This applies to product docs, internal knowledge bases, support content, regulated text, and any corpus where query-document relevance depends on domain language.
The base model, nvidia/llama-nemotron-embed-1b-v2, is already a strong multilingual retriever. It supports 26 languages, accepts up to 8192 tokens, and can emit embeddings with dimensions 384, 512, 768, 1024, or 2048. If you need a refresher on how embedding models affect retrieval quality, see embeddings and the role they play in RAG systems.
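The multiple output dimensions suggest a Matryoshka-style scheme, where a smaller embedding is the truncated, re-normalized prefix of the full vector. Here is a minimal sketch of that convention; note that this is an assumption about how the reduced dimensions behave, not a confirmed detail of this model:

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Matryoshka-style reduction: keep the first `dim` components,
    then L2-renormalize so cosine similarity stays meaningful."""
    reduced = vec[:dim]
    return reduced / np.linalg.norm(reduced)

full = np.random.randn(2048)           # stand-in for a 2048-dim embedding
small = truncate_embedding(full, 384)  # 384-dim variant for cheaper indexes
```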
This workflow is a good fit when you already have a corpus and want better retrieval without doing continued pretraining. That tradeoff is similar to the distinction covered in fine-tuning vs RAG and continued pretraining vs RAG.
What the workflow includes
The NVIDIA recipe is organized into six stages:
| Stage | Command | Purpose |
|---|---|---|
| Synthetic data generation | nemotron embed sdg | Generate synthetic question-answer pairs from your documents |
| Data preparation | nemotron embed prep | Format the dataset and mine hard negatives |
| Fine-tuning | nemotron embed finetune | Train the embedding model on your domain pairs |
| Evaluation | nemotron embed eval | Measure retrieval quality with BEIR metrics |
| Export | nemotron embed export | Export to ONNX, optionally prepare TensorRT |
| Deployment | nemotron embed deploy | Serve through NVIDIA NIM |
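Only the subcommand names above are published here, so treat this as a hedged sketch of chaining the stages from Python; real invocations will need flags for paths, models, and credentials that the table does not cover:

```python
import subprocess

# The six published stages, in order. Flags and arguments are omitted
# because only the subcommand names appear in the recipe summary above.
STAGES = ["sdg", "prep", "finetune", "eval", "export", "deploy"]

for stage in STAGES:
    # check=True stops the pipeline at the first failing stage
    subprocess.run(["nemotron", "embed", stage], check=True)
```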
You can find the full recipe in the Nemotron embedding workflow, and the end-to-end walkthrough in the Hugging Face release post.
Hardware and prerequisites
You need these prerequisites before you start:
| Requirement | Value |
|---|---|
| GPU | NVIDIA Ampere or newer |
| GPU memory | At least 80 GB |
| Tested configurations | 1x A100 80GB, 1x H100 80GB |
| Credentials | NVIDIA API key |
The published timing assumes this class of hardware. For a corpus of around 500 documents, the full flow takes about 2 to 3 hours. The broader under-a-day estimate covers the complete pipeline: generation, training, evaluation, export, and deployment.
Prepare your corpus
The input to this recipe is your domain document set. The synthetic data generation stage creates retrieval training pairs from those documents, so corpus quality matters more than prompt cleverness.
Keep documents clean and chunked in a way that preserves answerable context. Retrieval training works best when each passage can support realistic questions. If your chunks are too small, you lose the evidence needed for multi-hop or specific questions. If they are too large, positives become diffuse and negatives become less informative.
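As a rough illustration of "chunked in a way that preserves answerable context," here is a hedged paragraph-aware chunker with overlap; the sizes are placeholders you would tune per corpus, not values from the recipe:

```python
def chunk_document(text: str, max_chars: int = 2000, overlap: int = 200) -> list[str]:
    """Split on paragraph boundaries, packing paragraphs into chunks of
    roughly max_chars with a small overlap so evidence is not cut mid-thought."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) > max_chars:
            chunks.append(current)
            current = current[-overlap:]  # carry a small tail for context
        current = (current + "\n\n" + para).strip()
    if current:
        chunks.append(current)
    return chunks
```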
The released example dataset, nvidia/Retrieval-Synthetic-NVDocs-v1, contains 15.1k rows and is useful as a reference for expected structure and scale. You can inspect it on Hugging Face Datasets.
Generate synthetic retrieval data
The synthetic data generation, or SDG, stage uses NeMo Data Designer to create question-answer pairs from your documents. The pipeline produces both simple and multi-hop questions, with:
| Setting | Value |
|---|---|
| Complexity levels | 2 to 5 |
| Hop counts | 1 to 3 |
| Quality filter | Score-based filtering |
| Default quality threshold | 7.0 |
Only examples above the threshold are kept for training. That filter matters because low-quality synthetic queries create noisy positives and make hard negative mining less reliable.
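The thresholding logic is simple in principle. A hedged sketch, assuming each synthetic pair carries an LLM-judged quality score under a hypothetical `quality_score` field:

```python
QUALITY_THRESHOLD = 7.0  # the recipe's published default

def filter_pairs(pairs: list[dict]) -> list[dict]:
    """Keep only synthetic query-passage pairs whose judge score clears
    the threshold; low-scoring pairs would become training noise."""
    return [p for p in pairs if p.get("quality_score", 0.0) >= QUALITY_THRESHOLD]
```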
The tutorial uses nvidia/nemotron-3-nano-30b-a3b for synthetic Q&A generation. If your domain has specialized vocabulary, this stage is where most of the adaptation signal comes from. You are effectively teaching the retriever how users in your domain are likely to ask for content.
NVIDIA reports this stage takes about 1 hour in the reference setup.
Prepare training data and mine hard negatives
The prep stage is where the recipe becomes much more useful than a basic positive-pair fine-tune. It embeds queries and passages, masks labeled positives, then mines hard negatives from the nearest remaining passages.
The default hard negative settings are:
| Parameter | Value |
|---|---|
| Hard negatives to mine | 5 |
| Margin filter | 95% of minimum positive score |
| Default negatives used in training | 4 |
The 95% margin filter reduces likely false negatives by excluding passages that score too close to known positives. That is important in dense retrieval, especially in technical corpora where multiple passages may legitimately answer the same question.
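To make the margin filter concrete, here is a hedged sketch of mining for a single query, assuming you already have L2-normalized query and passage embeddings; the recipe's actual implementation lives inside the prep stage and may differ:

```python
import numpy as np

def mine_hard_negatives(q: np.ndarray, passages: np.ndarray,
                        positive_ids: set[int], k: int = 5,
                        margin: float = 0.95) -> list[int]:
    """Return up to k hard negative indices for one query.
    Passages scoring above margin * min(positive scores) are excluded
    as likely false negatives."""
    scores = passages @ q                      # cosine scores (unit vectors)
    min_pos = min(scores[i] for i in positive_ids)
    cutoff = margin * min_pos
    candidates = [
        (scores[i], i) for i in range(len(passages))
        if i not in positive_ids and scores[i] < cutoff
    ]
    candidates.sort(reverse=True)              # hardest (highest-scoring) first
    return [i for _, i in candidates[:k]]
```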
NVIDIA reports the prep stage takes about 5 minutes.
Fine-tune the embedding model
The fine-tuning stage uses NeMo Automodel and trains the base retriever with one positive and four hard negatives per query. The default hyperparameters are:
| Hyperparameter | Default |
|---|---|
| Base model | nvidia/llama-nemotron-embed-1b-v2 |
| Epochs | 3 |
| Learning rate | 1e-5 |
| Warmup steps | 5 |
| Global batch size | 128 |
| Passages per query | 5 |
| Passage mix | 1 positive + 4 hard negatives |
These defaults are tuned for a fast proof of concept. For many teams, the right first run is the default configuration on a representative subset of the corpus, followed by a relevance review of the misses.
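To make the objective concrete, here is a hedged PyTorch sketch of a standard InfoNCE-style contrastive loss over one positive and four hard negatives per query. NeMo Automodel's actual training loop may differ in details such as temperature and in-batch negatives:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q_emb: torch.Tensor, passage_emb: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """q_emb: (B, D) query embeddings.
    passage_emb: (B, 5, D) per query -- index 0 is the positive,
    indices 1..4 are the mined hard negatives."""
    q = F.normalize(q_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    # (B, 5): similarity of each query to its own 5 candidate passages
    logits = torch.einsum("bd,bkd->bk", q, p) / temperature
    # The correct "class" is always index 0, where the positive sits.
    targets = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, targets)
```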
The recipe also auto-scales for small datasets. If you have fewer than 2,000 training examples, it reduces the batch size to between 16 and 64 and adjusts checkpoint and validation frequency. That makes small pilot runs practical with as few as 50 to 100 documents.
Training time is about 1 hour in the reference workflow.
Evaluate with retrieval metrics that matter
The evaluation stage uses BEIR, which is the correct choice for retrieval benchmarking because it measures ranking quality directly. The reported metrics are:
| Metric | What it captures |
|---|---|
| nDCG@k | Ranking quality with position-aware relevance |
| Recall@k | Whether relevant documents are retrieved in the top k |
| Precision@k | Relevance concentration in the top k |
| MAP@k | Average precision across the ranking |
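If you want to sanity-check reported numbers against your own query set, the two headline metrics are easy to compute directly. A minimal sketch for binary relevance judgments:

```python
import math

def ndcg_at_k(ranked_ids: list[str], relevant: set[str], k: int = 10) -> float:
    """Binary-relevance nDCG@k: position-discounted gain over the ideal ranking."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(ranked_ids[:k]) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0

def recall_at_k(ranked_ids: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    hits = sum(1 for doc in ranked_ids[:k] if doc in relevant)
    return hits / len(relevant) if relevant else 0.0
```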
For the released synthetic benchmark, the published gains are:
| Metric | Base | Fine-tuned | Relative gain |
|---|---|---|---|
| NDCG@10 | 0.55506 | 0.61559 | +10.9% |
| Recall@10 | 0.62979 | 0.69296 | +10.0% |
A successful fine-tune typically yields roughly 15% improvements in nDCG@10 and Recall@10 in under a day. In the cited validation case, Recall@60 improved from 0.751 to 0.951, a 26.7% relative gain, on a single A100 80GB.
This is where you decide whether the model is ready for production. If your application uses retrieval as part of an agent pipeline, track offline relevance before wiring it into downstream orchestration. That matches the same discipline used in evaluating agents, even though the unit under test here is retrieval rather than action planning.
Export for production inference
After evaluation, export the model for serving. The supported deployment path includes:
| Export option | Value |
|---|---|
| ONNX opset | 17 |
| TensorRT | Optional |
| Batch optimization profiles | 1 to 64 |
| Sequence length profiles | 3 to 256 |
| Quantization | Optional FP8 via quant_cfg=fp8 |
The export step takes about 5 minutes in the reference workflow.
The main tradeoff here is straightforward. ONNX gives you a portable deployment target quickly. TensorRT adds extra optimization work but is the path to tighter inference performance on NVIDIA hardware. If your serving stack is already standardized on GPU inference infrastructure, that extra step usually makes sense.
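For a quick smoke test of the exported graph, here is a hedged ONNX Runtime sketch. The file path, input names, and mean pooling are common conventions for transformer exports, not confirmed details of this particular export:

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Path and input names below are assumptions for illustration.
session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
tokenizer = AutoTokenizer.from_pretrained("nvidia/llama-nemotron-embed-1b-v2")

enc = tokenizer(["how do I rotate an API key?"], padding=True,
                truncation=True, return_tensors="np")
outputs = session.run(None, {"input_ids": enc["input_ids"],
                             "attention_mask": enc["attention_mask"]})
hidden = outputs[0]  # assumed shape: (batch, seq_len, dim)

# Mean-pool over non-padding tokens, then L2-normalize.
mask = enc["attention_mask"][..., None]
emb = (hidden * mask).sum(axis=1) / mask.sum(axis=1)
emb = emb / np.linalg.norm(emb, axis=-1, keepdims=True)
```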
Deploy with NVIDIA NIM
The deployment target is NVIDIA NIM. Once deployed, the service exposes an OpenAI-compatible /v1/embeddings endpoint, which makes integration with existing embedding clients much easier.
That matters if your application already expects an embeddings API and a vector database downstream. You can swap in the fine-tuned model without redesigning your application interface. If you are still deciding on storage and indexing, this fits naturally with the database-side considerations in choosing a vector database.
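Swapping in the fine-tuned model can then be as small as pointing an OpenAI client at the NIM base URL. The URL and model name below are illustrative placeholders for your deployment:

```python
from openai import OpenAI

# base_url and model name are placeholders; substitute your NIM endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used-locally")

response = client.embeddings.create(
    model="my-finetuned-nemotron-embed",
    input=["how do I rotate an API key?"],
)
vector = response.data[0].embedding
print(len(vector))
```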
The workflow includes an accuracy verification step that compares deployed NIM results against BEIR evaluation. The published tolerances are:
| Check | Tolerance |
|---|---|
| @1 | 0.03 |
| @5+ | 0.01 |
That verification step is worth keeping in your release process. Embedding export and serving changes can affect ranking behavior enough to matter.
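Automating that check in a release pipeline is cheap. A hedged sketch comparing the deployed endpoint's metrics against the offline BEIR run, using the published tolerances:

```python
# Published tolerances: 0.03 for @1 metrics, 0.01 for @5 and beyond.
TOLERANCES = {1: 0.03, 5: 0.01, 10: 0.01}

def verify(offline: dict[int, float], deployed: dict[int, float]) -> None:
    """offline/deployed map a cutoff k to a metric value (e.g. nDCG@k)."""
    for k, tol in TOLERANCES.items():
        drift = abs(offline[k] - deployed[k])
        assert drift <= tol, f"@{k} drifted by {drift:.4f} (tolerance {tol})"
```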
Practical tradeoffs and limits
This recipe is fast, but it has clear constraints.
The biggest one is hardware. You need an Ampere-or-newer GPU with 80 GB memory, and the published setup was tested on A100 80GB and H100 80GB. That puts it out of reach for lightweight local experimentation.
The second constraint is dependency on synthetic data quality. If your corpus is messy, redundant, or weakly structured, synthetic questions can mirror that quality. The thresholding and negative mining help, but they do not replace corpus cleanup.
Third, this workflow optimizes for retrieval, not generation. If your application is failing because the answer model cannot synthesize or reason over retrieved context, improving embeddings only fixes the first half of the pipeline.
The fastest way to run a first experiment
Start with a focused corpus slice, around 50 to 100 documents, and let the recipe auto-scale for the smaller dataset. Keep the published defaults for epochs, learning rate, batch size behavior, and hard negative mining. Then compare nDCG@10 and Recall@10 against the base llama-nemotron-embed-1b-v2 model before you export anything.
If the gains hold on your real retrieval queries, move straight to ONNX export and a NIM deployment behind /v1/embeddings, then reindex your vector store with the new embeddings.