How to Run NVIDIA Nemotron 3 Nano 4B Locally on Jetson and RTX
Learn to deploy NVIDIA's Nemotron 3 Nano 4B locally with BF16, FP8, or GGUF on Jetson, RTX, vLLM, TensorRT-LLM, and llama.cpp.
NVIDIA’s Nemotron 3 Nano 4B gives you a compact local model for Jetson and RTX systems, with launch-day availability in BF16, FP8, and GGUF Q4_K_M formats. The March 17 release focused on edge deployment, and the official announcement plus model cards provide the core setup details you need to choose a runtime, pick the right quantization, and run it on local NVIDIA hardware.
This model is aimed at on-device conversational agents, tool use, gaming agents, voice assistants, and embedded AI. If you are still deciding whether a fully local model fits your product, it helps to compare that architecture choice with AI Agents vs Chatbots: What’s the Difference? and Fine-Tuning vs RAG: When to Use Each Approach.
What you can run, and where
Nemotron 3 Nano 4B is a 3.97B-parameter Mamba2-Transformer hybrid model. NVIDIA says it is its first model specifically optimized for on-device deployment.
The published variants are:
| Variant | Format | Notes |
|---|---|---|
| NVIDIA-Nemotron-3-Nano-4B-BF16 | BF16 | Full-precision deployment option |
| NVIDIA-Nemotron-3-Nano-4B-FP8 | FP8 | Quantized with ModelOpt, mixed with selected BF16 layers |
| NVIDIA-Nemotron-3-Nano-4B-GGUF | GGUF Q4_K_M | 4-bit llama.cpp-friendly variant |
The model card lists support for Transformers, vLLM, TensorRT-LLM, and llama.cpp. It also lists supported hardware including A10G, H100-80GB, A100, and GeForce RTX, while the release blog additionally positions the model for Jetson Orin Nano, Jetson Thor, and DGX Spark.
For local deployment, the practical split is simple:
| Hardware | Best starting format | Best starting runtime |
|---|---|---|
| Jetson Orin Nano 8GB | GGUF Q4_K_M | llama.cpp |
| Jetson Thor | FP8 or GGUF | TensorRT-LLM or llama.cpp |
| GeForce RTX | GGUF Q4_K_M or BF16 | llama.cpp, Transformers, or vLLM |
| Larger datacenter GPU | BF16 or FP8 | vLLM or TensorRT-LLM |
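The practical split above can be encoded as a small lookup helper, which is handy if you script deployments across a fleet of mixed devices. The device keys and pairings below simply restate the table; the function name and key spellings are illustrative, not part of any NVIDIA API.

```python
# Illustrative helper encoding the hardware table above. Device keys are
# made-up identifiers; the (format, runtime) pairs come from the table.
def starting_config(hardware: str) -> tuple[str, str]:
    """Map a target device to a (model format, runtime) starting point."""
    table = {
        "jetson-orin-nano-8gb": ("GGUF Q4_K_M", "llama.cpp"),
        "jetson-thor": ("FP8", "TensorRT-LLM"),
        "geforce-rtx": ("GGUF Q4_K_M", "llama.cpp"),
        "datacenter": ("BF16", "vLLM"),
    }
    return table[hardware]

print(starting_config("jetson-orin-nano-8gb"))  # ('GGUF Q4_K_M', 'llama.cpp')
```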
If you want a broader local-model primer before choosing a serving stack, see How to Run LLMs Locally on Your Machine.
Start with the right model variant
Your first decision is memory budget versus quality versus runtime support.
Use GGUF Q4_K_M for Jetson and lightweight local setups
The GGUF release is the easiest fit for constrained devices. NVIDIA’s GGUF model card lists:
- Quantization: Q4_K_M
- File size: 2.84 GB
- Architecture: nemotron_h
The release blog says this variant reaches 18 tokens/s on Jetson Orin Nano 8GB, and delivers up to 2× higher throughput than Nemotron Nano 9B v2 on that device.
This is the safest starting point if you are targeting embedded inference or need a quick local proof of concept on Jetson.
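Since the release materials don't ship a runnable snippet, here is a minimal sketch of loading the Q4_K_M file with llama-cpp-python. The local file path is an assumption (download the GGUF from the official model card first), and `n_gpu_layers=-1` offloads every layer to the GPU; lower that value if an 8GB Jetson runs out of memory.

```python
from pathlib import Path

# Assumed local filename; fetch the real file from NVIDIA's GGUF model card.
MODEL_PATH = Path("models/nvidia-nemotron-3-nano-4b-Q4_K_M.gguf")

def llama_kwargs(n_ctx: int = 8192, n_gpu_layers: int = -1) -> dict:
    """Keyword arguments for llama_cpp.Llama; -1 offloads all layers to GPU."""
    return {
        "model_path": str(MODEL_PATH),
        "n_ctx": n_ctx,
        "n_gpu_layers": n_gpu_layers,
    }

if MODEL_PATH.exists():
    from llama_cpp import Llama  # pip install llama-cpp-python (CUDA build)

    llm = Llama(**llama_kwargs())
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Summarize what you can do on-device."}],
        max_tokens=128,
    )
    print(out["choices"][0]["message"]["content"])
```

On a memory-constrained Orin Nano, shrinking `n_ctx` is usually the first knob to turn, since context length trades directly against resident memory.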
Use BF16 when you want the reference-quality checkpoint
The BF16 checkpoint is the baseline NVIDIA uses for published benchmark numbers and compatibility claims. Choose it when you have more GPU memory available and want the least quantization-related behavior change.
The BF16 model card also exposes the model’s 262,144-token context length and notes support for NeMo 25.07 integration.
Use FP8 when your stack supports it and throughput matters
NVIDIA says the FP8 model uses post-training quantization via ModelOpt, calibrated on a 1K-sample subset from the SFT data. In this version:
- all 4 self-attention layers stay in BF16
- the 4 Mamba layers preceding those attention layers stay in BF16
- all Conv1D inside Mamba layers stay in BF16
- weights, activations, and KV cache are quantized to FP8
According to the release blog, this delivers up to 1.8× latency/throughput improvement over BF16 on DGX Spark and Jetson Thor.
Installation and setup paths
The release materials confirm ecosystem support, but they do not provide one unified installation block across all runtimes. Use the official model pages and runtime docs for exact commands:
- Official release post
- BF16 model card
- GGUF model card
- Jetson AI Lab models page
- TensorRT-LLM releases
Two compatibility details matter immediately:
| Component | Requirement from published docs |
|---|---|
| vLLM | 0.15.1 or newer |
| Jetson support | Jetson AI Lab lists day-0 llama.cpp support on Jetson Orin and Thor |
The release materials do not include a single verified, copy-paste install flow, so refer to the linked runtime documentation for exact commands on your platform.
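The vLLM minimum is easy to get wrong when an older version is already installed system-wide. A quick version gate like the one below, run at startup, fails fast instead of producing confusing runtime errors; the helper name is illustrative.

```python
def meets_minimum(installed: str, minimum: str = "0.15.1") -> bool:
    """True when an installed vLLM version satisfies the documented 0.15.1 floor."""
    parse = lambda v: tuple(int(part) for part in v.split("."))
    return parse(installed) >= parse(minimum)

print(meets_minimum("0.14.2"))  # False: below the documented minimum
print(meets_minimum("0.16.0"))  # True
```

In a real deployment you would feed it `importlib.metadata.version("vllm")` and raise if the check fails.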
Running on Jetson with llama.cpp
For Jetson Orin Nano and similar edge devices, llama.cpp with the GGUF model is the practical path. That combination is the one NVIDIA highlights for embedded throughput, and Jetson AI Lab explicitly notes day-0 support.
Use this path when:
- your system has tight memory limits
- startup speed matters
- you want a small downloadable artifact
- your app is interactive and local-first
The main tradeoff is benchmark drift relative to BF16. NVIDIA’s own GGUF card shows the quantized model preserves long-context performance well on RULER (128k), where scores are essentially unchanged, but some instruction and agent-style metrics move more noticeably.
Published BF16 vs FP8 vs GGUF results
These reasoning-off numbers come directly from NVIDIA’s published cards:
| Benchmark | BF16 | FP8 | GGUF Q4_K_M |
|---|---|---|---|
| IFBench-Prompt | 43.2 | 43.88 | 46.9 |
| IFBench-Instruction | 44.2 | 44.78 | 49.6 |
| Orak | 22.9 | 20.72 | 19.8 |
| IFEval-Instruction | 88.0 | 87.53 | 83.9 |
| RULER (128k) | 91.1 | 91.0 | 91.2 |
That pattern matters if your application depends on tool use, instruction precision, or long-context retrieval. For context-heavy workflows, also review Context Windows Explained: Why Your AI Forgets and What Tokenization Means for Your Prompts.
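When deciding whether the quantized variants are acceptable for your workload, it helps to look at deltas against the BF16 baseline rather than raw scores. The sketch below just recomputes those deltas from the published table above; the score dictionary copies NVIDIA's numbers verbatim.

```python
# Reasoning-off scores copied from the published benchmark table above.
SCORES = {
    "IFBench-Prompt":      {"BF16": 43.2, "FP8": 43.88, "GGUF": 46.9},
    "IFBench-Instruction": {"BF16": 44.2, "FP8": 44.78, "GGUF": 49.6},
    "Orak":                {"BF16": 22.9, "FP8": 20.72, "GGUF": 19.8},
    "IFEval-Instruction":  {"BF16": 88.0, "FP8": 87.53, "GGUF": 83.9},
    "RULER (128k)":        {"BF16": 91.1, "FP8": 91.0,  "GGUF": 91.2},
}

def quantization_delta(benchmark: str, variant: str) -> float:
    """Score shift of a quantized variant relative to the BF16 baseline."""
    row = SCORES[benchmark]
    return round(row[variant] - row["BF16"], 2)

for name in SCORES:
    print(f"{name}: GGUF delta {quantization_delta(name, 'GGUF'):+.2f}")
```

The output makes the pattern in the prose concrete: long-context retrieval (RULER) barely moves, while IFEval-Instruction drops about 4 points under Q4_K_M.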
Running on RTX with BF16, GGUF, or vLLM
On GeForce RTX, you have more flexibility.
Use GGUF Q4_K_M with llama.cpp if you want the simplest local inference path and low VRAM footprint. NVIDIA says the model achieved the lowest VRAM footprint and lowest time-to-first-token under high input-sequence-length (ISL) settings on an RTX 4070 using llama.cpp and Q4_K_M models.
Use BF16 with Transformers or vLLM if you need the reference checkpoint and stronger integration with Python application stacks. The model card explicitly supports both.
Use vLLM only if you are on version 0.15.1 or newer. That minimum version is called out in the BF16 model card.
The source materials confirm support, but they do not yet include a verified runnable code block for Transformers or vLLM. Refer to the BF16 model card for implementation details and runtime-specific examples as they are updated.
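In the absence of an official snippet, a standard Hugging Face Transformers loading pattern is the reasonable starting point. This is a hedged sketch, not a verified example: the repo id below is an assumption based on the published variant name, and you should confirm it (and any `trust_remote_code` requirement for the `nemotron_h` architecture) against the BF16 model card.

```python
# Hypothetical Hugging Face repo id; verify the exact name on the model card.
REPO = "nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16"

def generation_config(max_new_tokens: int = 256) -> dict:
    """Keyword arguments passed to model.generate()."""
    return {"max_new_tokens": max_new_tokens, "do_sample": False}

try:
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    HAVE_GPU_STACK = torch.cuda.is_available()
except ImportError:
    HAVE_GPU_STACK = False

if HAVE_GPU_STACK:
    tok = AutoTokenizer.from_pretrained(REPO, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        REPO, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
    )
    inputs = tok("Hello from an RTX box.", return_tensors="pt").to(model.device)
    print(tok.decode(model.generate(**inputs, **generation_config())[0]))
```

For vLLM, the equivalent starting point is its OpenAI-compatible server pointed at the same repo id, subject to the 0.15.1+ requirement above.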
Reasoning-on vs reasoning-off
One of the distinctive features in this release is the model’s reasoning-on / reasoning-off behavior.
NVIDIA says the model can either:
- emit an internal reasoning trace first, or
- skip it for lower-latency responses
That gives you a concrete tuning lever:
| Mode | Best for | Tradeoff |
|---|---|---|
| Reasoning-off | Fast interactive chat, game agents, low-latency local assistants | Lower performance on some reasoning-heavy tasks |
| Reasoning-on | Math, harder planning tasks, structured multi-step work | More latency and more output tokens |
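The model card documents the exact control mechanism; the sketch below assumes a system-prompt toggle similar to earlier Nemotron releases (`/think` vs `/no_think`), which is an assumption you must verify against the card before relying on it. The point is the application-side shape: reasoning mode becomes one boolean in your message builder.

```python
# ASSUMPTION: toggle directive follows the earlier-Nemotron "/think" /
# "/no_think" system-prompt convention. Confirm the exact string in the
# model card before shipping.
def build_messages(user_prompt: str, reasoning: bool) -> list[dict]:
    """Build a chat message list with the reasoning mode set via system prompt."""
    system = "/think" if reasoning else "/no_think"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_prompt},
    ]

fast = build_messages("What is 12 * 7?", reasoning=False)
careful = build_messages("Plan a 3-step refactor.", reasoning=True)
print(fast[0]["content"], "|", careful[0]["content"])
```

Whatever the real directive turns out to be, isolating it in one function means a model-card correction is a one-line change.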
NVIDIA’s BF16 card shows clear gains in reasoning-on mode for several benchmarks:
| Benchmark | Reasoning-off | Reasoning-on |
|---|---|---|
| IFEval-Prompt | 82.8 | 87.9 |
| IFEval-Instruction | 88.0 | 92.0 |
| Tau2-Airline | 28.0 | 33.3 |
| Tau2-Retail | 34.8 | 39.8 |
| Tau2-Telecom | 24.9 | 33.0 |
If your application depends on structured outputs or tool-calling loops, the release notes are relevant here. NVIDIA says post-training included reinforcement stages targeting instruction following, structured output, and multi-turn conversational tool use. For application-side control, pair that with strong prompt design and output validation, using patterns from Structured Output from LLMs: JSON Mode Explained and Prompt Engineering Guide: How to Write Better AI Prompts.
Limits and tradeoffs to plan for
This model is compact, but it is still tuned around NVIDIA’s ecosystem and deployment story.
A few constraints stand out from the published materials:
- Runtime setup varies by backend, and the release docs do not provide one canonical install flow for every framework.
- vLLM requires 0.15.1+.
- Quantized formats change benchmark behavior, especially outside long-context retrieval.
- The blog's competitor comparison table was not fully reproducible from the accessible excerpt, so rely on NVIDIA's direct benchmark cards rather than paraphrased comparisons.
- Reasoning-on mode adds token and latency overhead, which matters on embedded devices.
The model is also targeted toward specific workloads: gaming NPCs, local assistants, IoT automation, robotics, and tool-calling conversational agents. If your use case is retrieval-heavy or multi-step agent orchestration, connect the model choice to your app design, not just the raw benchmark numbers. These two guides are useful reference points: How to Build a RAG Application (Step by Step) and AI Agent Frameworks Compared: LangChain vs CrewAI vs LlamaIndex.
When to choose BF16, FP8, or GGUF
Use this quick rule:
| Choose this | When you need |
|---|---|
| GGUF Q4_K_M | Small footprint, Jetson deployment, llama.cpp compatibility, fastest path to local inference |
| BF16 | Reference-quality checkpoint, maximum fidelity to the published baseline, larger GPU memory budget |
| FP8 | Better throughput on supported NVIDIA stacks, especially DGX Spark or Jetson Thor |
For most developers starting today, the most practical rollout path is:
- Start with GGUF Q4_K_M on llama.cpp for Jetson or quick RTX validation.
- Move to BF16 if your app depends on tighter instruction accuracy or less quantization variance.
- Test FP8 on supported NVIDIA systems when throughput is the bottleneck.
Download the variant that matches your hardware first, then validate your actual prompt set in both reasoning-off and reasoning-on modes before wiring it into production agents.
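That final validation step can be sketched as a small A/B harness: run every prompt in your set through both reasoning modes and compare the outputs side by side. `generate` here is a stand-in for whichever backend call you chose (llama.cpp, vLLM, or Transformers); the stub lambda exists only so the harness structure runs without a GPU.

```python
from typing import Callable

def validate(prompts: list[str], generate: Callable[[str, bool], str]) -> dict:
    """Collect reasoning-off and reasoning-on outputs for each prompt."""
    return {
        p: {"reasoning_off": generate(p, False), "reasoning_on": generate(p, True)}
        for p in prompts
    }

# Stub backend standing in for a real llama.cpp / vLLM / Transformers call.
stub = lambda prompt, reasoning: f"[{'on' if reasoning else 'off'}] {prompt}"

report = validate(["Book a flight", "Summarize this log"], stub)
print(report["Book a flight"]["reasoning_on"])  # [on] Book a flight
```

Swap the stub for your real inference call, then eyeball latency, token counts, and output quality per mode before committing one to production.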