How to Run NVIDIA Nemotron 3 Nano 4B Locally on Jetson and RTX
Learn to deploy NVIDIA's Nemotron 3 Nano 4B locally with BF16, FP8, or GGUF on Jetson, RTX, vLLM, TensorRT-LLM, and llama.cpp.
NVIDIA’s Nemotron 3 Nano 4B gives you a compact local model for Jetson and RTX systems, with launch-day availability in BF16, FP8, and GGUF Q4_K_M formats. The March 17 release focused on edge deployment, and the official announcement plus model cards provide the core setup details you need to choose a runtime, pick the right quantization, and run it on local NVIDIA hardware.
This model is aimed at on-device conversational agents, tool use, gaming agents, voice assistants, and embedded AI. If you are still deciding whether a fully local model fits your product, it helps to compare that architecture choice with AI Agents vs Chatbots: What’s the Difference? and Fine-Tuning vs RAG: When to Use Each Approach.
What you can run, and where
Nemotron 3 Nano 4B is a 3.97B-parameter Mamba2-Transformer hybrid model. NVIDIA says it is its first model specifically optimized for on-device deployment.
The published variants are:
| Variant | Format | Notes |
|---|---|---|
| NVIDIA-Nemotron-3-Nano-4B-BF16 | BF16 | Full-precision deployment option |
| NVIDIA-Nemotron-3-Nano-4B-FP8 | FP8 | Quantized with ModelOpt, mixed with selected BF16 layers |
| NVIDIA-Nemotron-3-Nano-4B-GGUF | GGUF Q4_K_M | 4-bit llama.cpp-friendly variant |
The model card lists support for Transformers, vLLM, TensorRT-LLM, and llama.cpp. It also lists supported hardware including A10G, H100-80GB, A100, and GeForce RTX, while the release blog additionally positions the model for Jetson Orin Nano, Jetson Thor, and DGX Spark.
For local deployment, the practical split is simple:
| Hardware | Best starting format | Best starting runtime |
|---|---|---|
| Jetson Orin Nano 8GB | GGUF Q4_K_M | llama.cpp |
| Jetson Thor | FP8 or GGUF | TensorRT-LLM or llama.cpp |
| GeForce RTX | GGUF Q4_K_M or BF16 | llama.cpp, Transformers, or vLLM |
| Larger datacenter GPU | BF16 or FP8 | vLLM or TensorRT-LLM |
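The practical split above can be encoded as a small lookup helper, which is handy if you script deployments across a fleet of mixed devices. The device keys and pairings below simply restate the table; the function name and key spellings are illustrative, not part of any NVIDIA API.

```python
# Illustrative helper encoding the hardware table above. Device keys are
# made-up identifiers; the (format, runtime) pairs come from the table.
def starting_config(hardware: str) -> tuple[str, str]:
    """Map a target device to a (model format, runtime) starting point."""
    table = {
        "jetson-orin-nano-8gb": ("GGUF Q4_K_M", "llama.cpp"),
        "jetson-thor": ("FP8", "TensorRT-LLM"),
        "geforce-rtx": ("GGUF Q4_K_M", "llama.cpp"),
        "datacenter": ("BF16", "vLLM"),
    }
    return table[hardware]

print(starting_config("jetson-orin-nano-8gb"))  # ('GGUF Q4_K_M', 'llama.cpp')
```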
If you want a broader local-model primer before choosing a serving stack, see How to Run LLMs Locally on Your Machine.
Start with the right model variant
Your first decision is memory budget versus quality versus runtime support.
Use GGUF Q4_K_M for Jetson and lightweight local setups
The GGUF release is the easiest fit for constrained devices. NVIDIA’s GGUF model card lists:
- Quantization: Q4_K_M
- File size: 2.84 GB
- Architecture: nemotron_h
The release blog says this variant reaches 18 tokens/s on Jetson Orin Nano 8GB, and delivers up to 2× higher throughput than Nemotron Nano 9B v2 on that device.
This is the safest starting point if you are targeting embedded inference or need a quick local proof of concept on Jetson.
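Since the release materials don't ship a runnable snippet, here is a minimal sketch of loading the Q4_K_M file with llama-cpp-python. The local file path is an assumption (download the GGUF from the official model card first), and `n_gpu_layers=-1` offloads every layer to the GPU; lower that value if an 8GB Jetson runs out of memory.

```python
from pathlib import Path

# Assumed local filename; fetch the real file from NVIDIA's GGUF model card.
MODEL_PATH = Path("models/nvidia-nemotron-3-nano-4b-Q4_K_M.gguf")

def llama_kwargs(n_ctx: int = 8192, n_gpu_layers: int = -1) -> dict:
    """Keyword arguments for llama_cpp.Llama; -1 offloads all layers to GPU."""
    return {
        "model_path": str(MODEL_PATH),
        "n_ctx": n_ctx,
        "n_gpu_layers": n_gpu_layers,
    }

if MODEL_PATH.exists():
    from llama_cpp import Llama  # pip install llama-cpp-python (CUDA build)

    llm = Llama(**llama_kwargs())
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Summarize what you can do on-device."}],
        max_tokens=128,
    )
    print(out["choices"][0]["message"]["content"])
```

On a memory-constrained Orin Nano, shrinking `n_ctx` is usually the first knob to turn, since context length trades directly against resident memory.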
Use BF16 when you want the reference-quality checkpoint
The BF16 checkpoint is the baseline NVIDIA uses for published benchmark numbers and compatibility claims. Choose it when you have more GPU memory available and want the least quantization-related behavior change.
The BF16 model card also exposes the model’s 262,144-token context length and notes support for NeMo 25.07 integration.
Use FP8 when your stack supports it and throughput matters
NVIDIA says the FP8 model uses post-training quantization via ModelOpt, calibrated on a 1K-sample subset from the SFT data. In this version:
- all 4 self-attention layers stay in BF16
- the 4 Mamba layers preceding those attention layers stay in BF16
- all Conv1D inside Mamba layers stay in BF16
- weights, activations, and KV cache are quantized to FP8
According to the release blog, this delivers up to 1.8× latency/throughput improvement over BF16 on DGX Spark and Jetson Thor.
Installation and setup paths
The release materials confirm ecosystem support, but they do not provide one unified installation block across all runtimes. Use the official model pages and runtime docs for exact commands:
- Official release post
- BF16 model card
- GGUF model card
- Jetson AI Lab models page
- TensorRT-LLM releases
Two compatibility details matter immediately:
| Component | Requirement from published docs |
|---|---|
| vLLM | 0.15.1 or newer |
| Jetson support | Jetson AI Lab lists day-0 llama.cpp support on Jetson Orin and Thor |
The release materials do not include a single verified, copy-paste install flow, so refer to the linked runtime documentation for exact commands on your platform.
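The vLLM minimum is easy to get wrong when an older version is already installed system-wide. A quick version gate like the one below, run at startup, fails fast instead of producing confusing runtime errors; the helper name is illustrative.

```python
def meets_minimum(installed: str, minimum: str = "0.15.1") -> bool:
    """True when an installed vLLM version satisfies the documented 0.15.1 floor."""
    parse = lambda v: tuple(int(part) for part in v.split("."))
    return parse(installed) >= parse(minimum)

print(meets_minimum("0.14.2"))  # False: below the documented minimum
print(meets_minimum("0.16.0"))  # True
```

In a real deployment you would feed it `importlib.metadata.version("vllm")` and raise if the check fails.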
Running on Jetson with llama.cpp
For Jetson Orin Nano and similar edge devices, llama.cpp with the GGUF model is the practical path. That combination is the one NVIDIA highlights for embedded throughput, and Jetson AI Lab explicitly notes day-0 support.
Use this path when:
- your system has tight memory limits
- startup speed matters
- you want a small downloadable artifact
- your app is interactive and local-first
The main tradeoff is benchmark drift relative to BF16. NVIDIA’s own GGUF card shows the quantized model preserves long-context performance well on RULER (128k), where scores are essentially unchanged, but some instruction and agent-style metrics move more noticeably.
Published BF16 vs FP8 vs GGUF results
These reasoning-off numbers come directly from NVIDIA’s published cards:
| Benchmark | BF16 | FP8 | GGUF Q4_K_M |
|---|---|---|---|
| IFBench-Prompt | 43.2 | 43.88 | 46.9 |
| IFBench-Instruction | 44.2 | 44.78 | 49.6 |
| Orak | 22.9 | 20.72 | 19.8 |
| IFEval-Instruction | 88.0 | 87.53 | 83.9 |
| RULER (128k) | 91.1 | 91.0 | 91.2 |
That pattern matters if your application depends on tool use, instruction precision, or long-context retrieval. For context-heavy workflows, also review Context Windows Explained: Why Your AI Forgets and What Tokenization Means for Your Prompts.
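When deciding whether the quantized variants are acceptable for your workload, it helps to look at deltas against the BF16 baseline rather than raw scores. The sketch below just recomputes those deltas from the published table above; the score dictionary copies NVIDIA's numbers verbatim.

```python
# Reasoning-off scores copied from the published benchmark table above.
SCORES = {
    "IFBench-Prompt":      {"BF16": 43.2, "FP8": 43.88, "GGUF": 46.9},
    "IFBench-Instruction": {"BF16": 44.2, "FP8": 44.78, "GGUF": 49.6},
    "Orak":                {"BF16": 22.9, "FP8": 20.72, "GGUF": 19.8},
    "IFEval-Instruction":  {"BF16": 88.0, "FP8": 87.53, "GGUF": 83.9},
    "RULER (128k)":        {"BF16": 91.1, "FP8": 91.0,  "GGUF": 91.2},
}

def quantization_delta(benchmark: str, variant: str) -> float:
    """Score shift of a quantized variant relative to the BF16 baseline."""
    row = SCORES[benchmark]
    return round(row[variant] - row["BF16"], 2)

for name in SCORES:
    print(f"{name}: GGUF delta {quantization_delta(name, 'GGUF'):+.2f}")
```

The output makes the pattern in the prose concrete: long-context retrieval (RULER) barely moves, while IFEval-Instruction drops about 4 points under Q4_K_M.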
Running on RTX with BF16, GGUF, or vLLM
On GeForce RTX, you have more flexibility.
Use GGUF Q4_K_M with llama.cpp if you want the simplest local inference path and low VRAM footprint. NVIDIA says the model achieved the lowest VRAM footprint and lowest time-to-first-token under high input-sequence-length (ISL) settings on an RTX 4070 using llama.cpp and Q4_K_M models.
Use BF16 with Transformers or vLLM if you need the reference checkpoint and stronger integration with Python application stacks. The model card explicitly supports both.
Use vLLM only if you are on version 0.15.1 or newer. That minimum version is called out in the BF16 model card.
The source materials confirm support, but they do not yet include a verified runnable code block for Transformers or vLLM. Refer to the BF16 model card for implementation details and runtime-specific examples as they are updated.
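In the absence of an official snippet, a standard Hugging Face Transformers loading pattern is the reasonable starting point. This is a hedged sketch, not a verified example: the repo id below is an assumption based on the published variant name, and you should confirm it (and any `trust_remote_code` requirement for the `nemotron_h` architecture) against the BF16 model card.

```python
# Hypothetical Hugging Face repo id; verify the exact name on the model card.
REPO = "nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16"

def generation_config(max_new_tokens: int = 256) -> dict:
    """Keyword arguments passed to model.generate()."""
    return {"max_new_tokens": max_new_tokens, "do_sample": False}

try:
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    HAVE_GPU_STACK = torch.cuda.is_available()
except ImportError:
    HAVE_GPU_STACK = False

if HAVE_GPU_STACK:
    tok = AutoTokenizer.from_pretrained(REPO, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        REPO, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
    )
    inputs = tok("Hello from an RTX box.", return_tensors="pt").to(model.device)
    print(tok.decode(model.generate(**inputs, **generation_config())[0]))
```

For vLLM, the equivalent starting point is its OpenAI-compatible server pointed at the same repo id, subject to the 0.15.1+ requirement above.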
Reasoning-on vs reasoning-off
One of the distinctive features in this release is the model’s reasoning-on / reasoning-off behavior.
NVIDIA says the model can either:
- emit an internal reasoning trace first, or
- skip it for lower-latency responses
That gives you a concrete tuning lever:
| Mode | Best for | Tradeoff |
|---|---|---|
| Reasoning-off | Fast interactive chat, game agents, low-latency local assistants | Lower performance on some reasoning-heavy tasks |
| Reasoning-on | Math, harder planning tasks, structured multi-step work | More latency and more output tokens |
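The model card documents the exact control mechanism; the sketch below assumes a system-prompt toggle similar to earlier Nemotron releases (`/think` vs `/no_think`), which is an assumption you must verify against the card before relying on it. The point is the application-side shape: reasoning mode becomes one boolean in your message builder.

```python
# ASSUMPTION: toggle directive follows the earlier-Nemotron "/think" /
# "/no_think" system-prompt convention. Confirm the exact string in the
# model card before shipping.
def build_messages(user_prompt: str, reasoning: bool) -> list[dict]:
    """Build a chat message list with the reasoning mode set via system prompt."""
    system = "/think" if reasoning else "/no_think"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_prompt},
    ]

fast = build_messages("What is 12 * 7?", reasoning=False)
careful = build_messages("Plan a 3-step refactor.", reasoning=True)
print(fast[0]["content"], "|", careful[0]["content"])
```

Whatever the real directive turns out to be, isolating it in one function means a model-card correction is a one-line change.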
NVIDIA’s BF16 card shows clear gains in reasoning-on mode for several benchmarks:
| Benchmark | Reasoning-off | Reasoning-on |
|---|---|---|
| IFEval-Prompt | 82.8 | 87.9 |
| IFEval-Instruction | 88.0 | 92.0 |
| Tau2-Airline | 28.0 | 33.3 |
| Tau2-Retail | 34.8 | 39.8 |
| Tau2-Telecom | 24.9 | 33.0 |
If your application depends on structured outputs or tool-calling loops, the release notes are relevant here. NVIDIA says post-training included reinforcement stages targeting instruction following, structured output, and multi-turn conversational tool use. For application-side control, pair that with strong prompt design and output validation, using patterns from Structured Output from LLMs: JSON Mode Explained and Prompt Engineering Guide: How to Write Better AI Prompts.
Limits and tradeoffs to plan for
This model is compact, but it is still tuned around NVIDIA’s ecosystem and deployment story.
A few constraints stand out from the published materials:
- Runtime setup varies by backend, and the release docs do not provide one canonical install flow for every framework.
- vLLM requires 0.15.1+.
- Quantized formats change benchmark behavior, especially outside long-context retrieval.
- The blog's competitor comparison table was not fully reproducible from the accessible excerpt, so rely on NVIDIA's direct benchmark cards rather than paraphrased comparisons.
- Reasoning-on mode adds token and latency overhead, which matters on embedded devices.
The model is also targeted toward specific workloads: gaming NPCs, local assistants, IoT automation, robotics, and tool-calling conversational agents. If your use case is retrieval-heavy or multi-step agent orchestration, connect the model choice to your app design, not just the raw benchmark numbers. These two guides are useful reference points: How to Build a RAG Application (Step by Step) and AI Agent Frameworks Compared: LangChain vs CrewAI vs LlamaIndex.
When to choose BF16, FP8, or GGUF
Use this quick rule:
| Choose this | When you need |
|---|---|
| GGUF Q4_K_M | Small footprint, Jetson deployment, llama.cpp compatibility, fastest path to local inference |
| BF16 | Reference-quality checkpoint, maximum fidelity to the published baseline, larger GPU memory budget |
| FP8 | Better throughput on supported NVIDIA stacks, especially DGX Spark or Jetson Thor |
For most developers starting today, the most practical rollout path is:
- Start with GGUF Q4_K_M on llama.cpp for Jetson or quick RTX validation.
- Move to BF16 if your app depends on tighter instruction accuracy or less quantization variance.
- Test FP8 on supported NVIDIA systems when throughput is the bottleneck.
Download the variant that matches your hardware first, then validate your actual prompt set in both reasoning-off and reasoning-on modes before wiring it into production agents.
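That final validation step can be sketched as a small A/B harness: run every prompt in your set through both reasoning modes and compare the outputs side by side. `generate` here is a stand-in for whichever backend call you chose (llama.cpp, vLLM, or Transformers); the stub lambda exists only so the harness structure runs without a GPU.

```python
from typing import Callable

def validate(prompts: list[str], generate: Callable[[str, bool], str]) -> dict:
    """Collect reasoning-off and reasoning-on outputs for each prompt."""
    return {
        p: {"reasoning_off": generate(p, False), "reasoning_on": generate(p, True)}
        for p in prompts
    }

# Stub backend standing in for a real llama.cpp / vLLM / Transformers call.
stub = lambda prompt, reasoning: f"[{'on' if reasoning else 'off'}] {prompt}"

report = validate(["Book a flight", "Summarize this log"], stub)
print(report["Book a flight"]["reasoning_on"])  # [on] Book a flight
```

Swap the stub for your real inference call, then eyeball latency, token counts, and output quality per mode before committing one to production.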