AI Engineering · 8 min read

How to Run NVIDIA Nemotron 3 Nano 4B Locally on Jetson and RTX

Learn to deploy NVIDIA's Nemotron 3 Nano 4B locally with BF16, FP8, or GGUF on Jetson, RTX, vLLM, TensorRT-LLM, and llama.cpp.

NVIDIA’s Nemotron 3 Nano 4B gives you a compact local model for Jetson and RTX systems, with launch-day availability in BF16, FP8, and GGUF Q4_K_M formats. The March 17 release focused on edge deployment, and the official announcement plus model cards provide the core setup details you need to choose a runtime, pick the right quantization, and run it on local NVIDIA hardware.

This model is aimed at on-device conversational agents, tool use, gaming agents, voice assistants, and embedded AI. If you are still deciding whether a fully local model fits your product, it helps to compare that architecture choice with AI Agents vs Chatbots: What’s the Difference? and Fine-Tuning vs RAG: When to Use Each Approach.

What you can run, and where

Nemotron 3 Nano 4B is a 3.97B-parameter Mamba2-Transformer hybrid model. NVIDIA says it is the company's first model specifically optimized for on-device deployment.

The published variants are:

| Variant | Format | Notes |
| --- | --- | --- |
| NVIDIA-Nemotron-3-Nano-4B-BF16 | BF16 | Full-precision deployment option |
| NVIDIA-Nemotron-3-Nano-4B-FP8 | FP8 | Quantized with ModelOpt, mixed with selected BF16 layers |
| NVIDIA-Nemotron-3-Nano-4B-GGUF | GGUF Q4_K_M | 4-bit llama.cpp-friendly variant |

The model card lists support for Transformers, vLLM, TensorRT-LLM, and llama.cpp. It also lists supported hardware including A10G, H100-80GB, A100, and GeForce RTX, while the release blog additionally positions the model for Jetson Orin Nano, Jetson Thor, and DGX Spark.

For local deployment, the practical split is simple:

| Hardware | Best starting format | Best starting runtime |
| --- | --- | --- |
| Jetson Orin Nano 8GB | GGUF Q4_K_M | llama.cpp |
| Jetson Thor | FP8 or GGUF | TensorRT-LLM or llama.cpp |
| GeForce RTX | GGUF Q4_K_M or BF16 | llama.cpp, Transformers, or vLLM |
| Larger datacenter GPU | BF16 or FP8 | vLLM or TensorRT-LLM |
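The mapping above can be sketched as a small helper that picks a starting format and runtime from a hardware class. The class names and pairings below simply mirror the table; they are illustrative labels, not part of any NVIDIA tooling.

```python
# Illustrative helper encoding the hardware table above. The keys are
# informal hardware-class labels, not identifiers from any NVIDIA API.
RECOMMENDATIONS = {
    "jetson-orin-nano-8gb": ("GGUF Q4_K_M", "llama.cpp"),
    "jetson-thor": ("FP8 or GGUF", "TensorRT-LLM or llama.cpp"),
    "geforce-rtx": ("GGUF Q4_K_M or BF16", "llama.cpp, Transformers, or vLLM"),
    "datacenter-gpu": ("BF16 or FP8", "vLLM or TensorRT-LLM"),
}

def starting_stack(hardware: str) -> tuple[str, str]:
    """Return a (format, runtime) starting point for a known hardware class."""
    key = hardware.strip().lower()
    if key not in RECOMMENDATIONS:
        raise ValueError(f"unknown hardware class: {hardware!r}")
    return RECOMMENDATIONS[key]
```

A lookup like this is only a starting point; validate the choice against your actual memory budget and latency targets.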

If you want a broader local-model primer before choosing a serving stack, see How to Run LLMs Locally on Your Machine.

Start with the right model variant

Your first decision is memory budget versus quality versus runtime support.

Use GGUF Q4_K_M for Jetson and lightweight local setups

The GGUF release is the easiest fit for constrained devices. NVIDIA’s GGUF model card lists:

  • Quantization: Q4_K_M
  • File size: 2.84 GB
  • Architecture: nemotron_h

The release blog says this variant reaches 18 tokens/s on Jetson Orin Nano 8GB, and delivers up to 2× higher throughput than Nemotron Nano 9B v2 on that device.

This is the safest starting point if you are targeting embedded inference or need a quick local proof of concept on Jetson.
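To put the published 18 tokens/s figure in perspective, a quick back-of-the-envelope calculation shows what it means for response latency. The numbers below are arithmetic on that single published decode rate, ignoring prefill time, not measurements:

```python
# Rough decode-time math for the published Jetson Orin Nano 8GB figure.
# 18 tokens/s is NVIDIA's reported rate; prefill time is ignored here.
DECODE_RATE = 18.0  # tokens per second (published figure)

def seconds_to_generate(tokens: int, rate: float = DECODE_RATE) -> float:
    """Approximate wall-clock seconds to decode `tokens` output tokens."""
    return tokens / rate

# A short ~150-token chat reply works out to roughly 8.3 s of decode time,
# which is workable for local assistants but slow for long-form output.
```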

Use BF16 when you want the reference-quality checkpoint

The BF16 checkpoint is the baseline NVIDIA uses for published benchmark numbers and compatibility claims. Choose it when you have more GPU memory available and want the least quantization-related behavior change.

The BF16 model card also exposes the model’s 262,144-token context length and notes support for NeMo 25.07 integration.

Use FP8 when your stack supports it and throughput matters

NVIDIA says the FP8 model uses post-training quantization via ModelOpt, calibrated on a 1K-sample subset from the SFT data. In this version:

  • all 4 self-attention layers stay in BF16
  • the 4 Mamba layers preceding those attention layers stay in BF16
  • all Conv1D inside Mamba layers stay in BF16
  • weights, activations, and KV cache are quantized to FP8

According to the release blog, this delivers up to 1.8× latency/throughput improvement over BF16 on DGX Spark and Jetson Thor.
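The mixed-precision recipe above can be summarized as a small precision map. This is a descriptive sketch of the published scheme only, not a ModelOpt configuration format:

```python
# Descriptive summary of the published FP8 recipe. This is NOT a ModelOpt
# config file; it just restates the bullets above in code form.
PRECISION_SCHEME = {
    "self_attention_layers": "BF16",          # all 4 stay in BF16
    "mamba_layers_before_attention": "BF16",  # the 4 preceding Mamba layers
    "mamba_conv1d": "BF16",                   # all Conv1D inside Mamba layers
    "weights": "FP8",
    "activations": "FP8",
    "kv_cache": "FP8",
}

def bf16_exceptions(scheme: dict) -> list[str]:
    """Components kept at BF16 despite the FP8 quantization pass."""
    return [k for k, v in scheme.items() if v == "BF16"]
```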

Installation and setup paths

The release materials confirm ecosystem support, but they do not provide one unified installation block across all runtimes. Use the official model pages and runtime docs for exact commands.

Two compatibility details matter immediately:

| Component | Requirement from published docs |
| --- | --- |
| vLLM | 0.15.1 or newer |
| Jetson support | Jetson AI Lab lists day-0 llama.cpp support on Jetson Orin and Thor |

Because the published materials do not include verified install commands, refer to the linked runtime documentation for implementation details.

Running on Jetson with llama.cpp

For Jetson Orin Nano and similar edge devices, llama.cpp with the GGUF model is the practical path. That combination is the one NVIDIA highlights for embedded throughput, and Jetson AI Lab explicitly notes day-0 support.

Use this path when:

  • your system has tight memory limits
  • startup speed matters
  • you want a small downloadable artifact
  • your app is interactive and local-first

The main tradeoff is benchmark drift relative to BF16. NVIDIA’s own GGUF card shows the quantized model preserves long-context performance well on RULER (128k), where scores are essentially unchanged, but some instruction and agent-style metrics move more noticeably.

Published BF16 vs FP8 vs GGUF results

These reasoning-off numbers come directly from NVIDIA’s published cards:

| Benchmark | BF16 | FP8 | GGUF Q4_K_M |
| --- | --- | --- | --- |
| IFBench-Prompt | 43.2 | 43.88 | 46.9 |
| IFBench-Instruction | 44.2 | 44.78 | 49.6 |
| Orak | 22.9 | 20.72 | 19.8 |
| IFEval-Instruction | 88.0 | 87.53 | 83.9 |
| RULER (128k) | 91.1 | 91.0 | 91.2 |
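One way to read this table is as relative drift from the BF16 baseline. The snippet below just recomputes percentage changes from the published reasoning-off scores, so you can see that long-context retrieval barely moves while instruction and agent metrics shift several points:

```python
# Percentage change of each quantized variant vs. the BF16 baseline,
# using only the published reasoning-off scores from the table above.
SCORES = {
    # benchmark: (BF16, FP8, GGUF Q4_K_M)
    "IFBench-Prompt": (43.2, 43.88, 46.9),
    "IFBench-Instruction": (44.2, 44.78, 49.6),
    "Orak": (22.9, 20.72, 19.8),
    "IFEval-Instruction": (88.0, 87.53, 83.9),
    "RULER (128k)": (91.1, 91.0, 91.2),
}

def drift_vs_bf16(benchmark: str) -> tuple[float, float]:
    """Return (FP8, GGUF) percentage change relative to BF16."""
    bf16, fp8, gguf = SCORES[benchmark]

    def pct(x: float) -> float:
        return round(100.0 * (x - bf16) / bf16, 1)

    return pct(fp8), pct(gguf)
```

For example, RULER (128k) stays within about 0.1% of baseline under both quantizations, while Orak drops roughly 13.5% under Q4_K_M.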

That pattern matters if your application depends on tool use, instruction precision, or long-context retrieval. For context-heavy workflows, also review Context Windows Explained: Why Your AI Forgets and What Tokenization Means for Your Prompts.

Running on RTX with BF16, GGUF, or vLLM

On GeForce RTX, you have more flexibility.

Use GGUF Q4_K_M with llama.cpp if you want the simplest local inference path and low VRAM footprint. NVIDIA says the model achieved the lowest VRAM footprint and lowest time-to-first-token (TTFT) under high input-sequence-length (ISL) settings on an RTX 4070 using llama.cpp with the Q4_K_M model.

Use BF16 with Transformers or vLLM if you need the reference checkpoint and stronger integration with Python application stacks. The model card explicitly supports both.

Use vLLM only if you are on version 0.15.1 or newer. That minimum version is called out in the BF16 model card.

The source materials confirm support, but they do not include a verified runnable code block for Transformers or vLLM. Refer to the BF16 model card for implementation details and runtime-specific examples as they are updated.
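Until then, a runtime-agnostic way to wire the model into an application is to target the OpenAI-compatible HTTP endpoint that both vLLM and llama.cpp's server expose. The sketch below uses only the standard library; the URL, port, and model name are placeholders I chose for illustration, not values from NVIDIA's docs:

```python
# Stdlib-only sketch for calling an OpenAI-compatible local server
# (both vLLM and llama.cpp's server expose this request shape).
# The URL, port, and model name are illustrative placeholders.
import json
import urllib.request

def build_chat_request(prompt: str,
                       model: str = "nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16",
                       url: str = "http://localhost:8000/v1/chat/completions"):
    """Build (but do not send) a chat-completions POST request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Once a server is actually running locally:
# with urllib.request.urlopen(build_chat_request("Hello")) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Keeping the client on the generic OpenAI-compatible API also makes it cheap to swap between llama.cpp and vLLM backends later.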

Reasoning-on vs reasoning-off

One of the distinctive features in this release is the model’s reasoning-on / reasoning-off behavior.

NVIDIA says the model can either:

  • emit an internal reasoning trace first, or
  • skip it for lower-latency responses

That gives you a concrete tuning lever:

| Mode | Best for | Tradeoff |
| --- | --- | --- |
| Reasoning-off | Fast interactive chat, game agents, low-latency local assistants | Lower performance on some reasoning-heavy tasks |
| Reasoning-on | Math, harder planning tasks, structured multi-step work | More latency and more output tokens |

NVIDIA’s BF16 card shows clear gains in reasoning-on mode for several benchmarks:

| Benchmark | Reasoning-off | Reasoning-on |
| --- | --- | --- |
| IFEval-Prompt | 82.8 | 87.9 |
| IFEval-Instruction | 88.0 | 92.0 |
| Tau2-Airline | 28.0 | 33.3 |
| Tau2-Retail | 34.8 | 39.8 |
| Tau2-Telecom | 24.9 | 33.0 |
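Recomputing the absolute gains from that table makes the tradeoff concrete: the tool-use-style Tau2 benchmarks benefit most from paying the extra reasoning tokens. This is plain arithmetic on the published BF16 numbers:

```python
# Absolute reasoning-on gains over reasoning-off, recomputed from the
# published BF16 benchmark table above.
REASONING_SCORES = {
    # benchmark: (reasoning_off, reasoning_on)
    "IFEval-Prompt": (82.8, 87.9),
    "IFEval-Instruction": (88.0, 92.0),
    "Tau2-Airline": (28.0, 33.3),
    "Tau2-Retail": (34.8, 39.8),
    "Tau2-Telecom": (24.9, 33.0),
}

def reasoning_gain(benchmark: str) -> float:
    """Absolute score improvement from enabling reasoning."""
    off, on = REASONING_SCORES[benchmark]
    return round(on - off, 1)

# Tau2-Telecom sees the largest jump (+8.1 points).
largest = max(REASONING_SCORES, key=reasoning_gain)
```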

If your application depends on structured outputs or tool-calling loops, the release notes are relevant here. NVIDIA says post-training included reinforcement stages targeting instruction following, structured output, and multi-turn conversational tool use. For application-side control, pair that with strong prompt design and output validation, using patterns from Structured Output from LLMs: JSON Mode Explained and Prompt Engineering Guide: How to Write Better AI Prompts.

Limits and tradeoffs to plan for

This model is compact, but it is still tuned around NVIDIA’s ecosystem and deployment story.

A few constraints stand out from the published materials:

  • Runtime setup varies by backend, and the release docs do not provide one canonical install flow for every framework.
  • vLLM requires 0.15.1+.
  • Quantized formats change benchmark behavior, especially outside long-context retrieval.
  • The exact competitor comparison table from the blog was not fully available in the accessible excerpt, so use NVIDIA’s direct benchmark cards rather than paraphrased comparisons.
  • Reasoning-on mode adds token and latency overhead, which matters on embedded devices.

The model is also targeted toward specific workloads: gaming NPCs, local assistants, IoT automation, robotics, and tool-calling conversational agents. If your use case is retrieval-heavy or multi-step agent orchestration, connect the model choice to your app design, not just the raw benchmark numbers. These two guides are useful reference points: How to Build a RAG Application (Step by Step) and AI Agent Frameworks Compared: LangChain vs CrewAI vs LlamaIndex.

When to choose BF16, FP8, or GGUF

Use this quick rule:

| Choose this | When you need |
| --- | --- |
| GGUF Q4_K_M | Small footprint, Jetson deployment, llama.cpp compatibility, fastest path to local inference |
| BF16 | Reference-quality checkpoint, maximum fidelity to the published baseline, larger GPU memory budget |
| FP8 | Better throughput on supported NVIDIA stacks, especially DGX Spark or Jetson Thor |

For most developers starting today, the most practical rollout path is:

  1. Start with GGUF Q4_K_M on llama.cpp for Jetson or quick RTX validation.
  2. Move to BF16 if your app depends on tighter instruction accuracy or less quantization variance.
  3. Test FP8 on supported NVIDIA systems when throughput is the bottleneck.

Download the variant that matches your hardware first, then validate your actual prompt set in both reasoning-off and reasoning-on modes before wiring it into production agents.

Get Insanely Good at AI

The book for developers who want to understand how AI actually works. LLMs, prompt engineering, RAG, AI agents, and production systems.
