NVIDIA Nemotron 3 Super Redefines Agentic AI with Hybrid MoE
NVIDIA's new Nemotron 3 Super combines Mamba and Transformer architectures with a 1-million token context window to power high-speed autonomous reasoning.
At its GTC conference, NVIDIA released Nemotron 3 Super, a 120-billion-parameter hybrid model designed specifically for agentic workloads. The architecture combines Mamba-2 state space layers with traditional attention mechanisms to reduce the computational tax of multi-agent systems. With native support for a 1-million-token context window, the model targets long-duration reasoning tasks where goal drift typically degrades performance.
Architecture and Latent Routing
The model operates on a hybrid backbone that pairs Mamba-2 layers, which process sequences in linear time, with Transformer attention layers acting as global anchors. This design preserves long-range reasoning pathways without the quadratic memory scaling of pure attention architectures.
Routing efficiency comes from a new Latent Mixture-of-Experts (MoE) design. Instead of routing tokens in the full 4096-dimensional hidden space, the system first projects them down to 1024 dimensions. That cheaper routing computation lets the model select the top 22 of 512 experts per token, yielding 120.6 billion total parameters with only 12.7 billion active per forward pass.
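The latent routing step can be sketched in a few lines. This is a toy NumPy illustration using the dimensions reported above (4096 → 1024 projection, top 22 of 512 experts); the projection and router weights are random placeholders, not Nemotron's actual parameters, and real routers add load-balancing terms this sketch omits.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions from the article; everything else is an illustrative assumption.
D_MODEL = 4096      # token hidden size
D_LATENT = 1024     # compressed routing space
N_EXPERTS = 512     # total experts
TOP_K = 22          # experts activated per token

# Hypothetical projection and router weights (random stand-ins).
W_down = rng.standard_normal((D_MODEL, D_LATENT)) / np.sqrt(D_MODEL)
W_router = rng.standard_normal((D_LATENT, N_EXPERTS)) / np.sqrt(D_LATENT)

def route(token: np.ndarray):
    """Project a token into the latent space, then pick the top-k experts."""
    latent = token @ W_down            # 4096 -> 1024: routing happens here
    logits = latent @ W_router         # 1024 -> 512 expert scores
    top = np.argsort(logits)[-TOP_K:]  # indices of the top-22 experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()               # softmax over the selected experts only
    return top, gates

token = rng.standard_normal(D_MODEL)
experts, gates = route(token)
print(len(experts), round(gates.sum(), 6))  # 22 experts, gates sum to 1.0
```

The point of the latent projection is that the router's matrix multiply shrinks by 4x, which is what makes a wide expert pool with a large top-k affordable.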
Inference Optimization and Hardware Scaling
Inference speed dictates the viability of agentic workflows. Nemotron 3 Super uses Multi-Token Prediction (MTP) to forecast multiple future tokens in a single pass. This provides native speculative decoding without requiring a separate draft model, yielding up to a 3x speedup during generation.
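The draft-and-verify loop behind MTP-style self-speculative decoding can be sketched as follows. The `draft_tokens` and `verify` functions here are toy stand-ins for the model's MTP heads and main head (integers instead of real tokens, an arbitrary rejection rule); the point is the control flow, which accepts the longest verified prefix and still makes progress when every drafted token is rejected.

```python
def draft_tokens(prefix, k=4):
    # Stand-in for MTP heads proposing k future tokens in one forward pass.
    return [(prefix[-1] + i + 1) % 100 for i in range(k)]

def verify(prefix, proposed):
    # Stand-in for the main head: keep the longest accepted prefix.
    accepted = []
    for tok in proposed:
        if tok % 7 != 0:        # toy rejection rule, not a real model check
            accepted.append(tok)
        else:
            break
    return accepted

def generate(prompt, n_tokens):
    out = list(prompt)
    while len(out) < len(prompt) + n_tokens:
        proposed = draft_tokens(out)
        accepted = verify(out, proposed)
        # Always make progress: on full rejection, emit one verified token.
        out.extend(accepted if accepted else [proposed[0]])
    return out[len(prompt):len(prompt) + n_tokens]

print(generate([1], 8))  # → [2, 3, 4, 5, 6, 7, 8, 9]
```

When most drafts are accepted, each verification pass commits several tokens at once, which is where the reported speedup comes from; a separate draft model is unnecessary because the same network produces the drafts.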
The model was pre-trained natively in 4-bit floating point precision (NVFP4). When deployed on NVIDIA Blackwell B200 or B300 GPUs, this format reduces memory footprints and accelerates inference by 4x compared to FP8 execution on earlier Hopper hardware. Open-weight quantized versions are available in BF16, FP8, and NVFP4 formats via Hugging Face and NVIDIA NIM. Integrations are currently active across Together AI, Amazon Bedrock, Perplexity, CodeRabbit, and Palantir.
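Block-scaled 4-bit quantization of the kind NVFP4 uses can be illustrated with a minimal sketch: each small block of weights shares one scale factor, and individual values snap to the nearest FP4 (E2M1) magnitude. The block size of 16 and the rounding scheme below are simplifying assumptions for illustration, not NVIDIA's exact kernel behavior.

```python
import numpy as np

# Non-negative magnitudes representable in FP4 E2M1 (sign stored separately).
FP4_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 16  # assumed per-block granularity for the shared scale

def quantize_fp4(weights: np.ndarray):
    """Quantize a 1-D weight vector to FP4 with one scale per block."""
    w = weights.reshape(-1, BLOCK)
    # Scale each block so its largest magnitude maps to the FP4 maximum (6.0).
    scale = np.abs(w).max(axis=1, keepdims=True) / FP4_VALUES[-1]
    scale = np.where(scale == 0, 1.0, scale)
    scaled = w / scale
    # Snap each element to the nearest representable FP4 magnitude.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_VALUES).argmin(axis=-1)
    q = np.sign(scaled) * FP4_VALUES[idx]
    return q, scale

def dequantize(q, scale):
    return (q * scale).reshape(-1)

w = np.random.default_rng(1).standard_normal(64).astype(np.float32)
q, s = quantize_fp4(w)
err = np.abs(dequantize(q, s) - w).max()
print(q.shape, err <= s.max())
```

The memory win is mechanical: 4 bits per weight plus a small per-block scale, roughly half of FP8's footprint, which is why a 120B-parameter model becomes tractable on a single node of Blackwell GPUs.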
Benchmark Results and Throughput
Throughput sets the production ceiling for agentic deployments. On a workload with 8k-token inputs and 64k-token outputs, Nemotron 3 Super outpaces competing open-source models in its size class:
| Model | Parameters | SWE-Bench Verified | Relative Throughput |
|---|---|---|---|
| Nemotron 3 Super | 120.6B (12.7B active) | 60.47% | Baseline (1.0x) |
| GPT-OSS-120B | ~120B | Not specified | 0.45x |
| Qwen3.5-122B | 122B | Not specified | 0.13x |
The model secured the top position on the DeepResearch Bench and DeepResearch Bench II leaderboards. Training involved 25 trillion tokens, including a specialized Nemotron-Pretraining-Specialized-v1.1 formal logic and coding dataset. Post-training utilized reinforcement learning across 21 environments via NVIDIA NeMo Gym, totaling 1.2 million environment rollouts.
If you build long-running agents, your infrastructure needs to handle extended context windows without degrading output quality. Transitioning to a hybrid Mamba-Transformer architecture requires evaluating your current KV cache configuration and batching strategies. Evaluate the NVFP4 weights on Blackwell hardware to determine if the throughput gains justify the migration cost for your specific retrieval or coding pipelines.