
NVIDIA Nemotron 3 Super Redefines Agentic AI with Hybrid MoE

NVIDIA's new Nemotron 3 Super combines Mamba and Transformer architectures with a 1-million token context window to power high-speed autonomous reasoning.

At its GTC conference, NVIDIA released Nemotron 3 Super, a 120-billion-parameter hybrid model designed specifically for autonomous agent workloads. The architecture combines Mamba-2 state-space layers with traditional attention mechanisms to mitigate the computational tax of multi-agent systems. With native support for a 1-million-token context window, the model targets long-duration reasoning tasks where goal drift typically degrades performance.

Architecture and Latent Routing

The model operates on a hybrid backbone that pairs Mamba-2 layers, which process sequences in linear time, with Transformer attention layers that act as global anchors. This design preserves long-range reasoning pathways without the quadratic memory scaling of pure-attention architectures.
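The interleaving idea can be sketched in a few lines. NVIDIA has not published the exact Mamba-to-attention ratio, so the one-attention-layer-in-six pattern below is purely an illustrative assumption:

```python
# Illustrative sketch of a hybrid Mamba-2 / attention layer schedule.
# The real interleaving ratio in Nemotron 3 Super is not public; the
# "attention anchor every 6th layer" pattern here is an assumption.

def hybrid_schedule(n_layers: int, attention_every: int = 6) -> list[str]:
    """Place a full-attention 'anchor' every `attention_every` layers;
    all other positions use linear-time Mamba-2 blocks."""
    return [
        "attention" if (i + 1) % attention_every == 0 else "mamba2"
        for i in range(n_layers)
    ]

schedule = hybrid_schedule(12)
print(schedule)  # mostly 'mamba2', with 'attention' at positions 6 and 12
```

The payoff of such a schedule is that only the sparse attention layers accumulate a KV cache that grows with sequence length; the Mamba-2 layers carry fixed-size recurrent state.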

Routing efficiency relies on a new Latent Mixture-of-Experts (MoE) design. The router projects each token from the 4096-dimensional hidden space down to 1024 dimensions before scoring experts. This compression makes it cheap for the model to select the top 22 of 512 total experts per token. The result is 120.6 billion total parameters with only 12.7 billion active parameters per forward pass.
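The latent-routing step described above can be sketched with NumPy. The dimensions (4096, 1024, 512 experts, top 22) come from the article; the random weights and the plain argsort-based selection are simplifications of whatever NVIDIA actually uses:

```python
import numpy as np

# Sketch of latent MoE routing: tokens are projected from the 4096-d
# hidden space to a 1024-d latent space, and only then does the router
# score all 512 experts and keep the top 22. Weights are random here;
# this illustrates the shapes, not the trained behavior.
rng = np.random.default_rng(0)

D_MODEL, D_LATENT, N_EXPERTS, TOP_K = 4096, 1024, 512, 22

W_down = rng.standard_normal((D_MODEL, D_LATENT)) * 0.02    # latent projection
W_router = rng.standard_normal((D_LATENT, N_EXPERTS)) * 0.02

def route(token: np.ndarray) -> np.ndarray:
    """Return the indices of the TOP_K experts selected for one token."""
    latent = token @ W_down              # (1024,) -- routing cost paid here
    logits = latent @ W_router           # (512,) one score per expert
    return np.argsort(logits)[-TOP_K:]   # ids of the top-22 experts

token = rng.standard_normal(D_MODEL)
experts = route(token)
print(len(experts))  # 22
```

The point of the down-projection is that the router matmul runs against 1024-dimensional inputs instead of 4096-dimensional ones, cutting routing cost by roughly 4x before any expert computation begins.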

Inference Optimization and Hardware Scaling

Inference speed dictates the viability of agentic workflows. Nemotron 3 Super uses Multi-Token Prediction (MTP) to forecast multiple future tokens in a single pass. This provides native speculative decoding without requiring a separate draft model, yielding up to a 3x speedup during generation.
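A toy calculation shows why predicting multiple tokens per pass compounds into the reported speedup. The draft width and acceptance rate below are illustrative assumptions, not published Nemotron figures:

```python
# Toy model of multi-token prediction throughput: each forward pass emits
# 1 guaranteed token plus k draft tokens, of which a fraction `accept`
# survive verification. k=3 and accept=0.67 are assumed for illustration.

def tokens_per_pass(k_draft: int, accept_rate: float) -> float:
    """Expected tokens generated per forward pass with MTP drafting."""
    return 1 + k_draft * accept_rate

print(tokens_per_pass(3, 0.67))  # 3.01 -- roughly the 3x speedup claimed
```

Because the draft head lives inside the model itself, there is no separate draft network to load or keep in sync, which is what "native speculative decoding" means here.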

The model was pre-trained natively in 4-bit floating-point precision (NVFP4). When deployed on NVIDIA Blackwell B200 or B300 GPUs, this format reduces the memory footprint and accelerates inference by 4x compared to FP8 execution on earlier Hopper hardware. Open-weight quantized versions are available in BF16, FP8, and NVFP4 formats via Hugging Face and NVIDIA NIM. Integrations are currently active across Together AI, Amazon Bedrock, Perplexity, CodeRabbit, and Palantir.
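As a rough sanity check on the memory claim, weight storage alone scales linearly with bytes per parameter across the three released formats:

```python
# Back-of-the-envelope weight footprint for the 120.6B-parameter model
# at the three released precisions. This ignores activation memory, the
# KV cache, and quantization scale metadata, so treat these as lower
# bounds on real deployment memory.
PARAMS = 120.6e9
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "NVFP4": 0.5}

for fmt, b in BYTES_PER_PARAM.items():
    gib = PARAMS * b / 2**30
    print(f"{fmt}: {gib:.0f} GiB")
```

At roughly 56 GiB of weights, the NVFP4 checkpoint fits on a single Blackwell-class GPU, whereas the BF16 version (about 225 GiB) requires a multi-GPU node.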

Benchmark Results and Throughput

Throughput determines the production limits of the model. On a workload with 8k input tokens and 64k output tokens, Nemotron 3 Super demonstrates clear throughput gains over competing open-source models in its size class.

| Model | Parameters | SWE-Bench Verified | Relative Throughput |
| --- | --- | --- | --- |
| Nemotron 3 Super | 120.6B (12.7B active) | 60.47% | Baseline (1.0x) |
| GPT-OSS-120B | ~120B | Not specified | 0.45x |
| Qwen3.5-122B | 122B | Not specified | 0.13x |

The model secured the top position on the DeepResearch Bench and DeepResearch Bench II leaderboards. Training involved 25 trillion tokens, including a specialized Nemotron-Pretraining-Specialized-v1.1 formal logic and coding dataset. Post-training utilized reinforcement learning across 21 environments via NVIDIA NeMo Gym, totaling 1.2 million environment rollouts.

If you build long-running agents, your infrastructure needs to handle extended context windows without degrading output quality. Transitioning to a hybrid Mamba-Transformer architecture requires evaluating your current KV cache configuration and batching strategies. Evaluate the NVFP4 weights on Blackwell hardware to determine if the throughput gains justify the migration cost for your specific retrieval or coding pipelines.
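The KV-cache question raised above can be made concrete with a rough sizing calculation. The layer count, KV-head count, head dimension, and the fraction of layers that keep attention are all assumptions chosen for illustration; Nemotron 3 Super's exact configuration is not given in this article:

```python
# Rough KV-cache sizing at a 1M-token context, comparing a pure-attention
# stack against a hybrid where only a subset of layers keep a KV cache
# (Mamba-2 layers carry constant-size state instead). All configuration
# numbers below are illustrative assumptions, not published specs.

def kv_cache_gib(n_attn_layers: int, seq_len: int = 1_000_000,
                 n_kv_heads: int = 8, head_dim: int = 128,
                 bytes_per_val: int = 2) -> float:
    """KV-cache size in GiB; the factor of 2 covers keys plus values."""
    return (2 * n_attn_layers * seq_len * n_kv_heads
            * head_dim * bytes_per_val) / 2**30

pure = kv_cache_gib(n_attn_layers=60)    # every layer is attention
hybrid = kv_cache_gib(n_attn_layers=10)  # 1-in-6 attention anchors
print(f"pure attention: {pure:.0f} GiB, hybrid: {hybrid:.0f} GiB")
```

Under these assumptions the hybrid design cuts cache memory sixfold, which is the kind of difference that decides whether a 1-million-token agent session fits on your existing hardware at all.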

Get Insanely Good at AI

The book for developers who want to understand how AI actually works. LLMs, prompt engineering, RAG, AI agents, and production systems.
