NVIDIA Nemotron 3 Super Redefines Agentic AI with Hybrid MoE
NVIDIA's new Nemotron 3 Super combines Mamba and Transformer architectures with a 1-million token context window to power high-speed autonomous reasoning.
At the GTC conference, NVIDIA released Nemotron 3 Super, a 120-billion-parameter hybrid model designed specifically for autonomous workloads. The architecture combines Mamba-2 state space layers with traditional attention mechanisms to reduce the computational overhead of multi-agent systems. With native support for a 1-million-token context window, the model targets long-duration reasoning tasks where goal drift typically degrades performance.
Architecture and Latent Routing
The model operates on a hybrid backbone that pairs Mamba-2 layers for linear-time sequence efficiency with Transformer layers acting as global anchors. This design keeps attention available for global token-to-token lookups while avoiding much of the quadratic compute and KV-cache growth that pure attention architectures incur at long context.
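To make the memory argument concrete, here is a rough back-of-the-envelope sketch. Only attention layers store a key/value cache that grows with sequence length, while state space layers keep a fixed-size state. The layer counts, head counts, and head dimension below are hypothetical placeholders, not published Nemotron 3 Super values.

```python
def kv_cache_gib(attn_layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Key plus value cache size for one sequence, in GiB."""
    # Factor of 2 covers keys and values; only attention layers contribute.
    values = 2 * attn_layers * kv_heads * head_dim * seq_len
    return values * bytes_per_value / 2**30


# Hypothetical configuration for illustration only.
TOTAL_LAYERS = 80          # assumed depth of a pure-attention model
HYBRID_ATTN_LAYERS = 8     # assumed number of attention layers in a hybrid
KV_HEADS, HEAD_DIM = 8, 128
SEQ_LEN = 1_000_000        # the 1-million-token context from the article

pure = kv_cache_gib(TOTAL_LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN)
hybrid = kv_cache_gib(HYBRID_ATTN_LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN)

print(f"pure attention: {pure:.1f} GiB per sequence")
print(f"hybrid:         {hybrid:.1f} GiB per sequence")
```

Under these assumed numbers the cache shrinks in proportion to the number of attention layers that remain, which is why the split between layer types matters so much at long context.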
Routing efficiency relies on a new Latent Mixture-of-Experts (MoE) design. The system projects tokens from a 4096-dimensional space down to 1024 dimensions before routing them. Routing in this compressed space lets the model select the top 22 of 512 total experts for each token. The result is 120.6 billion total parameters with only 12.7 billion active per forward pass.
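The routing step can be sketched in a few lines of plain Python. This is a minimal illustration, not NVIDIA's implementation: the weights are random, the hidden and latent sizes are scaled down (64 and 16 instead of 4096 and 1024) so it runs quickly, and the softmax over the selected experts is an assumed gating choice. The expert count and top-k match the article.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Random weight matrix; stands in for learned parameters."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def route_token(token, down_proj, router, top_k):
    """Project to the latent space, score every expert, keep the top_k."""
    latent = matvec(down_proj, token)   # hidden_dim -> latent_dim
    scores = matvec(router, latent)     # one score per expert
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Assumed gating: softmax over the selected experts only.
    peak = max(scores[i] for i in chosen)
    exps = [math.exp(scores[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


rng = random.Random(0)
HIDDEN, LATENT = 64, 16        # scaled-down stand-ins for 4096 and 1024
EXPERTS, TOP_K = 512, 22       # values from the article

down_proj = make_matrix(LATENT, HIDDEN, rng)
router = make_matrix(EXPERTS, LATENT, rng)
token = [rng.gauss(0.0, 1.0) for _ in range(HIDDEN)]

assignments = route_token(token, down_proj, router, TOP_K)
print(len(assignments), "experts selected")
```

Because the router matrix has one row per expert and one column per latent dimension, shrinking the latent space from 4096 to 1024 cuts the router's size and scoring cost by the same 4x factor.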
Inference Optimization and Hardware Scaling
Inference speed dictates the viability of agentic workflows. Nemotron 3 Super uses Multi-Token Prediction (MTP) to forecast multiple future tokens in a single pass. This provides native speculative decoding without requiring a separate draft model, yielding up to a 3x speedup during generation.
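The draft-and-verify idea behind speculative decoding can be shown with a toy example. The article does not describe how NVIDIA implements it, so this is a generic greedy sketch: `draft_next_k` is a hypothetical stand-in for the MTP heads, `verify_next` is a hypothetical stand-in for the main model, and verification is written as a loop for clarity where a real system would check all drafted positions in one batched forward pass.

```python
from typing import List, Tuple


def verify_next(context: List[int]) -> int:
    """Toy stand-in for the main model: always continues by counting up."""
    return (context[-1] + 1) % 100


def draft_next_k(context: List[int], k: int) -> List[int]:
    """Toy stand-in for MTP heads: counts up but wrongly skips multiples of 5."""
    drafted, last = [], context[-1]
    for _ in range(k):
        last = (last + 1) % 100
        if last % 5 == 0:
            last = (last + 1) % 100  # deliberate drafting error
        drafted.append(last)
    return drafted


def speculative_step(context: List[int], k: int) -> List[int]:
    """Accept drafted tokens until the first mismatch, then take the verified token."""
    accepted: List[int] = []
    for token in draft_next_k(context, k):
        expected = verify_next(context + accepted)
        if token != expected:
            accepted.append(expected)
            break
        accepted.append(token)
    return accepted


def generate(start: List[int], new_tokens: int, k: int) -> Tuple[List[int], int]:
    context, steps = list(start), 0
    while len(context) - len(start) < new_tokens:
        context.extend(speculative_step(context, k))
        steps += 1
    return context[: len(start) + new_tokens], steps


tokens, steps = generate([0], new_tokens=12, k=3)
print(tokens, steps)
```

In this toy run, 12 tokens are produced in 5 verification steps instead of 12, and the output is identical to what the verifier alone would generate. The real speedup depends on how often the drafted tokens are accepted.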
The model was pre-trained natively in 4-bit floating point precision (NVFP4). When deployed on NVIDIA Blackwell B200 or B300 GPUs, this format reduces the memory footprint and accelerates inference by 4x compared to FP8 execution on earlier Hopper hardware. Open-weight checkpoints are available in BF16, FP8, and NVFP4 formats via Hugging Face and NVIDIA NIM. Integrations are currently active across Together AI, Amazon Bedrock, Perplexity, CodeRabbit, and Palantir.
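A quick calculation shows why the weight format matters for deployment. The sketch below estimates weight storage for the 120.6 billion total parameters at each format's nominal bit width. It is an approximation under stated assumptions: it ignores quantization scale factors, any layers kept at higher precision, activations, and the KV cache.

```python
TOTAL_PARAMS = 120.6e9  # total parameter count from the article

# Nominal bits per weight; scaling metadata is ignored in this estimate.
BITS_PER_WEIGHT = {"BF16": 16, "FP8": 8, "NVFP4": 4}


def weight_gigabytes(params: float, bits: int) -> float:
    """Approximate weight storage in decimal gigabytes."""
    return params * bits / 8 / 1e9


weights = {name: weight_gigabytes(TOTAL_PARAMS, bits)
           for name, bits in BITS_PER_WEIGHT.items()}

for name, gigabytes in weights.items():
    print(f"{name}: about {gigabytes:.1f} GB of weights")
```

By this estimate the weights take roughly 241 GB in BF16, 121 GB in FP8, and 60 GB in NVFP4, which determines how many GPUs are needed just to hold the model before any context is loaded.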
Benchmark Results and Throughput
Throughput determines how far the model can be pushed in production. On a workload with 8k input tokens and 64k output tokens, Nemotron 3 Super delivers higher throughput than competing open-source models in its size class: GPT-OSS-120B reaches 0.45x and Qwen3.5-122B reaches 0.13x of its rate.
| Model | Parameters | SWE-Bench Verified | Relative Throughput |
|---|---|---|---|
| Nemotron 3 Super | 120.6B (12.7B active) | 60.47% | Baseline (1.0x) |
| GPT-OSS-120B | ~120B | Not specified | 0.45x |
| Qwen3.5-122B | 122B | Not specified | 0.13x |
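The relative throughput column is normalized to Nemotron 3 Super, so inverting each value gives the speedup from the other direction. A small sketch of that arithmetic, using only the figures in the table:

```python
# Relative throughput values from the table (Nemotron 3 Super = 1.0x).
relative_throughput = {"GPT-OSS-120B": 0.45, "Qwen3.5-122B": 0.13}


def speedup_over(relative: float) -> float:
    """How many times faster the baseline is than a model at `relative` throughput."""
    return 1.0 / relative


speedups = {name: round(speedup_over(r), 2)
            for name, r in relative_throughput.items()}

for name, factor in speedups.items():
    print(f"Nemotron 3 Super vs {name}: about {factor}x")
```

That works out to roughly 2.2x the throughput of GPT-OSS-120B and 7.7x that of Qwen3.5-122B on this workload.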
The model secured the top position on the DeepResearch Bench and DeepResearch Bench II leaderboards. Training involved 25 trillion tokens, including Nemotron-Pretraining-Specialized-v1.1, a dataset focused on formal logic and coding. Post-training used reinforcement learning across 21 environments via NVIDIA NeMo Gym, totaling 1.2 million environment rollouts.
If you build long-running agents, your infrastructure needs to handle extended context windows without degrading output quality. Transitioning to a hybrid Mamba-Transformer architecture requires evaluating your current KV cache configuration and batching strategies. Evaluate the NVFP4 weights on Blackwell hardware to determine if the throughput gains justify the migration cost for your specific retrieval or coding pipelines.