Open Nemotron 3 Nano Omni Merges Mamba2 With Transformers
NVIDIA released Nemotron 3 Nano Omni, a hybrid MoE model combining Mamba2 and Transformer layers to unify agentic reasoning across four modalities.
NVIDIA released Nemotron 3 Nano Omni, an open foundation model designed to process text, image, audio, and video inputs in a single forward pass. Released on April 28, 2026, the model targets agentic workflows by replacing multi-model pipelines with a unified reasoning system. The architecture is a 30B-A3B hybrid Mixture-of-Experts design that activates only 3.6 billion of its 31.6 billion total parameters per token.
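A minimal sketch of what single-forward-pass multimodal inference looks like through Hugging Face transformers. The repository id, processor behavior, and argument names below are assumptions for illustration; check the published model card for the actual identifiers and preprocessing steps.

```python
# Hypothetical usage sketch -- the repo id "nvidia/Nemotron-3-Nano-Omni"
# and the processor arguments are assumptions, not confirmed API.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "nvidia/Nemotron-3-Nano-Omni"  # assumed repo id
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, device_map="auto"
)

# Mixed modalities in one request: an image plus a text instruction.
image = Image.open("slide.png")
inputs = processor(text="Summarize this slide.", images=image,
                   return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```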
Hybrid Architecture and Encoders
The model achieves its speed by combining Mamba2 sequence layers with traditional Transformer reasoning layers. The Mamba2 components keep sequence processing efficient across the 256,000-token context window. This lets the system ingest lengthy video timelines and large document sets without the typical memory overhead of pure attention mechanisms.
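NVIDIA has not published the exact layer interleaving here, but the hybrid idea can be illustrated with a toy stack that alternates linear-time sequence mixers (standing in for Mamba2 blocks) with quadratic attention blocks. Every ratio and dimension below is made up for illustration, not the shipped architecture.

```python
import torch
import torch.nn as nn

class CausalConvMixer(nn.Module):
    """Linear-time stand-in for a Mamba2 block (illustrative only)."""
    def __init__(self, d_model, kernel=4):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel,
                              padding=kernel - 1, groups=d_model)

    def forward(self, x):                            # x: (batch, seq, d_model)
        y = self.conv(x.transpose(1, 2))             # (batch, d_model, seq + pad)
        return y[..., : x.size(1)].transpose(1, 2)   # trim pad to stay causal

class ToyHybridStack(nn.Module):
    """Alternates linear-time mixers with attention, echoing the hybrid idea.
    The 3:1 ratio and sizes are assumptions, not Nemotron's real layout."""
    def __init__(self, d_model=512, n_layers=8, attn_every=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.MultiheadAttention(d_model, 8, batch_first=True)
            if (i + 1) % attn_every == 0 else CausalConvMixer(d_model)
            for i in range(n_layers)
        )

    def forward(self, x):
        for block in self.blocks:
            if isinstance(block, nn.MultiheadAttention):
                x = x + block(x, x, x, need_weights=False)[0]
            else:
                x = x + block(x)
        return x

x = torch.randn(1, 1024, 512)  # long sequences stay cheap in the mixer layers
print(ToyHybridStack()(x).shape)
```

The payoff of this layout is that only a minority of layers pay attention's quadratic cost, which is what makes a 256,000-token window tractable.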
For modality ingestion, Nemotron 3 Nano Omni integrates specialized encoders via lightweight 2-layer MLP projectors. Vision is handled by the C-RADIOv4-H encoder, which supports full HD 1920x1080 resolution. Audio relies on the Parakeet-TDT-0.6B-v2 encoder. By processing mixed-modality inputs natively, the model eliminates inference hops: developers no longer need to run a separate transcription model before passing text to an LLM. If you build multimodal AI agents, this single-request processing simplifies error handling and state management.
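The 2-layer MLP projector pattern itself is standard in multimodal models and easy to sketch: encoder features are mapped into the LLM's embedding space and then treated as ordinary tokens. The dimensions below are placeholders, since the real encoder and LLM widths are not given in the source.

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """2-layer MLP mapping encoder features into the LLM embedding space.
    Dimensions are placeholders, not the published encoder/LLM widths."""
    def __init__(self, encoder_dim=1280, llm_dim=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(encoder_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, encoder_tokens):   # (batch, n_tokens, encoder_dim)
        return self.net(encoder_tokens)  # (batch, n_tokens, llm_dim)

# Vision or audio features become ordinary tokens in the LLM sequence.
features = torch.randn(1, 196, 1280)   # e.g., image patch embeddings
llm_ready = MLPProjector()(features)
print(llm_ready.shape)                 # torch.Size([1, 196, 2048])
```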
Benchmark Results
NVIDIA reports up to 9x higher throughput and 2.9x faster reasoning than existing open multimodal models at equivalent interactivity thresholds. The model was pretrained on 127 billion cross-modal tokens and post-trained on 124 million curated examples using NeMo RL and NeMo Gym.
| Category | Benchmark | Position / Result |
|---|---|---|
| Document Intelligence | MMLongBench-Doc | 1st (Open Omni) |
| Document Intelligence | OCRBenchV2 | 1st (Open Omni) |
| Audio & Video | WorldSense | Leading |
| Audio & Video | DailyOmni | Leading |
| Audio & Video | VoiceBench | Leading |
| Efficiency | MediaPerf | Most cost-efficient |
Compute Allocation and GUI Navigation
The model is specifically tuned for agentic computer use. It interprets high-fidelity GUI states and 1080p screen recordings directly, translating visual interface elements into navigational logic. This native visual processing reduces the need for external OCR or bounding-box extraction steps.
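In practice, a GUI-grounding call is just a multimodal chat request with a screenshot attached. The sketch below uses an OpenAI-compatible endpoint, since OpenRouter hosts the model (see the availability section); the model slug and the prompt wording are assumptions, so consult the provider docs before use.

```python
# Sketch of a GUI-grounding request over an OpenAI-compatible API.
# The model slug "nvidia/nemotron-3-nano-omni" is an assumption.
import base64
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")

with open("screenshot_1080p.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="nvidia/nemotron-3-nano-omni",  # assumed slug
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Locate the 'Export' button and describe the click target."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```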
The architecture includes a toggleable reasoning mode that lets developers increase the thinking budget on a per-request basis. Complex tasks like multi-step graphical navigation can draw on deeper compute, while simpler extraction tasks run at base speed. Managing this tradeoff is critical when scaling a hybrid MoE in production environments with strict latency requirements.
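The article does not specify how the toggle is exposed. Some earlier Nemotron releases switched reasoning via a system-prompt flag, so the sketch below assumes that convention; the flag strings and token budgets are guesses to be verified against the model card.

```python
# Assumed mechanism: a system-prompt reasoning flag plus a token budget.
# Neither is confirmed by the source -- verify before relying on this.
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")

def ask(prompt: str, deep_reasoning: bool) -> str:
    system_flag = "/think" if deep_reasoning else "/no_think"  # assumed flags
    response = client.chat.completions.create(
        model="nvidia/nemotron-3-nano-omni",  # assumed slug
        messages=[
            {"role": "system", "content": system_flag},
            {"role": "user", "content": prompt},
        ],
        max_tokens=1024 if deep_reasoning else 256,  # crude budget control
    )
    return response.choices[0].message.content

# Deep compute for multi-step navigation, base speed for simple extraction.
plan = ask("Plan the clicks to export this report as a PDF.", deep_reasoning=True)
field = ask("What is the invoice number on screen?", deep_reasoning=False)
```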
Ecosystem and Availability
NVIDIA released the model weights, datasets, and training recipes under an open license. The model is currently accessible on Hugging Face, OpenRouter, and fal.ai. Enterprise teams can deploy it as an NVIDIA NIM microservice via build.nvidia.com. Amazon SageMaker JumpStart, Palantir, Dell Technologies, and Foxconn have integrated the model for evaluation in corporate workflows.
If your system currently uses separate vision, transcription, and reasoning models, calculate your combined latency and token costs. Replacing that chain with a single 3.6B active parameter model changes the baseline for how responsive your workflow can be. Evaluate the model on your specific domain data to confirm it meets your accuracy floor before deprecating your existing multi-step pipelines.
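A back-of-envelope comparison makes the baseline concrete. Every number below is a placeholder; substitute latency and per-call cost measurements from your own stack.

```python
# Three-model chain versus one unified call. All figures are placeholders.
pipeline = {
    "vision_model":  {"latency_s": 0.8, "cost_per_call": 0.002},
    "transcription": {"latency_s": 1.2, "cost_per_call": 0.004},
    "reasoning_llm": {"latency_s": 2.5, "cost_per_call": 0.010},
}
unified = {"latency_s": 2.0, "cost_per_call": 0.008}  # placeholder

chain_latency = sum(s["latency_s"] for s in pipeline.values())  # serial hops
chain_cost = sum(s["cost_per_call"] for s in pipeline.values())

print(f"chain:   {chain_latency:.1f}s  ${chain_cost:.3f}/request")
print(f"unified: {unified['latency_s']:.1f}s  ${unified['cost_per_call']:.3f}/request")
print(f"latency saved: {chain_latency - unified['latency_s']:.1f}s per request")
```

With measured numbers plugged in, this comparison tells you whether the unified model clears your latency and cost bar before you touch accuracy testing.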