Open Nemotron 3 Nano Omni Merges Mamba2 With Transformers
NVIDIA released Nemotron 3 Nano Omni, a hybrid MoE model combining Mamba2 and Transformer layers to unify agentic reasoning across four modalities.
NVIDIA released Nemotron 3 Nano Omni, an open foundation model designed to process text, image, audio, and video inputs in a single forward pass. Released on April 28, 2026, the model targets agentic workflows by replacing multi-model pipelines with a unified reasoning system. The architecture is a 30B-A3B hybrid Mixture-of-Experts design that activates only 3.6 billion of its 31.6 billion total parameters per token.
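A minimal sketch of what single-forward-pass multimodal inference looks like through Hugging Face transformers. The repository id, processor behavior, and argument names below are assumptions for illustration; check the published model card for the actual identifiers and preprocessing steps.

```python
# Hypothetical usage sketch -- the repo id "nvidia/Nemotron-3-Nano-Omni"
# and the processor arguments are assumptions, not confirmed API.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "nvidia/Nemotron-3-Nano-Omni"  # assumed repo id
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, device_map="auto"
)

# Mixed modalities in one request: an image plus a text instruction.
image = Image.open("slide.png")
inputs = processor(text="Summarize this slide.", images=image,
                   return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```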
Hybrid Architecture and Encoders
The model achieves its speed by combining Mamba2 sequence layers with traditional Transformer reasoning layers. The Mamba2 components keep sequence processing efficient across the 256,000-token context window. This lets the system ingest lengthy video timelines and large document sets without the typical memory overhead of pure attention mechanisms.
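NVIDIA has not published the exact layer interleaving here, but the hybrid idea can be illustrated with a toy stack that alternates linear-time sequence mixers (standing in for Mamba2 blocks) with quadratic attention blocks. Every ratio and dimension below is made up for illustration, not the shipped architecture.

```python
import torch
import torch.nn as nn

class CausalConvMixer(nn.Module):
    """Linear-time stand-in for a Mamba2 block (illustrative only)."""
    def __init__(self, d_model, kernel=4):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel,
                              padding=kernel - 1, groups=d_model)

    def forward(self, x):                            # x: (batch, seq, d_model)
        y = self.conv(x.transpose(1, 2))             # (batch, d_model, seq + pad)
        return y[..., : x.size(1)].transpose(1, 2)   # trim pad to stay causal

class ToyHybridStack(nn.Module):
    """Alternates linear-time mixers with attention, echoing the hybrid idea.
    The 3:1 ratio and sizes are assumptions, not Nemotron's real layout."""
    def __init__(self, d_model=512, n_layers=8, attn_every=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.MultiheadAttention(d_model, 8, batch_first=True)
            if (i + 1) % attn_every == 0 else CausalConvMixer(d_model)
            for i in range(n_layers)
        )

    def forward(self, x):
        for block in self.blocks:
            if isinstance(block, nn.MultiheadAttention):
                x = x + block(x, x, x, need_weights=False)[0]
            else:
                x = x + block(x)
        return x

x = torch.randn(1, 1024, 512)  # long sequences stay cheap in the mixer layers
print(ToyHybridStack()(x).shape)
```

The payoff of this layout is that only a minority of layers pay attention's quadratic cost, which is what makes a 256,000-token window tractable.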
For modality ingestion, Nemotron 3 Nano Omni integrates specialized encoders via lightweight 2-layer MLP projectors. Vision is handled by the C-RADIOv4-H encoder, which supports full HD 1920x1080 resolution. Audio relies on the Parakeet-TDT-0.6B-v2 encoder. By processing mixed-modality inputs natively, the model eliminates inference hops: developers no longer need to run a separate transcription model before passing text to an LLM. If you build multimodal AI agents, this single-request processing simplifies error handling and state management.
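The 2-layer MLP projector pattern itself is standard in multimodal models and easy to sketch: encoder features are mapped into the LLM's embedding space and then treated as ordinary tokens. The dimensions below are placeholders, since the real encoder and LLM widths are not given in the source.

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """2-layer MLP mapping encoder features into the LLM embedding space.
    Dimensions are placeholders, not the published encoder/LLM widths."""
    def __init__(self, encoder_dim=1280, llm_dim=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(encoder_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, encoder_tokens):   # (batch, n_tokens, encoder_dim)
        return self.net(encoder_tokens)  # (batch, n_tokens, llm_dim)

# Vision or audio features become ordinary tokens in the LLM sequence.
features = torch.randn(1, 196, 1280)   # e.g., image patch embeddings
llm_ready = MLPProjector()(features)
print(llm_ready.shape)                 # torch.Size([1, 196, 2048])
```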
Benchmark Results
NVIDIA reports up to 9x higher throughput and 2.9x faster reasoning than existing open multimodal models at equivalent interactivity thresholds. The model was pretrained on 127 billion cross-modal tokens and post-trained on 124 million curated examples using NeMo RL and NeMo Gym.
| Category | Benchmark | Position / Result |
|---|---|---|
| Document Intelligence | MMLongBench-Doc | 1st (Open Omni) |
| Document Intelligence | OCRBenchV2 | 1st (Open Omni) |
| Audio & Video | WorldSense | Leading |
| Audio & Video | DailyOmni | Leading |
| Audio & Video | VoiceBench | Leading |
| Efficiency | MediaPerf | Most cost-efficient |
Compute Allocation and GUI Navigation
The model is specifically tuned for agentic computer use. It interprets high-fidelity GUI states and 1080p screen recordings directly, translating visual interface elements into navigational logic. This native visual processing reduces the need for external OCR or bounding-box extraction steps.
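In practice, a GUI-grounding call is just a multimodal chat request with a screenshot attached. The sketch below uses an OpenAI-compatible endpoint, since OpenRouter hosts the model (see the availability section); the model slug and the prompt wording are assumptions, so consult the provider docs before use.

```python
# Sketch of a GUI-grounding request over an OpenAI-compatible API.
# The model slug "nvidia/nemotron-3-nano-omni" is an assumption.
import base64
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")

with open("screenshot_1080p.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="nvidia/nemotron-3-nano-omni",  # assumed slug
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Locate the 'Export' button and describe the click target."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```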
The architecture includes a toggleable reasoning mode that lets developers increase the thinking budget on a per-request basis. Complex tasks like multi-step graphical navigation can draw on deeper compute, while simpler extraction tasks run at base speed. Managing this tradeoff is critical when scaling a hybrid MoE in production environments with strict latency requirements.
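The article does not specify how the toggle is exposed. Some earlier Nemotron releases switched reasoning via a system-prompt flag, so the sketch below assumes that convention; the flag strings and token budgets are guesses to be verified against the model card.

```python
# Assumed mechanism: a system-prompt reasoning flag plus a token budget.
# Neither is confirmed by the source -- verify before relying on this.
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")

def ask(prompt: str, deep_reasoning: bool) -> str:
    system_flag = "/think" if deep_reasoning else "/no_think"  # assumed flags
    response = client.chat.completions.create(
        model="nvidia/nemotron-3-nano-omni",  # assumed slug
        messages=[
            {"role": "system", "content": system_flag},
            {"role": "user", "content": prompt},
        ],
        max_tokens=1024 if deep_reasoning else 256,  # crude budget control
    )
    return response.choices[0].message.content

# Deep compute for multi-step navigation, base speed for simple extraction.
plan = ask("Plan the clicks to export this report as a PDF.", deep_reasoning=True)
field = ask("What is the invoice number on screen?", deep_reasoning=False)
```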
Ecosystem and Availability
NVIDIA released the model weights, datasets, and training recipes under an open license. The model is currently accessible on Hugging Face, OpenRouter, and fal.ai. Enterprise teams can deploy it as an NVIDIA NIM microservice via build.nvidia.com. Amazon SageMaker JumpStart, Palantir, Dell Technologies, and Foxconn have integrated the model for evaluation in corporate workflows.
If your system currently uses separate vision, transcription, and reasoning models, calculate your combined latency and token costs. Replacing that chain with a single 3.6B active parameter model changes the baseline for how responsive your workflow can be. Evaluate the model on your specific domain data to confirm it meets your accuracy floor before deprecating your existing multi-step pipelines.
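A back-of-envelope comparison makes the baseline concrete. Every number below is a placeholder; substitute latency and per-call cost measurements from your own stack.

```python
# Three-model chain versus one unified call. All figures are placeholders.
pipeline = {
    "vision_model":  {"latency_s": 0.8, "cost_per_call": 0.002},
    "transcription": {"latency_s": 1.2, "cost_per_call": 0.004},
    "reasoning_llm": {"latency_s": 2.5, "cost_per_call": 0.010},
}
unified = {"latency_s": 2.0, "cost_per_call": 0.008}  # placeholder

chain_latency = sum(s["latency_s"] for s in pipeline.values())  # serial hops
chain_cost = sum(s["cost_per_call"] for s in pipeline.values())

print(f"chain:   {chain_latency:.1f}s  ${chain_cost:.3f}/request")
print(f"unified: {unified['latency_s']:.1f}s  ${unified['cost_per_call']:.3f}/request")
print(f"latency saved: {chain_latency - unified['latency_s']:.1f}s per request")
```

With measured numbers plugged in, this comparison tells you whether the unified model clears your latency and cost bar before you touch accuracy testing.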