Cosmos 3 Open Omnimodel Merges World Simulation and Action

NVIDIA released Cosmos 3, an open-weight omnimodel that unifies vision reasoning, world simulation, and action prediction for physical AI applications.

At GTC Taipei, NVIDIA released Cosmos 3, an open omnimodel built specifically for physical AI workloads. The system unifies vision reasoning, world simulation, and action prediction in a single architecture. This combined pipeline compresses the training and evaluation cycles for robotics and autonomous systems from months to days.

Mixture-of-Transformers Architecture

Cosmos 3 abandons standard single-model designs in favor of a Mixture-of-Transformers (MoT) architecture. The system uses a dual-tower approach to process multimodal inputs simultaneously.

The Reasoner Tower acts as an autoregressive vision-language model. It interprets text, images, and video to extract motion, spatio-temporal relationships, and object interactions. The Generation Tower is a diffusion-based block that receives context from the reasoner to output physically grounded predictions. These outputs include predictive video sequences and robot-task trajectories.
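The hand-off between the two towers can be pictured with a toy sketch. Everything here is a hypothetical stand-in (the class names, `interpret`, `generate`, and the placeholder data are assumptions for illustration, not the Cosmos 3 API); it only shows the direction of data flow, with the reasoner producing context that the generator consumes.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SceneContext:
    """Hypothetical container for what the reasoner extracts from its inputs."""
    summary: str
    objects: List[str]


@dataclass
class Prediction:
    """Hypothetical container for the generator's outputs."""
    video_frames: List[str]                       # stand-ins for predicted frames
    trajectory: List[Tuple[float, float, float]]  # stand-ins for robot waypoints


class ReasonerTower:
    """Stub for the autoregressive vision-language tower (not the real model)."""

    def interpret(self, prompt: str, frames: List[str]) -> SceneContext:
        # A real reasoner would extract motion and object interactions here.
        return SceneContext(
            summary=f"{prompt} ({len(frames)} frames seen)",
            objects=["gripper", "cup"],
        )


class GenerationTower:
    """Stub for the diffusion-based tower (not the real model)."""

    def generate(self, context: SceneContext, steps: int) -> Prediction:
        # Both outputs are conditioned on the reasoner's context.
        frames = [f"frame_{i}: {context.summary}" for i in range(steps)]
        trajectory = [(float(i), 0.0, 0.1 * i) for i in range(steps)]
        return Prediction(video_frames=frames, trajectory=trajectory)


def predict(prompt: str, frames: List[str], steps: int = 3) -> Prediction:
    """Run the reasoner first, then pass its context to the generator."""
    context = ReasonerTower().interpret(prompt, frames)
    return GenerationTower().generate(context, steps)


result = predict("pick up the cup", ["f0", "f1"], steps=3)
print(len(result.video_frames), len(result.trajectory))
```

In this sketch the generator never sees the raw inputs, only the reasoner's context. That is a simplification chosen to make the split visible, not a statement about how the real towers share information.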

This split architecture allows the model to generate up to 30 seconds of predictive video based on text or visual inputs. Autonomous systems can evaluate the simulated physical consequences of an action before executing it in the real world.
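One way to read "evaluate before executing" is as a simulate-then-select loop: roll out each candidate action in the world model, discard rollouts that end badly, and pick the best of the rest. The sketch below uses a hard-coded lookup in place of a real world model; `simulate_cup_push`, the action names, and the distance values are all invented for illustration.

```python
from typing import Callable, List, Optional, Sequence


def simulate_cup_push(action: str) -> List[float]:
    """Hypothetical world model: predicted distance (metres) of a cup
    from the table edge at successive time steps. Values are made up."""
    rollouts = {
        "push_fast": [0.30, 0.10, -0.05],  # cup ends up past the edge
        "push_slow": [0.30, 0.22, 0.15],
        "wait": [0.30, 0.30, 0.30],
    }
    return rollouts[action]


def is_safe(rollout: Sequence[float]) -> bool:
    """A rollout is safe if the cup never crosses the table edge."""
    return min(rollout) > 0.0


def progress(rollout: Sequence[float]) -> float:
    """How far the cup moved toward the edge over the rollout."""
    return rollout[0] - rollout[-1]


def choose_action(
    candidates: Sequence[str],
    simulate: Callable[[str], List[float]],
) -> Optional[str]:
    """Return the safe candidate with the most progress, or None."""
    best, best_score = None, float("-inf")
    for action in candidates:
        rollout = simulate(action)
        if not is_safe(rollout):
            continue  # reject before anything runs on the real robot
        score = progress(rollout)
        if score > best_score:
            best, best_score = action, score
    return best


print(choose_action(["push_fast", "push_slow", "wait"], simulate_cup_push))
# prints "push_slow"
```

The fastest action is rejected because its predicted rollout drops the cup, which is the kind of consequence a simulated preview is meant to catch.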

Model Variants and Hardware Targets

NVIDIA announced three versions of the model, each targeting a different stage of the robotics development lifecycle. The Edge variant is still upcoming.

| Model | Parameters | Target Hardware | Primary Use Case |
| --- | --- | --- | --- |
| Cosmos 3 Nano | 8B (8B Reasoner + 8B Generator) | NVIDIA RTX PRO 6000 | Efficient workstation inference |
| Cosmos 3 Super | 32B (32B Reasoner + 32B Generator) | NVIDIA Hopper and Blackwell | High-fidelity synthetic data generation |
| Cosmos 3 Edge | Not specified | Edge deployment hardware | Real-time on-device inference |

The heavy Super variant is built for research and synthetic data generation, allowing developers to create training material for smaller models. The upcoming Edge variant will target local robotics deployment environments, similar to the hardware targets for Nemotron 3 Nano 4B.
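For a rough sense of why the variants map to different hardware, a back-of-envelope estimate of weight memory is parameters times bytes per parameter. The sketch below rests on two assumptions that the source does not confirm: that the parenthetical in the parameters column means each tower has the listed size, and that weights are stored in 16-bit precision (2 bytes per parameter). It ignores activations, caches, and runtime overhead.

```python
def weights_gib(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights alone, in GiB.

    bytes_per_param=2 assumes 16-bit weights (an assumption, not a
    published Cosmos 3 detail).
    """
    return params_billion * 1e9 * bytes_per_param / 2**30


# Assumed reading of the table: two towers of equal size per variant.
nano_total = weights_gib(8) + weights_gib(8)
super_total = weights_gib(32) + weights_gib(32)

print(f"Nano weights:  {nano_total:.1f} GiB")
print(f"Super weights: {super_total:.1f} GiB")
```

Under these assumptions the Nano weights come to roughly 30 GiB and the Super weights to roughly four times that, which is consistent with the table pairing Nano with a single workstation GPU and Super with data-center hardware.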

Benchmark Performance

At launch, Cosmos 3 took top positions on several physical AI and multimodal leaderboards.

The model ranked first among open models on VANTAGE-Bench for vision-language reasoning on real-world fixed-camera footage. It also secured the top position on Artificial Analysis leaderboards for both Text-to-Image and Image-to-Video generation without audio. NVIDIA reports additional category leads on PAI-Bench, R-Bench, Physics-IQ, and RoboLab.

Ecosystem Support and Datasets

Alongside the model checkpoints on Hugging Face, NVIDIA released open code, post-training recipes, and six synthetic datasets. These datasets provide immediate training foundations for embodied robot scenes, autonomous driving, warehouse operations, and human motion simulations.

If you previously built workflows around Cosmos Predict 2.5, the new architecture requires updating your inference pipelines to handle the dual-tower MoT outputs.

NVIDIA also formed the NVIDIA Cosmos Coalition to establish deployment standards for physical AI. Launch partners include Agile Robots, Black Forest Labs, Runway, Skild AI, LTX, and Generalist. Early industry adopters actively deploying the model include Samsung, LG Electronics, Li Auto, and Doosan Robotics.

For production environments, Cosmos 3 is packaged as NVIDIA NIM microservices. Development teams can pull the optimized containers to deploy the model immediately across local GPU clusters or cloud infrastructure.
