Cosmos 3 Open Omnimodel Merges World Simulation and Action

At GTC Taipei, NVIDIA released Cosmos 3, an open omnimodel built specifically for physical AI workloads. The system unifies vision reasoning, world simulation, and action prediction in a single architecture. The combined pipeline is meant to compress training and evaluation cycles for robotics and autonomous systems from months to days.

Mixture-of-Transformers Architecture

Cosmos 3 uses a Mixture-of-Transformers (MoT) architecture rather than a single monolithic transformer. Its dual-tower design splits multimodal processing between a reasoning tower and a generation tower.

The Reasoner Tower acts as an autoregressive vision-language model. It interprets text, images, and video to extract motion, spatio-temporal relationships, and object interactions. The Generation Tower is a diffusion-based block that receives context from the reasoner to output physically grounded predictions. These outputs include predictive video sequences and robot-task trajectories.
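The hand-off between the two towers can be illustrated with a toy sketch. Everything in it is hypothetical: the class names, fields, and stub logic are placeholders invented for illustration, not the Cosmos 3 API or its internals. The only thing taken from the description above is the ordering, in which the reasoner produces context and the generator consumes it to emit video frames and a trajectory.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ReasonerContext:
    """Hypothetical container for what the reasoner extracts from its inputs."""
    objects: List[str]
    instruction: str


@dataclass
class GenerationOutput:
    """Hypothetical container for the generator's two output types."""
    video_frames: List[str]
    trajectory: List[Tuple[float, float]]


class ToyReasonerTower:
    """Stand-in for the autoregressive vision-language tower (stub logic only)."""

    def __call__(self, text: str, frame_captions: List[str]) -> ReasonerContext:
        # A real reasoner would interpret pixels; this stub just collects
        # the distinct words found in pre-written frame captions.
        objects = sorted({word for cap in frame_captions for word in cap.split()})
        return ReasonerContext(objects=objects, instruction=text)


class ToyGenerationTower:
    """Stand-in for the diffusion-based tower (stub logic only)."""

    def __call__(self, ctx: ReasonerContext, steps: int) -> GenerationOutput:
        # A real generator would run iterative denoising conditioned on ctx;
        # this stub emits placeholder frames and a straight-line trajectory.
        frames = [f"frame {i}: {ctx.instruction}" for i in range(steps)]
        trajectory = [(float(i), 0.0) for i in range(steps)]
        return GenerationOutput(video_frames=frames, trajectory=trajectory)


def run_pipeline(text: str, frame_captions: List[str], steps: int = 3) -> GenerationOutput:
    """Run the reasoner first, then the generator conditioned on its context."""
    ctx = ToyReasonerTower()(text, frame_captions)
    return ToyGenerationTower()(ctx, steps)


result = run_pipeline("pick up the cup", ["cup table", "cup gripper"])
print(len(result.video_frames), result.trajectory)
```

The point of the sketch is the interface boundary: the generator never sees raw inputs, only the reasoner's context object, which is one plausible way to read the dual-tower description.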

This split architecture allows the model to generate up to 30 seconds of predictive video based on text or visual inputs. Autonomous systems can evaluate the simulated physical consequences of an action before executing it in the real world.
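The "simulate before acting" idea can be sketched as a simple action-selection loop. This is a minimal illustration under stated assumptions: `toy_world_model` is a hypothetical stand-in for a learned world model (it moves a 2D point, it does not generate video), and the cost function, obstacle, and candidate actions are invented for the example. It is not how Cosmos 3 is invoked.

```python
from typing import Callable, List, Optional, Sequence, Tuple

State = Tuple[float, float]   # (x, y) position of a toy robot
Action = Tuple[float, float]  # (dx, dy) displacement applied at each step


def toy_world_model(state: State, action: Action, horizon: int) -> List[State]:
    """Hypothetical stand-in for a world model: roll the state forward."""
    x, y = state
    rollout = []
    for _ in range(horizon):
        x, y = x + action[0], y + action[1]
        rollout.append((x, y))
    return rollout


def rollout_cost(rollout: Sequence[State], goal: State,
                 obstacle: State, radius: float) -> float:
    """Final distance to the goal, or infinity if any predicted state collides."""
    for x, y in rollout:
        if (x - obstacle[0]) ** 2 + (y - obstacle[1]) ** 2 < radius ** 2:
            return float("inf")
    fx, fy = rollout[-1]
    return ((fx - goal[0]) ** 2 + (fy - goal[1]) ** 2) ** 0.5


def choose_action(state: State, candidates: Sequence[Action], goal: State,
                  obstacle: State, radius: float = 0.5, horizon: int = 3,
                  model: Callable[[State, Action, int], List[State]] = toy_world_model,
                  ) -> Optional[Action]:
    """Simulate every candidate and return the cheapest safe one, if any."""
    scored = [(rollout_cost(model(state, a, horizon), goal, obstacle, radius), a)
              for a in candidates]
    best_cost, best_action = min(scored, key=lambda pair: pair[0])
    # If every simulated rollout collides, refuse to act.
    return None if best_cost == float("inf") else best_action


candidates = [(1.0, 0.0), (1.0, 0.5), (0.0, 1.0)]
chosen = choose_action((0.0, 0.0), candidates, goal=(3.0, 0.0), obstacle=(2.0, 0.0))
print(chosen)
```

Here the direct path `(1.0, 0.0)` reaches the goal but passes through the obstacle in simulation, so the selector falls back to the next-best candidate without ever executing the unsafe action.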

Model Variants and Hardware Targets

NVIDIA announced three versions of the model, each targeting a different stage of the robotics development lifecycle.

Model	Parameters	Target Hardware	Primary Use Case
Cosmos 3 Nano	8B (8B Reasoner + 8B Generator)	NVIDIA RTX PRO 6000	Efficient workstation inference
Cosmos 3 Super	32B (32B Reasoner + 32B Generator)	NVIDIA Hopper and Blackwell	High-fidelity synthetic data generation
Cosmos 3 Edge	Not specified	Edge deployment hardware	Real-time on-device inference

The larger Super variant is built for research and synthetic data generation, letting developers create training material for smaller models. The upcoming Edge variant will target on-device robotics deployment, similar to the hardware targets for Nemotron 3 Nano 4B.

Benchmark Performance

At launch, Cosmos 3 established new baselines across physical AI and multimodal leaderboards.

The model ranked first among open models on VANTAGE-Bench for vision-language reasoning on real-world fixed-camera footage. It also secured the top position on Artificial Analysis leaderboards for both Text-to-Image and Image-to-Video generation without audio. NVIDIA reports additional category leads on PAI-Bench, R-Bench, Physics-IQ, and RoboLab.

Ecosystem Support and Datasets

Alongside the model checkpoints on Hugging Face, NVIDIA released open code, post-training recipes, and six synthetic datasets. These datasets provide immediate training foundations for embodied robot scenes, autonomous driving, warehouse operations, and human motion simulations.

If you previously built workflows around Cosmos Predict 2.5, the new architecture requires updating your inference pipelines to handle the dual-tower MoT outputs.

NVIDIA also formed the NVIDIA Cosmos Coalition to establish deployment standards for physical AI. Launch partners include Agile Robots, Black Forest Labs, Runway, Skild AI, LTX, and Generalist. Early adopters deploying the model include Samsung, LG Electronics, Li Auto, and Doosan Robotics.

For production environments, Cosmos 3 is packaged as NVIDIA NIM microservices. Development teams can pull the optimized containers to deploy the model immediately across local GPU clusters or cloud infrastructure.

