NVIDIA Demos Gemma 4 VLA on $249 Jetson Orin Nano Super
NVIDIA showcased Google's Gemma 4 VLA running natively on the Jetson Orin Nano Super using NVFP4 quantization and a new 25W hardware performance mode.
NVIDIA researcher Asier Arranz published a demo running Google’s Gemma 4 VLA model on the new NVIDIA Jetson Orin Nano Super. This release integrates the 2.2-billion-parameter Gemma-4-E2B vision-language-action model into a real-time robotic control workflow. By leveraging new hardware power limits and 4-bit floating-point quantization, the demo executes complex multimodal reasoning and robotic control directly on a $249 edge device.
Hardware Specifications and Super Mode
The demonstration relies on the Jetson Orin Nano Super, an updated configuration of the standard 8GB board. A new Super Mode, enabled via JetPack 6.2, increases the board’s power target from 15W to 25W. This thermal and power adjustment drives significant clock speed increases across the system-on-chip.
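On a Jetson board, power modes are selected through the standard `nvpmodel` tooling. A minimal sketch of enabling the 25W profile; the mode index used here is an assumption, since indices and mode names vary by JetPack release and board, so confirm against `/etc/nvpmodel.conf` or the output of `nvpmodel -q` before running:

```shell
# Show the currently active power mode
sudo nvpmodel -q

# Select the 25W MAXN SUPER profile. The index (2) is an assumption --
# check /etc/nvpmodel.conf on your board for the correct mode number.
sudo nvpmodel -m 2

# Optionally lock clocks at the maximum for the selected mode
sudo jetson_clocks
```

A reboot prompt may appear when switching between power modes; the new clocks take effect once the mode is active.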
| Component | Base Clock | Super Mode Clock |
|---|---|---|
| GPU | 625MHz | 1,020MHz |
| CPU | 1.5GHz | 1.7GHz |
The hardware delivers 67 sparse INT8 TOPS (33 dense TOPS) and features 8GB of LPDDR5 memory. Memory bandwidth reaches 102 GB/s, a 1.5x increase over the previous generation's 68 GB/s. This throughput supports the high-frequency control loops required in physical environments.
Gemma 4 Architecture for Edge Robotics
The demo utilizes a specialized VLA adaptation of the Gemma-4-E2B model. This variant processes native image, video, and audio inputs while outputting specific control tokens for physical manipulation. The model retains the ability to generate structured JSON output, making it highly reliable for downstream robotic parsing.
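Structured JSON output matters because the robot-side code can validate every command before actuating. A minimal sketch of such a validation step; the schema (`action`, `delta_xyz`) and the 5 cm safety bound are illustrative assumptions, not the actual control-token format of the demo:

```python
import json

def parse_action(raw: str) -> dict:
    """Validate a hypothetical VLA action payload before actuation.

    The schema ("action", "delta_xyz") is illustrative -- real control
    tokens depend on how the VLA head was trained.
    """
    action = json.loads(raw)
    if action.get("action") not in {"move", "grasp", "release"}:
        raise ValueError(f"unknown action: {action.get('action')}")
    delta = action.get("delta_xyz", [0.0, 0.0, 0.0])
    if len(delta) != 3 or any(abs(v) > 0.05 for v in delta):
        raise ValueError("per-step displacement exceeds 5 cm safety bound")
    return action

# A well-formed model output passes; a malformed one raises before
# anything reaches the actuators.
cmd = parse_action('{"action": "move", "delta_xyz": [0.01, 0.0, -0.02]}')
```

Rejecting out-of-schema output at this boundary is what makes JSON-emitting models "highly reliable for downstream robotic parsing" in practice: a hallucinated field fails fast instead of moving a motor.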
Gemma 4 employs a hybrid attention architecture to manage context efficiently. The model alternates between local sliding-window attention, set to 512 tokens, and global full-context attention layers. Combined with Dual RoPE and a Shared KV Cache, this design minimizes memory overhead during inference. Distilling the multi-step reasoning capabilities from the larger 31B and 26B Mixture-of-Experts versions into the 2.2B footprint allows the model to maintain strong planning capabilities without exceeding the 8GB memory limit.
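The memory benefit of the hybrid design is easy to see with a back-of-the-envelope KV-cache count. A sketch under assumed parameters; the layer count and the local-to-global ratio (five sliding-window layers per full-context layer) are assumptions here, not confirmed Gemma-4-E2B internals:

```python
def kv_cache_tokens(seq_len: int, n_layers: int, window: int = 512,
                    global_every: int = 6) -> int:
    """Estimate total cached tokens for a hybrid local/global stack.

    Assumes (global_every - 1) sliding-window layers per full-context
    layer -- the real ratio in Gemma-4-E2B is an assumption.
    """
    total = 0
    for layer in range(n_layers):
        if (layer + 1) % global_every == 0:
            total += seq_len               # global layer caches the full context
        else:
            total += min(seq_len, window)  # local layer keeps only its window
    return total

full = 8192 * 30              # all-global baseline: 30 layers, 8K context
hybrid = kv_cache_tokens(8192, 30)
```

Under these assumptions the hybrid stack caches roughly a quarter of the tokens an all-global model would, which is the margin that keeps long-context inference inside an 8GB budget alongside the weights.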
Inference Optimization and Implementation
Running a 2.2-billion-parameter multimodal model on constrained edge hardware requires aggressive optimization. The provided Gemma4_vla.py script executes the model using NVFP4 quantization, a 4-bit floating-point format applied via the NVIDIA Model Optimizer. NVFP4 maintains near-8-bit accuracy while reducing the memory footprint and lowering latency, a tradeoff that matters when quantization error propagates into physical control.
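The core mechanic of a 4-bit floating-point format can be shown in a few lines. A simplified fake-quantization sketch: real NVFP4 uses small micro-blocks with FP8 scale factors, whereas here a single full-precision scale per block is used for clarity, so the numbers are illustrative only:

```python
# Representable magnitudes of the FP4 E2M1 format (sign handled separately)
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Fake-quantize one block of weights to a scaled FP4 grid.

    Simplified: the scale is kept in full precision, while production
    NVFP4 stores per-micro-block FP8 scales.
    """
    scale = max(abs(v) for v in block) / 6.0 or 1.0
    out = []
    for v in block:
        mag = min(E2M1, key=lambda g: abs(abs(v) / scale - g))
        out.append((mag if v >= 0 else -mag) * scale)
    return out

w = [0.31, -1.8, 0.02, 0.9]
wq = quantize_block(w)   # every value snapped to a scaled 4-bit grid point
```

Each weight collapses onto one of sixteen signed grid points, which is where the 4x memory reduction over FP16 comes from; the per-block scale is what keeps the error "near 8-bit" despite the coarse grid.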
The software stack uses NVIDIA TensorRT-LLM for accelerated local execution. The script pulls in dedicated speech-to-text and text-to-speech models to close the loop, enabling full voice-command capability alongside visual processing. For teams adapting this logic to other hardware platforms, the stack remains compatible with NVIDIA NeMo for local fine-tuning. If you plan to run LLMs locally for autonomous physical agents, getting this quantization pipeline configured correctly is a hard prerequisite.
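The shape of that voice-to-action loop can be sketched with stubs standing in for each stage. None of the function names below come from NVIDIA's script; they are placeholders for the TensorRT-LLM-backed STT, VLA, and actuation components that Gemma4_vla.py actually wires together:

```python
import json

# Stubs standing in for the real STT and VLA engines -- hypothetical
# names, not NVIDIA's API.
def speech_to_text(audio: bytes) -> str:
    return "pick up the red block"

def vla_infer(instruction: str, frame: bytes) -> str:
    # A real implementation would run the quantized Gemma-4-E2B engine
    # on the camera frame plus instruction and decode JSON action tokens.
    return json.dumps({"action": "grasp", "target": "red_block"})

def control_step(audio: bytes, frame: bytes) -> dict:
    """One iteration of the voice-to-action loop: STT -> VLA -> parse."""
    instruction = speech_to_text(audio)
    return json.loads(vla_infer(instruction, frame))

step = control_step(b"<mic samples>", b"<camera frame>")
```

Keeping every stage on-device is what makes the loop viable for control: there is no network round trip between hearing a command and emitting an action.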
Edge AI deployments require tight alignment between model size, memory bandwidth, and quantization strategy. Review the specific NVFP4 quantization parameters used in the TensorRT-LLM build scripts to understand how inference accuracy is preserved for your own VLA control schemas.