NVIDIA Demos Gemma 4 VLA on $249 Jetson Orin Nano Super
NVIDIA showcased Google's Gemma 4 VLA running natively on the Jetson Orin Nano Super using NVFP4 quantization and a new 25W hardware performance mode.
NVIDIA researcher Asier Arranz published a demo running Google’s Gemma 4 VLA model on the new NVIDIA Jetson Orin Nano Super. This release integrates the 2.2-billion-parameter Gemma-4-E2B vision-language-action model into a real-time robotic control workflow. By leveraging new hardware power limits and 4-bit floating-point quantization, the demo executes complex multimodal reasoning and robotic control directly on a $249 edge device.
Hardware Specifications and Super Mode
The demonstration relies on the Jetson Orin Nano Super, an updated configuration of the standard 8GB board. A new Super Mode, enabled via JetPack 6.2, increases the board’s power target from 15W to 25W. This thermal and power adjustment drives significant clock speed increases across the system-on-chip.
| Component | Base Clock | Super Mode Clock |
|---|---|---|
| GPU | 625MHz | 1,020MHz |
| CPU | 1.5GHz | 1.7GHz |
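The hardware delivers 67 Sparse INT8 TOPS (33 Dense TOPS) and features 8GB of LPDDR5 memory. Memory bandwidth reaches 102 GB/s, a 1.5x increase over the 68 GB/s of the original Orin Nano 8GB configuration; the 1.7x figure that NVIDIA quotes applies to AI compute (67 versus 40 Sparse TOPS). Bandwidth matters here because token generation on small models is usually limited by how quickly weights can be read from memory, which in turn bounds how often a control loop can receive a fresh action.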
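As a rough sanity check on these figures, the sketch below computes the clock ratios from the table and a memory-bound ceiling on decode speed. The ceiling assumes every generated token requires streaming all model weights once, and the 1.1 GB weight size is an assumption (2.2 billion parameters at 4 bits each, ignoring scale factors, KV cache, activations, and any vision or audio encoder), not a measured value.

```python
# Figures taken from the article's table and text.
GPU_BASE_MHZ, GPU_SUPER_MHZ = 625, 1020
CPU_BASE_GHZ, CPU_SUPER_GHZ = 1.5, 1.7
BANDWIDTH_GB_S = 102

# Assumption: 2.2e9 weights stored at 4 bits each, nothing else in memory traffic.
ASSUMED_WEIGHT_GB = 2.2e9 * 4 / 8 / 1e9


def speedup(base, boosted):
    """Ratio of the Super Mode clock to the base clock."""
    return boosted / base


def decode_ceiling_tokens_per_s(bandwidth_gb_s, weight_gb):
    """Upper bound on tokens/s if each token streams all weights from memory once."""
    return bandwidth_gb_s / weight_gb


if __name__ == "__main__":
    print(f"GPU clock speedup: {speedup(GPU_BASE_MHZ, GPU_SUPER_MHZ):.2f}x")
    print(f"CPU clock speedup: {speedup(CPU_BASE_GHZ, CPU_SUPER_GHZ):.2f}x")
    ceiling = decode_ceiling_tokens_per_s(BANDWIDTH_GB_S, ASSUMED_WEIGHT_GB)
    print(f"Memory-bound decode ceiling: {ceiling:.0f} tokens/s")
```

Under these assumptions the GPU clock rises by about 1.63x, the CPU clock by about 1.13x, and the theoretical ceiling lands near 93 tokens per second. Real throughput will be lower once attention, encoders, and runtime overhead are included.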
Gemma 4 Architecture for Edge Robotics
The demo uses a VLA adaptation of the Gemma-4-E2B model. This variant accepts image, video, and audio inputs natively and emits control tokens for physical manipulation. It can also generate structured JSON, so downstream robot code can parse and validate its actions with a standard JSON parser instead of scraping free-form text.
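To illustrate why structured JSON helps on the robot side, here is a minimal parsing sketch. The schema (`action`, `target`, `speed`), the allowed action names, and the `RobotCommand` type are all hypothetical; the article does not describe the demo's actual output format.

```python
import json
from dataclasses import dataclass

# Hypothetical action vocabulary; the demo's real control schema is not documented here.
ALLOWED_ACTIONS = {"move", "grasp", "release", "stop"}


@dataclass(frozen=True)
class RobotCommand:
    action: str
    target: str
    speed: float  # normalized to the range 0.0..1.0


def parse_command(raw: str) -> RobotCommand:
    """Parse and validate one JSON command emitted by the model."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    action = data.get("action")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")

    target = data.get("target", "")
    if not isinstance(target, str):
        raise ValueError("target must be a string")

    speed = data.get("speed", 0.0)
    if isinstance(speed, bool) or not isinstance(speed, (int, float)):
        raise ValueError("speed must be a number")
    # Clamp rather than reject, so an over-eager model cannot exceed the limit.
    speed = min(max(float(speed), 0.0), 1.0)

    return RobotCommand(action=action, target=target, speed=speed)


if __name__ == "__main__":
    print(parse_command('{"action": "grasp", "target": "red cube", "speed": 1.7}'))
```

Rejecting unknown actions and clamping numeric ranges before anything reaches the motor controller keeps a malformed generation from turning into an unsafe motion.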
Gemma 4 employs a hybrid attention architecture to manage context efficiently. The model alternates between local sliding-window attention, set to 512 tokens, and global full-context attention layers. Combined with Dual RoPE and a Shared KV Cache, this design minimizes memory overhead during inference. Distilling the multi-step reasoning capabilities from the larger 31B and 26B Mixture-of-Experts versions into the 2.2B footprint allows the model to maintain strong planning capabilities without exceeding the 8GB memory limit.
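The memory effect of mixing local and global layers can be sketched in a few lines. The 512-token window comes from the text above; the layer count and the ratio of local to global layers (`NUM_LAYERS`, `GLOBAL_EVERY`) are placeholder assumptions, since the article does not state them.

```python
SLIDING_WINDOW = 512  # from the article
GLOBAL_EVERY = 6      # assumption: every 6th layer uses global attention
NUM_LAYERS = 12       # assumption: illustrative layer count only


def is_global_layer(layer_idx, global_every=GLOBAL_EVERY):
    """True if this layer attends over the full context."""
    return (layer_idx + 1) % global_every == 0


def can_attend(query_pos, key_pos, window=None):
    """Causal mask; with a window, only the most recent `window` tokens are visible."""
    if key_pos > query_pos:
        return False
    if window is None:
        return True
    return query_pos - key_pos < window


def cached_tokens(context_len, num_layers=NUM_LAYERS,
                  window=SLIDING_WINDOW, global_every=GLOBAL_EVERY):
    """Total token slots the KV cache must hold, summed over layers."""
    total = 0
    for layer in range(num_layers):
        if is_global_layer(layer, global_every):
            total += context_len               # grows with the context
        else:
            total += min(context_len, window)  # capped at the window size
    return total


if __name__ == "__main__":
    context = 8192
    hybrid = cached_tokens(context)
    full = NUM_LAYERS * context
    print(f"hybrid cache slots: {hybrid}, all-global: {full}, ratio: {hybrid / full:.2f}")
```

With these placeholder numbers, an 8,192-token context needs about 22% of the cache slots that an all-global stack would, because local layers never hold more than 512 tokens each.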
Inference Optimization and Implementation
Running a 2.2B-parameter multimodal model on constrained edge hardware requires aggressive optimization. The provided Gemma4_vla.py script runs the model with NVFP4 quantization, a 4-bit floating-point format applied via the NVIDIA Model Optimizer. The format is reported to maintain near 8-bit accuracy while reducing the memory footprint and lowering latency, a trade-off that matters in edge robotics because any precision loss shows up directly in the control output.
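To make the precision trade-off concrete, the following is a simplified sketch of block-scaled 4-bit floating-point quantization in the style of NVFP4. It is not the NVIDIA Model Optimizer implementation: it uses the E2M1 value grid and 16-element blocks, but keeps each block scale as an ordinary Python float, whereas the real format stores block scales in FP8 alongside a per-tensor scale.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16
# 4 bits per value plus one 8-bit scale shared by each block of 16 values.
BITS_PER_WEIGHT = 4 + 8 / BLOCK_SIZE


def quantize_block(values):
    """Return (scale, codes) where each code is a signed E2M1 grid value."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 1.0, [0.0] * len(values)
    scale = amax / E2M1_MAGNITUDES[-1]  # map the largest magnitude onto 6.0
    codes = []
    for v in values:
        target = abs(v) / scale
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(target - m))
        codes.append(math.copysign(nearest, v))
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate values from a block scale and its codes."""
    return [scale * c for c in codes]


def quantize(values):
    """Quantize a flat list of weights block by block."""
    return [quantize_block(values[i:i + BLOCK_SIZE])
            for i in range(0, len(values), BLOCK_SIZE)]


if __name__ == "__main__":
    weights = [i / 10 - 0.8 for i in range(16)]
    scale, codes = quantize_block(weights)
    restored = dequantize_block(scale, codes)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"scale={scale:.4f}, worst-case error={worst:.4f}")
```

Because the grid is coarser near the top of the range (the gap between 4 and 6 is four times the gap between 0 and 0.5), the largest errors fall on the largest weights in each block. Small blocks limit how far a single outlier can stretch the scale.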
The software stack uses NVIDIA TensorRT-LLM for accelerated local execution. The script also loads dedicated Speech-to-Text and Text-to-Speech models to complete the loop, enabling voice commands alongside visual processing. For teams adapting this logic to specific hardware platforms, the stack remains compatible with NVIDIA NeMo for local fine-tuning. Teams planning to run LLMs locally for autonomous physical agents will need to set up this quantization pipeline first.
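The voice loop described above can be pictured as three stages chained per turn. This is a structural sketch only: `run_turn` and its `transcribe`, `plan`, and `speak` parameters are hypothetical stand-ins, not functions from Gemma4_vla.py or any NVIDIA library.

```python
from typing import Callable, Tuple


def run_turn(
    audio: bytes,
    frame: bytes,
    transcribe: Callable[[bytes], str],   # hypothetical Speech-to-Text stage
    plan: Callable[[str, bytes], str],    # hypothetical VLA stage: text + camera frame in, reply out
    speak: Callable[[str], bytes],        # hypothetical Text-to-Speech stage
) -> Tuple[str, bytes]:
    """Run one voice-command turn and return the model reply plus synthesized audio."""
    command_text = transcribe(audio)
    model_reply = plan(command_text, frame)
    return model_reply, speak(model_reply)


if __name__ == "__main__":
    # Stub stages so the sketch runs without any models installed.
    reply, spoken = run_turn(
        audio=b"<microphone samples>",
        frame=b"<camera frame>",
        transcribe=lambda a: "pick up the cube",
        plan=lambda text, f: f"ok: {text}",
        speak=lambda text: text.encode("utf-8"),
    )
    print(reply, len(spoken))
```

Passing the stages in as callables keeps each model swappable, which is convenient when testing the loop on a workstation before deploying to the Jetson.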
Edge AI deployments require tight alignment between model size, memory bandwidth, and quantization strategy. Review the specific NVFP4 quantization parameters used in the TensorRT-LLM build scripts to understand how inference accuracy is preserved for your own VLA control schemas.