NVIDIA Demos Gemma 4 VLA on $249 Jetson Orin Nano Super

NVIDIA researcher Asier Arranz published a demo running Google’s Gemma 4 VLA model on the new NVIDIA Jetson Orin Nano Super. This release integrates the 2.2-billion-parameter Gemma-4-E2B vision-language-action model into a real-time robotic control workflow. By leveraging new hardware power limits and 4-bit floating-point quantization, the demo executes complex multimodal reasoning and robotic control directly on a $249 edge device.

Hardware Specifications and Super Mode

The demonstration relies on the Jetson Orin Nano Super, an updated configuration of the standard 8GB board. A new Super Mode, enabled via JetPack 6.2, increases the board’s power target from 15W to 25W. This thermal and power adjustment drives significant clock speed increases across the system-on-chip.

Component	Base Clock	Super Mode Clock
GPU	625MHz	1,020MHz
CPU	1.5GHz	1.7GHz

The hardware delivers 67 Sparse INT8 TOPS (33 Dense TOPS) and features 8GB of LPDDR5 memory. Memory bandwidth reaches 102 GB/s, a 1.5x increase over the 68 GB/s of the original Orin Nano configuration; the 1.7x figure NVIDIA quotes refers to AI compute performance. This bandwidth provides the throughput needed for high-frequency control loops in physical environments.
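A back-of-the-envelope bound shows why memory bandwidth matters for control-loop latency. The sketch below assumes that generating each token requires reading every weight once from memory, and it ignores KV-cache and activation traffic, so the result is an optimistic ceiling and not a measured figure. The function name is illustrative; the 102 GB/s and 2.2B values come from the article.

```python
def decode_tokens_per_second_bound(bandwidth_gb_s: float,
                                   params_billions: float,
                                   bits_per_weight: float) -> float:
    """Upper bound on tokens/s when decoding is limited by weight reads."""
    # Billions of parameters * bits per weight / 8 gives gigabytes of weights.
    weight_gb = params_billions * bits_per_weight / 8
    return bandwidth_gb_s / weight_gb


if __name__ == "__main__":
    for bits in (16, 8, 4):
        bound = decode_tokens_per_second_bound(102, 2.2, bits)
        print(f"{bits:>2}-bit weights: at most ~{bound:.0f} tokens/s")
```

Under these assumptions, 4-bit weights raise the ceiling to roughly 93 tokens per second, compared with roughly 23 at 16 bits. Real throughput will be lower once vision, audio, and KV-cache traffic are included.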

Gemma 4 Architecture for Edge Robotics

The demo uses a VLA adaptation of the Gemma-4-E2B model. This variant accepts image, video, and audio inputs natively and emits control tokens for physical manipulation. The model also retains the ability to generate structured JSON output, which gives downstream robot-control code a predictable format to parse and validate.
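Structured output is only useful if the receiving code rejects anything malformed before it reaches an actuator. The following is a minimal sketch of that validation step using Python's standard json module. The command vocabulary, the "target" field, and the safe-stop fallback are all hypothetical; the demo's actual schema is not described in the article.

```python
import json

# Hypothetical command vocabulary; not taken from the demo.
ALLOWED_ACTIONS = {"move", "grasp", "release", "stop"}
SAFE_STOP = {"action": "stop"}


def parse_command(raw: str) -> dict:
    """Return a validated command dict, or a safe stop if validation fails."""
    try:
        msg = json.loads(raw)
    except json.JSONDecodeError:
        return dict(SAFE_STOP, reason="invalid JSON")
    if not isinstance(msg, dict):
        return dict(SAFE_STOP, reason="not a JSON object")
    action = msg.get("action")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return dict(SAFE_STOP, reason="unknown action")
    target = msg.get("target", [])
    # Require a list of plain numbers; bool is excluded because it subclasses int.
    if not isinstance(target, list) or not all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in target
    ):
        return dict(SAFE_STOP, reason="bad target")
    return {"action": action, "target": [float(v) for v in target]}
```

Falling back to a stop command on any parse failure is one conservative design choice; a real controller might instead hold the last valid command or request a retry.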

Gemma 4 employs a hybrid attention architecture to manage context efficiently. The model alternates between local sliding-window attention, set to 512 tokens, and global full-context attention layers. Combined with Dual RoPE and a Shared KV Cache, this design minimizes memory overhead during inference. Distilling the multi-step reasoning capabilities from the larger 31B and 26B Mixture-of-Experts versions into the 2.2B footprint allows the model to maintain strong planning capabilities without exceeding the 8GB memory limit.
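One way to picture the hybrid attention scheme is to build both mask types and count how many tokens each layer has to keep in its KV cache. This is a minimal sketch, not the model's implementation: the 512-token window comes from the article, while the eight-layer stack and the strict one-to-one alternation in the example are assumptions for illustration. Dual RoPE and the Shared KV Cache are not modelled.

```python
def attention_mask(seq_len, window=None):
    """mask[q][k] is True when query position q may attend to key position k.

    window=None gives a global causal mask; an integer limits each query
    to the most recent `window` positions (sliding-window attention).
    """
    return [
        [k <= q and (window is None or q - k < window) for k in range(seq_len)]
        for q in range(seq_len)
    ]


def cached_tokens(seq_len, layer_kinds, window=512):
    """Total KV-cache entries (counted in tokens) summed over all layers."""
    return sum(
        min(seq_len, window) if kind == "local" else seq_len
        for kind in layer_kinds
    )


if __name__ == "__main__":
    layers = ["local", "global"] * 4  # hypothetical 8-layer stack
    hybrid = cached_tokens(4096, layers)
    full = cached_tokens(4096, ["global"] * len(layers))
    print(f"hybrid: {hybrid} cached tokens, all-global: {full}")
```

Local layers stop growing their cache once the context exceeds the window, so the saving increases with sequence length.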

Inference Optimization and Implementation

Running a 2.2B-parameter multimodal model on constrained edge hardware requires aggressive optimization. The provided Gemma4_vla.py script executes the model using NVFP4 quantization. This 4-bit floating-point format, applied via the NVIDIA Model Optimizer, maintains near 8-bit accuracy while reducing the memory footprint and lowering latency.
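To make the idea of 4-bit floating-point quantization concrete, the sketch below rounds a small block of weights to the eight magnitudes representable in an E2M1 4-bit float and then restores them with a shared per-block scale. It is an illustration of block-scaled FP4 rounding under stated assumptions, not the Model Optimizer's code path: the function names are hypothetical, and the scale is kept as an ordinary Python float, whereas NVFP4 as publicly described stores an FP8 scale for each 16-value block.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block(block):
    """Return (scale, codes) where each code is a signed E2M1 grid value."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Map the largest magnitude in the block onto the largest grid value.
    scale = amax / E2M1_GRID[-1]
    codes = []
    for x in block:
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        codes.append(-mag if x < 0 else mag)
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate weights from the scale and grid codes."""
    return [scale * c for c in codes]


if __name__ == "__main__":
    weights = [0.02, -0.11, 0.37, 0.9, -1.2, 0.05]
    scale, codes = quantize_block(weights)
    restored = dequantize_block(scale, codes)
    for w, r in zip(weights, restored):
        print(f"{w:+.3f} -> {r:+.3f} (error {abs(w - r):.3f})")
```

Because each block carries its own scale, the effective cost is slightly above 4 bits per weight; with one 8-bit scale per 16 values it works out to about 4.5 bits.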

The software stack uses NVIDIA TensorRT-LLM for accelerated local execution. The script also loads dedicated Speech-to-Text and Text-to-Speech models to complete the loop, enabling voice commands alongside visual processing. For teams adapting this logic to specific hardware platforms, the stack remains compatible with NVIDIA NeMo for local fine-tuning. If you plan to run LLMs locally for autonomous physical agents, treat this quantization pipeline as a prerequisite, not an optional tuning step.

Edge AI deployments require tight alignment between model size, memory bandwidth, and quantization strategy. Review the specific NVFP4 quantization parameters used in the TensorRT-LLM build scripts to understand how inference accuracy is preserved for your own VLA control schemas.
