AWS SageMaker adds NVIDIA Blackwell G7e inference instances
Amazon SageMaker AI now offers G7e instances on NVIDIA RTX PRO 6000 Blackwell GPUs, with 96 GB of memory per GPU and up to 2.3x faster inference than G6e.
Amazon Web Services has launched G7e instances on Amazon SageMaker AI, upgrading its inference infrastructure to NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs. Compared with the previous G6e family, G7e doubles the memory available on each GPU and adds native support for FP4 precision. For teams operating high-traffic applications, the architectural changes lower the baseline cost per million tokens while reducing the need for complex multi-GPU setups.
Hardware Specifications
The G7e family shifts the underlying architecture from NVIDIA L40S to the Blackwell generation. This introduces fifth-generation Tensor Cores and fourth-generation Ray Tracing cores. Each GPU features 96 GB of GDDR7 memory.
Memory bandwidth rises to 1,597 GB/s per GPU, a 1.85x increase over the G6e line. Networking throughput also increases: instances now support up to 1,600 Gbps via Elastic Fabric Adapter (EFA), four times the previous generation's maximum.
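To see why the bandwidth figure matters, a common rule of thumb treats single-stream token generation as memory-bound: every generated token requires reading the model weights once, so the ceiling on tokens per second is roughly bandwidth divided by weight size. The sketch below applies that rule to the per-GPU bandwidth figures quoted in this article. The 40 GB weight size is an illustrative assumption, and real throughput also depends on batching, KV-cache traffic, and kernel efficiency.

```python
def decode_tokens_per_s_ceiling(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Rough upper bound on single-stream decode speed when weight reads dominate."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb


G6E_BANDWIDTH_GB_S = 864.0   # approximate L40S figure quoted in this article
G7E_BANDWIDTH_GB_S = 1597.0  # Blackwell figure quoted in this article
WEIGHTS_GB = 40.0            # assumption: a model small enough to fit on either GPU

g6e_ceiling = decode_tokens_per_s_ceiling(G6E_BANDWIDTH_GB_S, WEIGHTS_GB)
g7e_ceiling = decode_tokens_per_s_ceiling(G7E_BANDWIDTH_GB_S, WEIGHTS_GB)

print(f"G6e ceiling: {g6e_ceiling:.1f} tokens/s")
print(f"G7e ceiling: {g7e_ceiling:.1f} tokens/s")
print(f"Ratio: {g7e_ceiling / g6e_ceiling:.2f}x")
```

Under this simplification the ratio between the two ceilings equals the bandwidth ratio regardless of the weight size chosen, which is why the 1.85x figure is a reasonable first guess for memory-bound workloads.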
The hardware natively supports FP4 precision. Because 4-bit quantized weights run directly on the Tensor Cores, large foundation models gain throughput and occupy less memory.
Deployment Capabilities
The expanded memory capacity alters the deployment math for medium to large parameter models. Models that previously required tensor parallelism across multiple GPUs can now fit on a single GPU.
The base G7e.2xlarge provides a single GPU capable of running a 35B-parameter model like Qwen3.5-35B-A3B. With FP8 precision, the same single-GPU instance can host 70B-parameter models. If you deploy multi-agent systems that rely on numerous smaller specialized models, the single-node footprint simplifies orchestration and cuts cross-node communication delays.
Scaling up, the G7e.24xlarge offers four GPUs to support models up to 150B parameters. The top-tier G7e.48xlarge configuration provides eight GPUs with 768 GB of aggregate memory. This accommodates 300B parameter foundation models. AWS specifically optimized SageMaker AI support for architectures like GPT-OSS-120B and the NVFP4 variant of Nemotron-3-Super-120B-A12B.
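The sizing claims above can be sanity-checked with a weight-only estimate: parameter count times bytes per parameter, compared against 96 GB per GPU. This is a minimal sketch, not AWS's sizing method. The article does not state which precision sits behind the 35B, 150B, and 300B figures, so BF16 is assumed here, and the 0.5 bytes used for FP4 is nominal (block-scaled formats carry some extra overhead). KV cache and activations need additional headroom that this estimate ignores.

```python
GPU_MEMORY_GB = 96  # per-GPU memory on G7e, from the article

# Nominal storage cost per parameter for each precision.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "fp4": 0.5}


def weight_footprint_gb(params_billion: float, precision: str) -> float:
    """Weight-only footprint: 1e9 params * bytes per param / 1e9 bytes per GB."""
    return params_billion * BYTES_PER_PARAM[precision]


def weights_fit(params_billion: float, precision: str, num_gpus: int) -> bool:
    """True if the weights alone fit in the aggregate GPU memory."""
    return weight_footprint_gb(params_billion, precision) <= num_gpus * GPU_MEMORY_GB


# (model size in billions, assumed precision, GPU count) for the article's examples.
scenarios = [
    (35, "bf16", 1),
    (70, "bf16", 1),
    (70, "fp8", 1),
    (150, "bf16", 4),
    (300, "bf16", 8),
]

for params, precision, gpus in scenarios:
    footprint = weight_footprint_gb(params, precision)
    verdict = "fits" if weights_fit(params, precision, gpus) else "does not fit"
    print(f"{params}B @ {precision}: {footprint:.0f} GB on {gpus} GPU(s) "
          f"({gpus * GPU_MEMORY_GB} GB) -> {verdict}")
```

Under these assumptions a 70B model fits on one GPU only after dropping to FP8, which matches the article's description of the single-GPU instance.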
Cost and Throughput Benchmarks
The hardware upgrades translate to measurable cost reductions at production concurrency levels. AWS benchmarks indicate up to 2.3x higher total inference performance than G6e.
At a concurrency of 32, the cost per million output tokens drops from $2.06 on G6e to $0.79 on G7e. The single-GPU instance operates at an hourly rate of $4.20.
| Metric | G6e (L40S) | G7e (Blackwell) |
|---|---|---|
| GPU Memory | 48 GB | 96 GB GDDR7 |
| Memory Bandwidth | ~864 GB/s | 1,597 GB/s |
| Cost per million output tokens (C=32) | $2.06 | $0.79 |
| Max Networking | 400 Gbps | 1,600 Gbps |
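Cost per million tokens follows from the hourly instance price and sustained throughput. The sketch below shows the conversion in both directions using the figures quoted above. It assumes the $0.79 result was measured on the $4.20 per hour single-GPU instance, which the article implies but does not state, so treat the implied throughput as illustrative only.

```python
SECONDS_PER_HOUR = 3600


def cost_per_million_tokens(hourly_rate_usd: float, tokens_per_s: float) -> float:
    """Dollars per one million output tokens at a sustained aggregate throughput."""
    tokens_per_hour = tokens_per_s * SECONDS_PER_HOUR
    return hourly_rate_usd / tokens_per_hour * 1_000_000


def implied_tokens_per_s(hourly_rate_usd: float, cost_per_million_usd: float) -> float:
    """Aggregate throughput needed to reach a given cost per million tokens."""
    return hourly_rate_usd / cost_per_million_usd * 1_000_000 / SECONDS_PER_HOUR


G7E_HOURLY_USD = 4.20      # single-GPU hourly rate from the article
G7E_COST_PER_MILLION = 0.79
G6E_COST_PER_MILLION = 2.06

throughput = implied_tokens_per_s(G7E_HOURLY_USD, G7E_COST_PER_MILLION)
savings = 1 - G7E_COST_PER_MILLION / G6E_COST_PER_MILLION

print(f"Implied G7e throughput: {throughput:.0f} tokens/s across all requests")
print(f"Cost reduction vs G6e: {savings:.1%}")
```

With these inputs the implied aggregate throughput is about 1,477 tokens per second, and the move from $2.06 to $0.79 is a reduction of about 62%.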
You can compound these hardware efficiencies with software optimizations. Running EAGLE speculative decoding on SageMaker AI with G7e instances yields an additional 2.4x throughput increase. This combination pushes total inference cost reductions up to 75% for specific workloads. If you need to reduce LLM API costs in production, speculative decoding on Blackwell hardware provides a direct optimization path.
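At a fixed hourly price, cost per token is inversely proportional to throughput, so a speedup converts directly into a percentage saving. The helper below shows that relationship as a minimal sketch. The article does not say which baseline the 75% figure is measured against, so this illustrates the arithmetic only and does not reconstruct the AWS benchmark.

```python
def cost_reduction_from_speedup(speedup: float) -> float:
    """Fractional cost saving when throughput rises by `speedup` at the same hourly price."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1 - 1 / speedup


def speedup_for_cost_reduction(reduction: float) -> float:
    """Throughput multiple needed to reach a fractional cost saving (0 <= reduction < 1)."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return 1 / (1 - reduction)


print(f"2.4x throughput -> {cost_reduction_from_speedup(2.4):.1%} lower cost per token")
print(f"75% lower cost -> {speedup_for_cost_reduction(0.75):.1f}x throughput per dollar")
```

A 2.4x throughput gain on its own corresponds to roughly 58% lower cost per token, while a 75% reduction requires about 4x more throughput per dollar than the baseline.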
Regional Availability
The G7e instances are currently available in the US East (N. Virginia) and US East (Ohio) regions. They are accessible through SageMaker AI for dedicated inference hosting. AWS also offers them as standard EC2 instances under on-demand, spot, or savings plan pricing models.
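For SageMaker AI hosting, the instance family is selected in the endpoint configuration. The sketch below builds a request for the boto3 `create_endpoint_config` call. The instance type string `ml.g7e.2xlarge` is an assumption based on SageMaker's usual `ml.` prefix, the config and model names are hypothetical placeholders, and the model must already be registered in SageMaker. Check the current instance type list for your account and region before using it.

```python
from typing import Any


def build_endpoint_config_request(
    config_name: str,
    model_name: str,
    instance_type: str = "ml.g7e.2xlarge",  # assumption: SageMaker name for the single-GPU size
    instance_count: int = 1,
) -> dict[str, Any]:
    """Build keyword arguments for the SageMaker create_endpoint_config API."""
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",
                "ModelName": model_name,  # must refer to an existing SageMaker model
                "InstanceType": instance_type,
                "InitialInstanceCount": instance_count,
            }
        ],
    }


def create_endpoint_config(request: dict[str, Any], region_name: str = "us-east-1") -> dict[str, Any]:
    """Send the request to SageMaker. Requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the request builder works without AWS dependencies

    client = boto3.client("sagemaker", region_name=region_name)
    return client.create_endpoint_config(**request)


# Hypothetical names, for illustration only.
request = build_endpoint_config_request("g7e-demo-config", "my-registered-model")
print(request["ProductionVariants"][0]["InstanceType"])
```

The default region `us-east-1` corresponds to US East (N. Virginia), one of the two regions listed above. Keeping the request builder separate from the API call lets you review or unit-test the configuration before anything is created in your account.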
Evaluate your current tensor parallelism configurations. If you distribute a 70B model across two or four older GPUs, moving to a single G7e.2xlarge with FP8 precision removes inter-GPU communication from the serving path and can lower your hosting bill. Benchmark your own model weights with FP4 quantization to determine whether the Blackwell architecture lets you consolidate your inference nodes further.