AWS SageMaker adds NVIDIA Blackwell G7e inference instances
Amazon SageMaker AI now offers G7e instances built on NVIDIA RTX PRO 6000 Blackwell GPUs, with 96 GB of GPU memory and up to 2.3x faster inference than G6e.
Amazon Web Services launched G7e instances on Amazon SageMaker AI, upgrading its inference infrastructure to NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs. The transition from the previous G6e family doubles the available GPU memory and adds native FP4 precision support. For teams operating high-traffic applications, the architectural changes lower the baseline cost per million tokens while reducing the need for complex multi-GPU setups.
Hardware Specifications
The G7e family shifts the underlying architecture from NVIDIA L40S to the Blackwell generation, introducing fifth-generation Tensor Cores and fourth-generation RT Cores. Each GPU carries 96 GB of GDDR7 memory.
Memory bandwidth scales to 1,597 GB/s, representing a 1.85x increase over the G6e line. Networking throughput also increases significantly. Instances now support up to 1,600 Gbps via Elastic Fabric Adapter (EFA), a four-fold increase over the previous generation.
The hardware natively supports FP4 precision. This lets you apply aggressive quantization directly at the hardware layer, raising throughput and shrinking the memory footprint of massive foundation models.
Deployment Capabilities
The expanded memory capacity alters the deployment math for medium to large parameter models. You can now fit models on a single node that previously required tensor parallelism across multiple GPUs.
The base G7e.2xlarge provides a single GPU capable of running a 35B parameter model like Qwen3.5-35B-A3B. Running at FP8 precision allows the same single-GPU instance to host 70B parameter models. If you deploy multi-agent systems relying on numerous smaller specialized models, the single-node footprint simplifies orchestration and cuts cross-node communication delays.
Scaling up, the G7e.24xlarge offers four GPUs to support models up to 150B parameters. The top-tier G7e.48xlarge configuration provides eight GPUs with 768 GB of aggregate memory. This accommodates 300B parameter foundation models. AWS specifically optimized SageMaker AI support for architectures like GPT-OSS-120B and the NVFP4 variant of Nemotron-3-Super-120B-A12B.
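The sizing claims above follow from simple weight-footprint arithmetic. The sketch below is a back-of-envelope estimate, not an AWS sizing tool: it assumes 96 GB per GPU, counts only model weights (2 bytes per parameter at FP16, 1 at FP8, 0.5 at FP4), and reserves an illustrative 15% headroom for KV cache and runtime overhead.

```python
# Back-of-envelope fit check for G7e instance sizes.
# Assumptions (not AWS figures): weights dominate memory; 15% of GPU
# memory is reserved as headroom for KV cache and runtime overhead.
GPU_MEM_GB = 96
GPUS_PER_INSTANCE = {"g7e.2xlarge": 1, "g7e.24xlarge": 4, "g7e.48xlarge": 8}
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "fp4": 0.5}

def weight_footprint_gb(params_billion: float, precision: str) -> float:
    """Gigabytes needed just to hold the weights at a given precision."""
    return params_billion * BYTES_PER_PARAM[precision]

def fits(params_billion: float, precision: str, instance: str,
         headroom: float = 0.15) -> bool:
    """True if the weights fit within the instance's usable GPU memory."""
    usable = GPUS_PER_INSTANCE[instance] * GPU_MEM_GB * (1 - headroom)
    return weight_footprint_gb(params_billion, precision) <= usable

print(fits(70, "fp8", "g7e.2xlarge"))    # 70 GB vs ~81.6 GB usable -> True
print(fits(70, "fp16", "g7e.2xlarge"))   # 140 GB does not fit -> False
print(fits(300, "fp8", "g7e.48xlarge"))  # 300 GB vs ~652.8 GB usable -> True
```

Under these assumptions, a 70B model fits a single GPU only at FP8 or below, which matches the article's single-node claim; real deployments also need memory for the KV cache, which grows with context length and concurrency.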
Cost and Throughput Benchmarks
Hardware upgrades translate to measurable cost reductions at production concurrency levels. AWS benchmarks indicate up to a 2.3x increase in total inference performance over G6e.
At a concurrency of 32, the cost per million output tokens drops from $2.06 on G6e to $0.79 on G7e. The single-GPU instance operates at an hourly rate of $4.20.
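The two quoted figures let you back out the implied aggregate throughput. This is an inference from the article's numbers, not an AWS-published throughput figure:

```python
# Derive the implied throughput from the article's quoted figures.
# This is arithmetic on the published prices, not an AWS benchmark number.
hourly_rate = 4.20        # $/hr for the single-GPU G7e instance
cost_per_million = 0.79   # $ per million output tokens at concurrency 32

tokens_per_hour = hourly_rate / cost_per_million * 1_000_000
tokens_per_sec = tokens_per_hour / 3600
print(f"~{tokens_per_sec:,.0f} output tokens/s implied")  # ~1,477
```

In other words, hitting $0.79 per million output tokens at $4.20/hr requires sustaining roughly 1,500 aggregate output tokens per second across the 32 concurrent streams.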
| Metric | G6e (L40S) | G7e (Blackwell) |
|---|---|---|
| GPU Memory | 48 GB | 96 GB GDDR7 |
| Memory Bandwidth | ~864 GB/s | 1,597 GB/s |
| Cost / 1M Output Tokens (concurrency 32) | $2.06 | $0.79 |
| Max Networking | 400 Gbps | 1,600 Gbps |
You can compound these hardware efficiencies with software optimizations. Running EAGLE speculative decoding on SageMaker AI with G7e instances yields an additional 2.4x throughput increase. This combination pushes total inference cost reductions up to 75% for specific workloads. If you need to reduce LLM API costs in production, speculative decoding on Blackwell hardware provides a direct optimization path.
Regional Availability
The G7e instances are currently available in the US East (N. Virginia) and US East (Ohio) regions. They are accessible through SageMaker AI for dedicated inference hosting. AWS also offers them as standard EC2 instances under on-demand, spot, or savings plan pricing models.
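For SageMaker hosting, deployment follows the usual endpoint-config flow. The sketch below builds the payload you would pass to boto3's `sagemaker_client.create_endpoint_config`; the instance type string `ml.g7e.2xlarge` follows SageMaker's standard `ml.<family>.<size>` naming convention but should be verified in your region, and the model and config names are placeholders:

```python
# Sketch: endpoint-config payload for hosting a model on a G7e instance.
# Assumptions: "ml.g7e.2xlarge" follows SageMaker's usual instance naming
# (verify availability in your region); names are placeholders. Pass the
# resulting dict to boto3: sagemaker_client.create_endpoint_config(**config)
def g7e_endpoint_config(model_name: str, instance_count: int = 1) -> dict:
    """Build a single-variant endpoint config targeting a G7e instance."""
    return {
        "EndpointConfigName": f"{model_name}-g7e-config",
        "ProductionVariants": [{
            "VariantName": "AllTraffic",
            "ModelName": model_name,
            "InstanceType": "ml.g7e.2xlarge",
            "InitialInstanceCount": instance_count,
        }],
    }

config = g7e_endpoint_config("my-70b-fp8-model")
print(config["ProductionVariants"][0]["InstanceType"])  # ml.g7e.2xlarge
```

After `create_endpoint_config`, a `create_endpoint` call referencing the config name provisions the actual inference endpoint.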
Evaluate your current tensor parallelism configurations. If you distribute a 70B model across two or four older GPUs, migrating to a single G7e.2xlarge with FP8 precision will reduce network latency and lower your hosting bill. Test your specific model weights with native FP4 compilation to determine if the Blackwell architecture allows you to consolidate your inference nodes entirely.