
AWS SageMaker adds NVIDIA Blackwell G7e inference instances

Amazon SageMaker AI now offers G7e instances powered by NVIDIA RTX PRO 6000 Blackwell GPUs, with 96 GB of GPU memory per card and up to 2.3x faster inference than G6e.

Amazon Web Services launched G7e instances on Amazon SageMaker AI, upgrading its inference infrastructure to NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs. Compared with the previous G6e family, the new instances double per-GPU memory and add native FP4 precision support. For teams operating high-traffic applications, the architectural changes lower the baseline cost per million tokens while reducing the need for complex multi-GPU setups.

Hardware Specifications

The G7e family shifts the underlying architecture from NVIDIA L40S to the Blackwell generation. This introduces fifth-generation Tensor Cores and fourth-generation Ray Tracing cores. Each GPU features 96 GB of GDDR7 memory.

Memory bandwidth scales to 1,597 GB/s, representing a 1.85x increase over the G6e line. Networking throughput also increases significantly. Instances now support up to 1,600 Gbps via Elastic Fabric Adapter (EFA), a four-fold increase over the previous generation.

The hardware natively supports FP4 precision, so 4-bit weights can be computed on directly rather than dequantized to a higher precision first. That enables higher throughput and smaller memory footprints for massive foundation models.
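As a rule of thumb, weight memory scales linearly with bits per parameter. The back-of-envelope arithmetic below (a minimal sketch covering weights only, ignoring KV cache and runtime overhead) shows why FP8 and FP4 change what fits on a 96 GB card:

```python
def weight_footprint_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone; excludes KV cache,
    activations, and framework overhead."""
    # 1e9 params * (bits / 8) bytes, expressed in GB
    return params_billion * bits_per_param / 8

for bits, label in [(16, "FP16/BF16"), (8, "FP8"), (4, "FP4")]:
    print(f"70B @ {label:9s}: ~{weight_footprint_gb(70, bits):.0f} GB")
# 70B @ FP16/BF16: ~140 GB -> still needs multiple GPUs
# 70B @ FP8      : ~70 GB  -> fits on one 96 GB G7e GPU
# 70B @ FP4      : ~35 GB  -> leaves headroom for KV cache
```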

Deployment Capabilities

The expanded memory capacity alters the deployment math for medium to large parameter models. You can now fit models on a single node that previously required tensor parallelism across multiple GPUs.

The base G7e.2xlarge provides a single GPU capable of running a 35B-parameter model such as Qwen3.5-35B-A3B. With FP8 precision, the same single-GPU instance can host 70B-parameter models. If you deploy multi-agent systems that rely on numerous smaller specialized models, the single-node footprint simplifies orchestration and cuts cross-node communication delays.
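A minimal deployment sketch with the SageMaker Python SDK might look like the following. The LMI container URI, the ml.g7e.2xlarge instance name, and the model ID are placeholders or assumptions to verify against the AWS documentation for your region:

```python
import sagemaker
from sagemaker.model import Model

session = sagemaker.Session()
role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # your execution role

# Large Model Inference (LMI) container -- URI is illustrative; look up the
# current image for your region in the AWS docs.
image_uri = "763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:latest-lmi"

model = Model(
    image_uri=image_uri,
    role=role,
    env={
        "HF_MODEL_ID": "meta-llama/Llama-3.3-70B-Instruct",  # example 70B model
        "OPTION_QUANTIZE": "fp8",              # quantize weights to FP8
        "OPTION_TENSOR_PARALLEL_DEGREE": "1",  # single GPU, no tensor parallelism
    },
    sagemaker_session=session,
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g7e.2xlarge",  # assumed SageMaker name for the new instance
)
```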

Scaling up, the G7e.24xlarge offers four GPUs to support models up to 150B parameters. The top-tier G7e.48xlarge configuration provides eight GPUs with 768 GB of aggregate memory. This accommodates 300B parameter foundation models. AWS specifically optimized SageMaker AI support for architectures like GPT-OSS-120B and the NVFP4 variant of Nemotron-3-Super-120B-A12B.

Cost and Throughput Benchmarks

Hardware upgrades translate to measurable cost reductions at production concurrency levels. AWS benchmarks indicate up to a 2.3x increase in total inference performance over the G6e generation.

At a concurrency of 32, the cost per million output tokens drops from $2.06 on G6e to $0.79 on G7e. The single-GPU instance operates at an hourly rate of $4.20.
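Those two figures imply the aggregate throughput behind the benchmark; a quick back-calculation:

```python
hourly_rate = 4.20        # USD per hour, single-GPU G7e instance
cost_per_m_tokens = 0.79  # USD per million output tokens at concurrency 32

tokens_per_hour = hourly_rate / cost_per_m_tokens * 1_000_000
print(f"~{tokens_per_hour / 3600:,.0f} output tokens/s across 32 streams")
# ~1,477 output tokens/s aggregate
```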

| Metric | G6e (L40S) | G7e (Blackwell) |
| --- | --- | --- |
| GPU Memory | 48 GB | 96 GB GDDR7 |
| Memory Bandwidth | ~864 GB/s | 1,597 GB/s |
| Output Tokens (C=32) | $2.06 / million | $0.79 / million |
| Max Networking | 400 Gbps | 1,600 Gbps |

You can compound these hardware efficiencies with software optimizations. Running EAGLE speculative decoding on SageMaker AI with G7e instances yields an additional 2.4x throughput increase. This combination pushes total inference cost reductions up to 75% for specific workloads. If you need to reduce LLM API costs in production, speculative decoding on Blackwell hardware provides a direct optimization path.
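AWS does not detail the serving stack behind that number, but as an illustration, here is a hedged sketch of enabling EAGLE speculative decoding in vLLM. Argument names vary across vLLM releases, and both model checkpoints are examples, so verify against your installed version:

```python
from vllm import LLM, SamplingParams

# Target model plus a small EAGLE draft head that proposes tokens the
# target model then verifies in a single forward pass.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",           # example target model
    speculative_config={
        "method": "eagle",
        "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B",  # example draft checkpoint
        "num_speculative_tokens": 5,                    # tokens drafted per step
    },
)

outputs = llm.generate(
    ["Summarize FP4 quantization in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```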

Regional Availability

The G7e instances are currently available in the US East (N. Virginia) and US East (Ohio) regions. They are accessible through SageMaker AI for dedicated inference hosting, and AWS also offers them as standard EC2 instances under On-Demand, Spot, or Savings Plans pricing.

Evaluate your current tensor parallelism configurations. If you distribute a 70B model across two or four older GPUs, migrating to a single G7e.2xlarge with FP8 precision can remove that communication overhead and lower your hosting bill. Test your specific model weights with native FP4 compilation to determine whether the Blackwell architecture lets you consolidate your inference nodes entirely, as in the sketch below.
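One way to run that test offline is NVIDIA's TensorRT Model Optimizer. A minimal sketch, assuming the nvidia-modelopt package and its NVFP4_DEFAULT_CFG preset (both to verify against the version you install):

```python
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # example checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

def calibrate(m):
    # Feed a few representative prompts so the quantizer can gather
    # activation statistics before choosing FP4 scales.
    batch = tokenizer("A representative production prompt.", return_tensors="pt")
    m(**batch)

# NVFP4 is the Blackwell-native 4-bit format the article references.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop=calibrate)
```

After quantizing, rerun your evaluation suite against the FP4 checkpoint before committing to a consolidated single-node deployment.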
