SageMaker Endpoints Now Expose Native OpenAI Completions API

On May 20, 2026, AWS launched OpenAI-compatible API support for Amazon SageMaker AI real-time inference endpoints. Developers can now point standard OpenAI ecosystem tools directly at AWS infrastructure by simply changing the base URL. This eliminates the need for custom client-side logic, code rewrites, or complex SigV4 signing wrappers previously required to route traffic to AWS hosted models.

Authentication and API Design

SageMaker AI endpoints now natively expose the /openai/v1/chat/completions path. The endpoints support standard Chat Completions requests and handle streaming responses natively using Server-Sent Events (SSE).

To integrate smoothly with the standard OpenAI SDKs in Python and JavaScript, AWS introduced time-limited bearer tokens. You generate these using the sagemaker.core.token_generator.generate_token function in the SageMaker Python SDK. Tokens remain valid for up to 12 hours, with the exact duration configurable down to a single second.

Security remains tied to your existing AWS IAM credentials. The underlying role executing the requests requires the sagemaker:CallWithBearerToken and sagemaker:InvokeEndpoint permissions to authenticate the token generation and inference invocation.

Multi-Model Routing Capabilities

The update extends beyond single-model deployments. The API supports multi-model hosting through SageMaker AI inference components. You can deploy multiple specialized models to a single endpoint and route requests dynamically.

A single OpenAI-compatible base URL can direct general queries to a Llama instance and domain-specific tasks to a fine-tuned Mistral model. The routing depends entirely on the model name specified in the client request payload. This is highly useful when building multi-step AI agents that require different models for reasoning, data extraction, and synthesis.

Supported Frameworks and Containers

The compatibility layer operates out-of-the-box with popular AI agent frameworks like LangChain and Strands Agents. It removes the friction of deploying enterprise-grade AI inference within a secure Virtual Private Cloud (VPC) while keeping standard open-source tooling intact.

AWS officially supports the SageMaker AI vLLM Deep Learning Container and the SGLang Deep Learning Container. You can also use custom containers, provided they implement the /v1/chat/completions and /ping network paths. The feature is available in 14 AWS regions at launch.

If your application relies on standard OpenAI SDKs or gateways like the Vercel AI SDK, you can now migrate those workloads to dedicated AWS GPU instances. Replace your existing API key with a SageMaker generated bearer token and update the base URL to your endpoint to route traffic through AWS.

SageMaker Endpoints Now Expose Native OpenAI Completions API

Authentication and API Design

Multi-Model Routing Capabilities

Supported Frameworks and Containers

Keep Reading

How to Use the New Unified Cloudflare CLI and Local Explorer

Runpod Flash Removes Container Overhead for AI Inference

Agents Can Provision Cloudflare Accounts via Stripe Projects

Claude Platform Goes GA on AWS With Native API Parity

Pentagon Approves Eight AI Vendors For IL7 Classified Networks