Ai Engineering 2 min read

SageMaker Endpoints Now Expose Native OpenAI Completions API

AWS updated Amazon SageMaker AI to expose standard OpenAI API paths, allowing seamless migration of agent workloads without custom SigV4 wrappers.

On May 20, 2026, AWS launched OpenAI-compatible API support for Amazon SageMaker AI real-time inference endpoints. Developers can now point standard OpenAI ecosystem tools directly at AWS infrastructure by simply changing the base URL. This eliminates the need for custom client-side logic, code rewrites, or complex SigV4 signing wrappers previously required to route traffic to AWS hosted models.

Authentication and API Design

SageMaker AI endpoints now natively expose the /openai/v1/chat/completions path. The endpoints support standard Chat Completions requests and handle streaming responses natively using Server-Sent Events (SSE).

To integrate smoothly with the standard OpenAI SDKs in Python and JavaScript, AWS introduced time-limited bearer tokens. You generate these using the sagemaker.core.token_generator.generate_token function in the SageMaker Python SDK. Tokens remain valid for up to 12 hours, with the exact duration configurable down to a single second.

Security remains tied to your existing AWS IAM credentials. The underlying role executing the requests requires the sagemaker:CallWithBearerToken and sagemaker:InvokeEndpoint permissions to authenticate the token generation and inference invocation.

Multi-Model Routing Capabilities

The update extends beyond single-model deployments. The API supports multi-model hosting through SageMaker AI inference components. You can deploy multiple specialized models to a single endpoint and route requests dynamically.

A single OpenAI-compatible base URL can direct general queries to a Llama instance and domain-specific tasks to a fine-tuned Mistral model. The routing depends entirely on the model name specified in the client request payload. This is highly useful when building multi-step AI agents that require different models for reasoning, data extraction, and synthesis.

Supported Frameworks and Containers

The compatibility layer operates out-of-the-box with popular AI agent frameworks like LangChain and Strands Agents. It removes the friction of deploying enterprise-grade AI inference within a secure Virtual Private Cloud (VPC) while keeping standard open-source tooling intact.

AWS officially supports the SageMaker AI vLLM Deep Learning Container and the SGLang Deep Learning Container. You can also use custom containers, provided they implement the /v1/chat/completions and /ping network paths. The feature is available in 14 AWS regions at launch.

If your application relies on standard OpenAI SDKs or gateways like the Vercel AI SDK, you can now migrate those workloads to dedicated AWS GPU instances. Replace your existing API key with a SageMaker generated bearer token and update the base URL to your endpoint to route traffic through AWS.

Get Insanely Good at AI

Get Insanely Good at AI

The book for developers who want to understand how AI actually works. LLMs, prompt engineering, RAG, AI agents, and production systems.

Keep Reading