
xAI Ships 2-Minute Voice Clones and Grok 4.3 APIs

xAI has introduced a fast custom voice cloning suite and a new Voice Library alongside the launch of its 1M-context Grok 4.3 model.

On April 30, 2026, xAI released Custom Voices, a voice cloning suite, alongside Grok 4.3, a model with a 1,000,000-token context window. Developers can clone a human voice from a short audio sample in under two minutes and deploy it through the Grok Text-to-Speech (TTS) and Voice Agent APIs.

Voice Cloning and Library Features

The new Custom Voices feature generates a production-ready voice clone from a reference audio clip. While shorter clips work, xAI specifies that recordings between 90 and 120 seconds yield optimal quality. Cloned voices are managed in a new Voice Library within the xAI console, which also houses a catalog of over 80 built-in voices supporting 28 languages.
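
As a rough pre-upload check, the 90 to 120 second guidance can be tested locally before a clip is submitted. The sketch below assumes the reference recording is a WAV file and uses only Python's standard library. The function names and the "short", "optimal", and "long" labels are illustrative and not part of any xAI tooling, and the article does not say how clips longer than 120 seconds are handled.

```python
import wave

# Range the article reports as yielding optimal clone quality, in seconds.
OPTIMAL_MIN_S = 90.0
OPTIMAL_MAX_S = 120.0


def wav_duration_seconds(wav_file) -> float:
    """Return the duration of a WAV file given a path or a binary file object."""
    with wave.open(wav_file, "rb") as reader:
        return reader.getnframes() / reader.getframerate()


def classify_clip(duration_s: float) -> str:
    """Classify a reference clip against the 90-120 second guidance."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    if duration_s < OPTIMAL_MIN_S:
        return "short"  # usable per the article, but below the optimal range
    if duration_s > OPTIMAL_MAX_S:
        return "long"  # assumption: the article does not cover longer clips
    return "optimal"
```

Calling classify_clip(wav_duration_seconds("reference.wav")) on a hypothetical recording would flag clips worth re-recording before they are uploaded.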

Once the clip is processed, the system assigns the clone a unique voice_id. This ID works as a drop-in replacement for a default voice in existing Grok Voice implementations. Custom clones inherit the full Grok Voice stack, so developers can use Speech Tags such as [laugh], [sigh], or <whisper> to shape the cloned audio output dynamically. For real-time voice agents, the WebSocket integration accepts custom voice IDs natively.
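
To make the drop-in idea concrete, here is a minimal sketch of assembling a TTS request body that pairs a custom voice_id with inline Speech Tags. The JSON field names ("voice_id", "text") and the example ID are assumptions for illustration only. The article does not document the request schema, so check the xAI API reference before relying on any of them.

```python
import json
import re


def find_bracket_tags(text: str) -> list[str]:
    """Return bracket-style Speech Tags such as [laugh], in order of appearance."""
    return re.findall(r"\[([a-z_]+)\]", text)


def build_tts_payload(text: str, voice_id: str) -> str:
    """Serialize a TTS request body as JSON.

    The field names here are hypothetical placeholders, not the
    documented xAI schema.
    """
    if not voice_id:
        raise ValueError("voice_id is required")
    return json.dumps({"voice_id": voice_id, "text": text})


# Hypothetical ID of the kind the Voice Library would assign to a clone.
payload = build_tts_payload(
    "Welcome back [laugh] it has been a while.",
    voice_id="voice_abc123",
)
```

Because the tags travel inside the text itself, switching from a built-in voice to a clone changes only the ID value in this sketch.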

Security Verification and Geographic Limits

To mitigate unauthorized cloning, xAI uses a two-stage verification process during clone generation. First, the target speaker must read a system-provided verification phrase live, and the xAI Speech-to-Text (STT) engine transcribes this live feed to confirm active participation. Second, the system extracts speaker embeddings from both the live phrase and the primary reference audio, comparing them to verify they belong to the same person.
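
The two stages can be sketched in a few lines to show how they fit together. This is an illustration of the general technique, not xAI's implementation: the article does not disclose how transcripts are matched, which similarity metric is used, or what threshold applies. Cosine similarity and the 0.75 cutoff below are assumptions chosen because they are common in speaker verification.

```python
import math
import re


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation, and split into words."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).split()


def phrase_matches(expected: str, transcript: str) -> bool:
    """Stage 1: does the STT transcript match the verification phrase?"""
    return normalize(expected) == normalize(transcript)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def verify_speaker(
    expected_phrase: str,
    live_transcript: str,
    live_embedding: list[float],
    reference_embedding: list[float],
    threshold: float = 0.75,  # assumed value, not published by xAI
) -> bool:
    """Pass only if both the phrase check and the embedding check succeed."""
    if not phrase_matches(expected_phrase, live_transcript):
        return False
    # Stage 2: compare the live speaker to the reference recording.
    return cosine_similarity(live_embedding, reference_embedding) >= threshold
```

Running the phrase check first means a replayed recording of someone else's voice fails before any embedding comparison happens, since it would not contain the freshly issued phrase.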

Custom voices are strictly scoped to the generating team’s workspace. They are not pooled into xAI’s public training data or accessible to other organizations. Due to biometric privacy laws, the feature is geographically restricted. It is currently available only in the United States, with a hard block on usage within Illinois.
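
Applications that expose cloning to their own users may want to mirror this restriction before calling the API. The check below is a hypothetical application-side guard, assuming the caller already knows the user's country and state codes. How xAI itself determines location is not described in the article, and rejecting US users whose state is unknown is a conservative choice made here, not a documented rule.

```python
from typing import Optional


def custom_voices_available(country_code: str, region_code: Optional[str] = None) -> bool:
    """Return True only for US users located outside Illinois."""
    if country_code.strip().upper() != "US":
        return False
    if not region_code:
        # Conservative assumption: without a state we cannot rule out Illinois.
        return False
    return region_code.strip().upper() != "IL"
```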

API Pricing and Grok 4.3

xAI does not charge a premium for custom voice inference. Using a cloned voice_id costs the same as the standard Grok Voice tiers. This infrastructure update ships alongside Grok 4.3, xAI’s new flagship model, which extends its capacity to a 1,000,000 token context window.

Service                      Pricing
Grok Text-to-Speech (TTS)    $4.20 per 1 million characters
Grok Voice Agent             $3.00 per hour
Grok 4.3 (Input)             $1.25 per 1 million tokens
Grok 4.3 (Output)            $2.50 per 1 million tokens
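
For budgeting, the listed rates translate into simple per-unit arithmetic. The helper below is a rough estimator that assumes strictly linear billing with no minimums, rounding rules, or volume discounts, none of which the article addresses. The function names are illustrative.

```python
from decimal import Decimal

# Rates as listed in the article, in US dollars.
TTS_USD_PER_MILLION_CHARS = Decimal("4.20")
VOICE_AGENT_USD_PER_HOUR = Decimal("3.00")
GROK_43_INPUT_USD_PER_MILLION_TOKENS = Decimal("1.25")
GROK_43_OUTPUT_USD_PER_MILLION_TOKENS = Decimal("2.50")

MILLION = 1_000_000


def tts_cost_usd(characters: int) -> Decimal:
    """Estimated TTS cost for a number of synthesized characters."""
    return TTS_USD_PER_MILLION_CHARS * characters / MILLION


def voice_agent_cost_usd(minutes: int) -> Decimal:
    """Estimated Voice Agent cost for a session length in minutes."""
    return VOICE_AGENT_USD_PER_HOUR * minutes / 60


def grok_43_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimated Grok 4.3 cost for one request's input and output tokens."""
    return (
        GROK_43_INPUT_USD_PER_MILLION_TOKENS * input_tokens / MILLION
        + GROK_43_OUTPUT_USD_PER_MILLION_TOKENS * output_tokens / MILLION
    )
```

Under these assumptions, 250,000 synthesized characters come to $1.05, a 90-minute agent session to $4.50, and a request with 800,000 input tokens and 20,000 output tokens to $1.05.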

This pricing positions Grok 4.3 aggressively against competing models from OpenAI and Anthropic, particularly for high-volume agentic tasks that need long-context retention alongside low-latency audio generation. Custom Voices is rolling out first to SuperGrok and X Premium+ subscribers.

If you integrate the Grok Voice Agent API, you can swap out standard system voices immediately by passing the new custom voice_id in your WebSocket connection payloads. The latency profile remains unchanged, allowing applications to maintain real-time conversational speeds with domain-specific or branded voices.
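
A sketch of what that swap might look like on the client side follows. The session configuration shape, including the "voice" and "instructions" keys and the voice names, is hypothetical and stands in for whatever the Voice Agent API actually expects. Only the idea of replacing one identifier with another comes from the article.

```python
import copy
import json


def with_custom_voice(session_config: dict, voice_id: str) -> dict:
    """Return a copy of a session config that uses the given voice_id.

    The "voice" key is an assumed field name, not a documented one.
    """
    if not voice_id:
        raise ValueError("voice_id is required")
    updated = copy.deepcopy(session_config)
    updated["voice"] = voice_id
    return updated


# Hypothetical existing configuration that uses a built-in voice.
default_config = {
    "voice": "default_voice",
    "instructions": "You are a concise support agent.",
}

custom_config = with_custom_voice(default_config, "voice_abc123")

# JSON text that a client would send over its WebSocket connection.
message = json.dumps(custom_config)
```

Copying the configuration instead of mutating it keeps the default voice available as a fallback if the custom ID is rejected.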
