Google Is Solving the LLM Memory Bottleneck with TurboQuant
Google Research published TurboQuant, a vector quantization algorithm that compresses LLM key-value caches to 3.5 bits per channel with zero accuracy loss. At 4 bits, it delivers up to 8x speedup in attention scoring on H100 GPUs compared to unquantized 32-bit keys. The paper was presented at ICLR 2026 and is available on arXiv.
The KV cache is one of the main memory bottlenecks in serving large language models. Every generated token requires storing key and value embeddings for all previous tokens across all attention layers. For long-context models, the cache grows linearly with context length and with the number of layers and KV heads, consuming significant GPU memory and creating communication bottlenecks between HBM and SRAM on accelerators.
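To make the scaling concrete, here is a back-of-the-envelope cache-size calculation. The model dimensions below (32 layers, 8 KV heads, head dimension 128, 128K context) are illustrative Llama-style numbers, not figures from the paper:

```python
# Rough KV cache size for a transformer with grouped-query attention.
# Dimensions are hypothetical, for illustration only.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bits_per_value):
    # Factor of 2 covers keys and values; divide by 8 to go from bits to bytes.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bits_per_value / 8

fp16 = kv_cache_bytes(32, 8, 128, 128_000, 16)
q35 = kv_cache_bytes(32, 8, 128, 128_000, 3.5)
print(f"FP16: {fp16 / 1e9:.1f} GB, 3.5-bit: {q35 / 1e9:.1f} GB")
# At 128K context this single request's cache drops from ~16.8 GB to ~3.7 GB.
```

The linear dependence on `seq_len` is why long-context serving is cache-bound rather than weight-bound.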
How TurboQuant Works
TurboQuant compresses vectors through two stages:
Stage 1 (PolarQuant): The algorithm applies a random rotation to input vectors, which induces a concentrated Beta distribution on each coordinate. In high dimensions, this distribution converges to a Gaussian, and distinct coordinates become nearly independent. That independence is what makes the approach work: it allows TurboQuant to apply optimal scalar quantizers per coordinate individually, without needing to account for correlations between coordinates, while still achieving near-optimal distortion.
The rotation also converts coordinate pairs into polar form (radius and angle), which eliminates the need for per-block normalization constants that other quantization methods carry as metadata overhead. Traditional methods typically add 1-2 extra bits per number just for this bookkeeping.
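The concentration effect of Stage 1 is easy to see numerically. The sketch below uses a random orthogonal matrix from a QR decomposition as a stand-in for the paper's rotation (the exact construction TurboQuant uses may differ): even a maximally spiky unit vector spreads into coordinates that look like small, near-independent Gaussians.

```python
import numpy as np

# Illustrative only: a Haar-ish random rotation via QR of a Gaussian matrix.
rng = np.random.default_rng(0)
d = 1024
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal matrix

x = np.zeros(d)
x[0] = 1.0            # worst-case spiky unit vector: all mass in one coordinate
y = Q @ x             # rotated vector (norm is preserved exactly)

# After rotation, each coordinate behaves like N(0, 1/d): the per-coordinate
# std times sqrt(d) is close to 1, and no single coordinate dominates.
print(y.std() * np.sqrt(d))   # ~1
print(np.abs(y).max())        # ~sqrt(2 ln d / d), i.e. ~0.12 here
```

Because every coordinate now follows (approximately) the same known distribution, one fixed scalar codebook works for all of them, which is what removes the per-block scale metadata.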
Stage 2 (Quantized Johnson-Lindenstrauss): MSE-optimal quantizers introduce bias when estimating inner products. TurboQuant addresses this by applying a 1-bit QJL transform to the residual error from Stage 1. This produces an unbiased inner product estimator with low distortion, which is critical because attention scoring in transformers relies on inner products between query and key vectors.
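A minimal sketch of the 1-bit JL idea, in the spirit of QJL: the stored side (a key) is reduced to the signs of random projections, while the query stays full precision. For a Gaussian projection `s`, `E[sign(s·k)(s·q)] = sqrt(2/pi) * <k,q> / ||k||`, which yields an unbiased inner-product estimate after rescaling. The estimator form and dimensions here are illustrative assumptions, not TurboQuant's exact Stage 2:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 64, 50_000                  # many projections, for a low-variance demo
S = rng.standard_normal((m, d))    # shared random projection matrix

key = rng.standard_normal(d)
query = rng.standard_normal(d)

bits = np.sign(S @ key)            # 1 bit per projection stored for the key
# Rescale using E[sign(s.k)(s.q)] = sqrt(2/pi) * <k,q> / ||k||
est = np.linalg.norm(key) * np.sqrt(np.pi / 2) * (bits @ (S @ query)) / m
print(est, key @ query)            # estimate tracks the true inner product
```

Unbiasedness is the point: attention scores are inner products, and a biased estimator would systematically distort the softmax over long contexts even if its variance were small.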
The two-stage design is grounded in Shannon’s source coding theory. TurboQuant achieves distortion rates within a factor of approximately 2.7x of the information-theoretic lower bound.
Benchmark Results
Google evaluated TurboQuant across five long-context benchmarks (LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, and L-Eval) using Gemma and Mistral models.
| Bit-width | Memory reduction | Accuracy | Speedup (H100) |
|---|---|---|---|
| 3.5 bits/channel | ~4.5x vs FP16 | Zero loss | Significant |
| 4 bits/channel | ~4x vs FP16 | Zero loss | Up to 8x vs FP32 |
| 2.5 bits/channel | ~6x vs FP16 | Marginal degradation | Significant |
At 3.5 bits per channel, TurboQuant achieved perfect scores on needle-in-haystack tasks across all context lengths. At 2.5 bits, there was marginal quality degradation, but memory usage dropped to roughly one-sixth of FP16 storage.
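The memory factors in the table follow directly from the bit-widths, ignoring small metadata overhead, since FP16 spends 16 bits per value:

```python
# Memory reduction vs FP16 is just 16 / bits_per_channel.
for bits in (4.0, 3.5, 2.5):
    print(f"{bits} bits/channel -> {16 / bits:.1f}x smaller than FP16")
```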
Why It Matters for Inference
The algorithm is data-oblivious, meaning it requires no dataset-specific tuning, calibration, or preprocessing. You do not need to run calibration datasets through the model to configure the quantizer. This makes it suitable for online, real-time KV cache compression during inference, where new key-value pairs are generated continuously and need to be quantized on the fly.
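The streaming property can be sketched as follows: because the quantizer is fully determined by a random seed and a fixed codebook, each new key can be compressed the moment it is produced, with no calibration pass. The uniform scalar codebook below is a toy stand-in for the paper's optimal codebook, and the rotation is again an illustrative QR-based one:

```python
import numpy as np

rng = np.random.default_rng(42)
d = 128
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # fixed random rotation
# 16-level (4-bit) uniform codebook, scaled to the ~1/sqrt(d) coordinate
# spread of a rotated unit vector. Fixed up front: no data ever touched.
levels = np.linspace(-3, 3, 16) / np.sqrt(d)

def quantize(vec):
    rotated = Q @ vec                              # data-oblivious rotation
    # Nearest codebook level per coordinate -> 4 bits of payload per channel
    return np.abs(rotated[:, None] - levels).argmin(axis=1).astype(np.uint8)

def dequantize(codes):
    return Q.T @ levels[codes]                     # undo the rotation

key = rng.standard_normal(d)
key /= np.linalg.norm(key)                         # unit-norm key vector
codes = quantize(key)                              # quantized on the fly
approx = dequantize(codes)
print(np.linalg.norm(key - approx))                # small reconstruction error
```

Nothing in `quantize` depends on previously seen data, which is exactly what makes per-token, online compression of a growing KV cache possible.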
It is also designed to be GPU-friendly. The operations (random rotations, per-coordinate scalar quantization, 1-bit transforms) map naturally to vectorized accelerator instructions. This contrasts with some alternative quantization methods that rely on binary search or codebook lookups that are hard to parallelize on GPUs.
Beyond KV Caches
TurboQuant also applies to vector search, where it compresses database vectors for nearest-neighbor lookups. Google reports that TurboQuant outperforms existing product quantization methods in recall while reducing indexing time to near zero, since there is no codebook training step. For RAG systems and vector databases that rely on high-dimensional similarity search, this could reduce both index size and query latency.
Current Availability
TurboQuant is a research publication. Google has not released an open-source implementation. The algorithm is described in full in the ICLR 2026 paper, with precomputed optimal codebooks for practical bit-widths. The technique is general enough to apply to any transformer-based LLM that uses standard multi-head or grouped-query attention with a KV cache.