Google Is Solving the LLM Memory Bottleneck with TurboQuant
Google Research published TurboQuant, a vector quantization algorithm that compresses LLM key-value caches to 3.5 bits per channel with zero accuracy loss. At 4 bits, it delivers up to 8x speedup in attention scoring on H100 GPUs compared to unquantized 32-bit keys. The paper was presented at ICLR 2026 and is available on arXiv.
The KV cache is one of the main memory bottlenecks in serving large language models. Every generated token requires storing key and value embeddings from all previous tokens across all attention layers. For long-context models, this scales linearly with both model size and context length, consuming significant GPU memory and creating communication bottlenecks between HBM and SRAM on accelerators.
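To make the linear scaling concrete, here is a back-of-the-envelope calculation. The model shape below (32 layers, 8 KV heads, head dimension 128) is a hypothetical configuration chosen only for illustration, not one taken from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bits_per_value, batch_size=1):
    """Bytes needed to store keys and values for every cached token."""
    # Factor of 2: one key vector and one value vector per head, per layer.
    values_per_token = 2 * num_layers * num_kv_heads * head_dim
    total_bits = batch_size * seq_len * values_per_token * bits_per_value
    return total_bits / 8

# Hypothetical model shape (an assumption for this example).
config = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for seq_len in (8_192, 131_072):
    fp16 = kv_cache_bytes(seq_len=seq_len, bits_per_value=16, **config)
    print(f"{seq_len:>7} tokens: {fp16 / 2**30:.1f} GiB at FP16")
```

For this shape the cache costs 128 KiB per token, so 8,192 tokens take 1 GiB and 131,072 tokens take 16 GiB per sequence, before counting model weights.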
How TurboQuant Works
TurboQuant compresses vectors through two stages:
Stage 1 (PolarQuant): The algorithm applies a random rotation to input vectors, which induces a concentrated Beta distribution on each coordinate. In high dimensions, this distribution converges to a Gaussian, and distinct coordinates become nearly independent. That independence is what makes the approach work: it allows TurboQuant to apply optimal scalar quantizers per coordinate individually, without needing to account for correlations between coordinates, while still achieving near-optimal distortion.
After the rotation, PolarQuant expresses coordinate pairs in polar form (radius and angle), which eliminates the need for the per-block normalization constants that other quantization methods carry as metadata. Traditional methods typically spend 1 to 2 extra bits per value on this bookkeeping.
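The rotate-then-quantize idea can be sketched in a few lines of NumPy. This is a simplified illustration, not Google's implementation: it skips the polar conversion, stores one norm per vector instead, and fits the scalar codebook with Lloyd's algorithm on Gaussian samples as a stand-in for the paper's precomputed codebooks. All function names are hypothetical.

```python
import numpy as np

def random_rotation(dim, rng):
    """Sample a random orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q * np.sign(np.diag(r))  # sign fix so the rotation is uniformly distributed

def lloyd_max_codebook(samples, bits, iters=30):
    """Fit 2**bits scalar levels to 1-D samples with Lloyd's algorithm."""
    levels = np.quantile(samples, (np.arange(2**bits) + 0.5) / 2**bits)
    for _ in range(iters):
        edges = (levels[:-1] + levels[1:]) / 2          # nearest-level boundaries
        cell = np.searchsorted(edges, samples)
        levels = np.array([samples[cell == k].mean() for k in range(levels.size)])
    return levels

def quantize(x, rotation, levels):
    """Rotate the normalized vector, then quantize each coordinate independently."""
    norm = np.linalg.norm(x)
    rotated = rotation @ (x / norm)
    edges = (levels[:-1] + levels[1:]) / 2
    return np.searchsorted(edges, rotated).astype(np.uint8), norm

def dequantize(codes, norm, rotation, levels):
    """Look up the levels and undo the rotation."""
    return norm * (rotation.T @ levels[codes])

rng = np.random.default_rng(0)
dim, bits = 128, 3
rotation = random_rotation(dim, rng)
# The codebook depends only on dim and bits, never on the data being compressed:
# coordinates of a rotated unit vector are approximately N(0, 1/dim).
levels = lloyd_max_codebook(rng.standard_normal(100_000) / np.sqrt(dim), bits)

x = 5.0 * rng.standard_normal(dim)
codes, norm = quantize(x, rotation, levels)
x_hat = dequantize(codes, norm, rotation, levels)
rel_mse = np.sum((x - x_hat) ** 2) / np.sum(x ** 2)
print(f"relative squared error at {bits} bits: {rel_mse:.4f}")
```

Because the rotation and the codebook are fixed ahead of time, quantizing a new vector is one matrix-vector product plus a per-coordinate lookup, with no pass over a calibration set.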
Stage 2 (Quantized Johnson-Lindenstrauss): MSE-optimal quantizers introduce bias when estimating inner products. TurboQuant addresses this by applying a 1-bit QJL transform to the residual error from Stage 1. This produces an unbiased inner product estimator with low distortion, which is critical because attention scoring in transformers relies on inner products between query and key vectors.
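A minimal sketch of the 1-bit correction follows, assuming the standard sign-of-Gaussian-projection form of QJL. The coarse rounding used for `key_hat` is only a stand-in for the Stage 1 reconstruction, and the function names are hypothetical. The projection matrix is resampled on every trial purely to show that the estimate averages out to the true score; a real system would fix one matrix and reuse it.

```python
import numpy as np

def qjl_encode(residual, proj):
    """Keep one sign bit per projected coordinate, plus the residual norm."""
    return np.sign(proj @ residual).astype(np.int8), np.linalg.norm(residual)

def qjl_inner_product(query, signs, res_norm, proj):
    """Unbiased estimate of <query, residual> from the 1-bit sketch.

    For a Gaussian row s, E[(s . q) * sign(s . r)] = sqrt(2/pi) * <q, r> / ||r||,
    so rescaling by sqrt(pi/2) * ||r|| removes the bias.
    """
    m = proj.shape[0]
    return np.sqrt(np.pi / 2) * res_norm / m * np.dot(proj @ query, signs)

rng = np.random.default_rng(0)
dim = 64
key = rng.standard_normal(dim)
query = rng.standard_normal(dim)

key_hat = np.round(key * 2) / 2        # stand-in for the Stage 1 reconstruction
residual = key - key_hat

estimates = []
for _ in range(2000):
    proj = rng.standard_normal((dim, dim))
    signs, res_norm = qjl_encode(residual, proj)
    correction = qjl_inner_product(query, signs, res_norm, proj)
    estimates.append(query @ key_hat + correction)

true_score = query @ key
mean_estimate = np.mean(estimates)
print(f"true score {true_score:.3f}, mean estimate {mean_estimate:.3f}")
```

The query stays in full precision here; only the key side is reduced to sign bits, which matches the attention setting where keys are cached and queries are computed fresh.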
The two-stage design is grounded in Shannon’s source coding theory. TurboQuant achieves distortion within a constant factor of approximately 2.7 of the information-theoretic lower bound.
Benchmark Results
Google evaluated TurboQuant across five long-context benchmarks (LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, and L-Eval) using Gemma and Mistral models.
| Bit-width | Memory reduction | Accuracy impact | Speedup (H100) |
|---|---|---|---|
| 4 bits/channel | ~4x vs FP16 | Zero loss | Up to 8x vs FP32 |
| 3.5 bits/channel | ~4.5x vs FP16 | Zero loss | Significant |
| 2.5 bits/channel | ~6x vs FP16 | Marginal degradation | Significant |
At 3.5 bits per channel, TurboQuant achieved perfect scores on needle-in-haystack tasks across all context lengths. At 2.5 bits, there was marginal quality degradation, but memory usage dropped to roughly one-sixth of FP16 storage.
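The memory figures follow directly from the bit-widths. A quick check, assuming a 16-bit baseline and ignoring any per-vector metadata:

```python
FP16_BITS = 16

def memory_reduction(bits_per_channel, baseline_bits=FP16_BITS):
    """Ratio of baseline storage to quantized storage per channel."""
    return baseline_bits / bits_per_channel

for bits in (4, 3.5, 2.5):
    print(f"{bits} bits/channel: {memory_reduction(bits):.2f}x smaller than FP16")
```

This prints 4.00x, 4.57x, and 6.40x, in line with the rounded ~4x, ~4.5x, and ~6x figures reported for the three bit-widths.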
Why It Matters for Inference
The algorithm is data-oblivious, meaning it requires no dataset-specific tuning, calibration, or preprocessing. You do not need to run calibration datasets through the model to configure the quantizer. This makes it suitable for online, real-time KV cache compression during inference, where new key-value pairs are generated continuously and need to be quantized on the fly.
It is also designed to be GPU-friendly. The operations (random rotations, per-coordinate scalar quantization, 1-bit transforms) map naturally to vectorized accelerator instructions. This contrasts with some alternative quantization methods that rely on binary search or codebook lookups that are hard to parallelize on GPUs.
Beyond KV Caches
TurboQuant also applies to vector search, where it compresses database vectors for nearest-neighbor lookups. Google reports that TurboQuant outperforms existing product quantization methods in recall while reducing indexing time to near zero, since there is no codebook training step. For RAG systems and vector databases that rely on high-dimensional similarity search, this could reduce both index size and query latency.
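To illustrate why indexing needs no training step, the sketch below compresses a toy database with a random rotation and a fixed uniform grid, then runs nearest-neighbor search by inner product. The uniform grid is a deliberately simple stand-in for TurboQuant's codebooks, and the dataset is synthetic, so the recall here says nothing about the paper's results.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, bits, n_db, n_queries = 64, 4, 500, 50

# Data-independent setup: a random rotation and a fixed grid. Nothing is trained.
q_mat, r_mat = np.linalg.qr(rng.standard_normal((dim, dim)))
rotation = q_mat * np.sign(np.diag(r_mat))
levels = np.linspace(-3, 3, 2**bits) / np.sqrt(dim)  # covers about 3 std devs

def encode(vectors):
    """Rotate unit vectors and snap each coordinate to the nearest grid level."""
    rotated = vectors @ rotation.T
    return np.abs(rotated[..., None] - levels).argmin(axis=-1).astype(np.uint8)

database = rng.standard_normal((n_db, dim))
database /= np.linalg.norm(database, axis=1, keepdims=True)
codes = encode(database)  # "indexing" is a single pass over the data

# Each query is a noisy copy of a known database vector, so the true
# nearest neighbor is known by construction.
targets = rng.integers(0, n_db, size=n_queries)
queries = database[targets] + 0.05 * rng.standard_normal((n_queries, dim))

# Rotations preserve inner products, so scoring can happen in the rotated space.
scores = (queries @ rotation.T) @ levels[codes].T
recall_at_1 = np.mean(scores.argmax(axis=1) == targets)
print(f"recall@1 with {bits}-bit codes: {recall_at_1:.2f}")
```

Each database vector is stored as 64 four-bit codes (held in `uint8` here for simplicity), and adding a new vector means calling `encode` on it without touching the rest of the index.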
Current Availability
TurboQuant is a research publication. Google has not released an open-source implementation. The algorithm is described in full in the ICLR 2026 paper, with precomputed optimal codebooks for practical bit-widths. The technique is general enough to apply to any transformer-based LLM that uses standard multi-head or grouped-query attention with a KV cache.