Google Is Solving the LLM Memory Bottleneck with TurboQuant
Google Research published TurboQuant, a vector quantization algorithm that compresses LLM key-value caches to 3.5 bits per channel with zero accuracy loss. At 4 bits, it delivers up to 8x speedup in attention scoring on H100 GPUs compared to unquantized 32-bit keys. The paper was presented at ICLR 2026 and is available on arXiv.
The KV cache is one of the main memory bottlenecks in serving large language models. Every generated token requires storing key and value embeddings for all previous tokens across all attention layers. For long-context models, the cache grows linearly with context length and with the number of layers and KV heads, consuming significant GPU memory and creating communication bottlenecks between HBM and SRAM on accelerators.
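To make the scaling concrete, here is a back-of-the-envelope cache-size calculation. The model dimensions below (32 layers, 8 KV heads, head dimension 128, 128K context) are illustrative Llama-style numbers, not figures from the paper:

```python
# Rough KV cache size for a transformer with grouped-query attention.
# Dimensions are hypothetical, for illustration only.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bits_per_value):
    # Factor of 2 covers keys and values; divide by 8 to go from bits to bytes.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bits_per_value / 8

fp16 = kv_cache_bytes(32, 8, 128, 128_000, 16)
q35 = kv_cache_bytes(32, 8, 128, 128_000, 3.5)
print(f"FP16: {fp16 / 1e9:.1f} GB, 3.5-bit: {q35 / 1e9:.1f} GB")
# At 128K context this single request's cache drops from ~16.8 GB to ~3.7 GB.
```

The linear dependence on `seq_len` is why long-context serving is cache-bound rather than weight-bound.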
How TurboQuant Works
TurboQuant compresses vectors through two stages:
Stage 1 (PolarQuant): The algorithm applies a random rotation to input vectors, which induces a concentrated Beta distribution on each coordinate. In high dimensions, this distribution converges to a Gaussian, and distinct coordinates become nearly independent. That independence is what makes the approach work: it allows TurboQuant to apply optimal scalar quantizers per coordinate individually, without needing to account for correlations between coordinates, while still achieving near-optimal distortion.
The rotation also converts coordinate pairs into polar form (radius and angle), which eliminates the need for per-block normalization constants that other quantization methods carry as metadata overhead. Traditional methods typically add 1-2 extra bits per number just for this bookkeeping.
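The concentration effect of Stage 1 is easy to see numerically. The sketch below uses a random orthogonal matrix from a QR decomposition as a stand-in for the paper's rotation (the exact construction TurboQuant uses may differ): even a maximally spiky unit vector spreads into coordinates that look like small, near-independent Gaussians.

```python
import numpy as np

# Illustrative only: a Haar-ish random rotation via QR of a Gaussian matrix.
rng = np.random.default_rng(0)
d = 1024
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal matrix

x = np.zeros(d)
x[0] = 1.0            # worst-case spiky unit vector: all mass in one coordinate
y = Q @ x             # rotated vector (norm is preserved exactly)

# After rotation, each coordinate behaves like N(0, 1/d): the per-coordinate
# std times sqrt(d) is close to 1, and no single coordinate dominates.
print(y.std() * np.sqrt(d))   # ~1
print(np.abs(y).max())        # ~sqrt(2 ln d / d), i.e. ~0.12 here
```

Because every coordinate now follows (approximately) the same known distribution, one fixed scalar codebook works for all of them, which is what removes the per-block scale metadata.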
Stage 2 (Quantized Johnson-Lindenstrauss): MSE-optimal quantizers introduce bias when estimating inner products. TurboQuant addresses this by applying a 1-bit QJL transform to the residual error from Stage 1. This produces an unbiased inner product estimator with low distortion, which is critical because attention scoring in transformers relies on inner products between query and key vectors.
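A minimal sketch of the 1-bit JL idea, in the spirit of QJL: the stored side (a key) is reduced to the signs of random projections, while the query stays full precision. For a Gaussian projection `s`, `E[sign(s·k)(s·q)] = sqrt(2/pi) * <k,q> / ||k||`, which yields an unbiased inner-product estimate after rescaling. The estimator form and dimensions here are illustrative assumptions, not TurboQuant's exact Stage 2:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 64, 50_000                  # many projections, for a low-variance demo
S = rng.standard_normal((m, d))    # shared random projection matrix

key = rng.standard_normal(d)
query = rng.standard_normal(d)

bits = np.sign(S @ key)            # 1 bit per projection stored for the key
# Rescale using E[sign(s.k)(s.q)] = sqrt(2/pi) * <k,q> / ||k||
est = np.linalg.norm(key) * np.sqrt(np.pi / 2) * (bits @ (S @ query)) / m
print(est, key @ query)            # estimate tracks the true inner product
```

Unbiasedness is the point: attention scores are inner products, and a biased estimator would systematically distort the softmax over long contexts even if its variance were small.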
The two-stage design is grounded in Shannon’s source coding theory. TurboQuant achieves distortion rates within a factor of approximately 2.7x of the information-theoretic lower bound.
Benchmark Results
Google evaluated TurboQuant across five long-context benchmarks (LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, and L-Eval) using Gemma and Mistral models.
| Bit-width | Memory reduction | Accuracy | Speedup (H100) |
|---|---|---|---|
| 3.5 bits/channel | ~4.5x vs FP16 | Zero loss | Significant |
| 4 bits/channel | ~4x vs FP16 | Zero loss | Up to 8x vs FP32 |
| 2.5 bits/channel | ~6x vs FP16 | Marginal degradation | Significant |
At 3.5 bits per channel, TurboQuant achieved perfect scores on needle-in-haystack tasks across all context lengths. At 2.5 bits, there was marginal quality degradation, but memory usage dropped to roughly one-sixth of FP16 storage.
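The memory factors in the table follow directly from the bit-widths, ignoring small metadata overhead, since FP16 spends 16 bits per value:

```python
# Memory reduction vs FP16 is just 16 / bits_per_channel.
for bits in (4.0, 3.5, 2.5):
    print(f"{bits} bits/channel -> {16 / bits:.1f}x smaller than FP16")
```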
Why It Matters for Inference
The algorithm is data-oblivious, meaning it requires no dataset-specific tuning, calibration, or preprocessing. You do not need to run calibration datasets through the model to configure the quantizer. This makes it suitable for online, real-time KV cache compression during inference, where new key-value pairs are generated continuously and need to be quantized on the fly.
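The streaming property can be sketched as follows: because the quantizer is fully determined by a random seed and a fixed codebook, each new key can be compressed the moment it is produced, with no calibration pass. The uniform scalar codebook below is a toy stand-in for the paper's optimal codebook, and the rotation is again an illustrative QR-based one:

```python
import numpy as np

rng = np.random.default_rng(42)
d = 128
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # fixed random rotation
# 16-level (4-bit) uniform codebook, scaled to the ~1/sqrt(d) coordinate
# spread of a rotated unit vector. Fixed up front: no data ever touched.
levels = np.linspace(-3, 3, 16) / np.sqrt(d)

def quantize(vec):
    rotated = Q @ vec                              # data-oblivious rotation
    # Nearest codebook level per coordinate -> 4 bits of payload per channel
    return np.abs(rotated[:, None] - levels).argmin(axis=1).astype(np.uint8)

def dequantize(codes):
    return Q.T @ levels[codes]                     # undo the rotation

key = rng.standard_normal(d)
key /= np.linalg.norm(key)                         # unit-norm key vector
codes = quantize(key)                              # quantized on the fly
approx = dequantize(codes)
print(np.linalg.norm(key - approx))                # small reconstruction error
```

Nothing in `quantize` depends on previously seen data, which is exactly what makes per-token, online compression of a growing KV cache possible.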
It is also designed to be GPU-friendly. The operations (random rotations, per-coordinate scalar quantization, 1-bit transforms) map naturally to vectorized accelerator instructions. This contrasts with some alternative quantization methods that rely on binary search or codebook lookups that are hard to parallelize on GPUs.
Beyond KV Caches
TurboQuant also applies to vector search, where it compresses database vectors for nearest-neighbor lookups. Google reports that TurboQuant outperforms existing product quantization methods in recall while reducing indexing time to near zero, since there is no codebook training step. For RAG systems and vector databases that rely on high-dimensional similarity search, this could reduce both index size and query latency.
Current Availability
TurboQuant is a research publication. Google has not released an open-source implementation. The algorithm is described in full in the ICLR 2026 paper, with precomputed optimal codebooks for practical bit-widths. The technique is general enough to apply to any transformer-based LLM that uses standard multi-head or grouped-query attention with a KV cache.