Google's TurboQuant cuts LLM KV cache memory requirements by at least six times — up to 8x performance boost on Nvidia H100 GPUs, compresses KV caches to 3 bits with no accuracy loss
The algorithm achieves up to an eight-times performance boost over unquantized keys on Nvidia H100 GPUs.
Google Research published TurboQuant on Tuesday, a training-free compression algorithm that quantizes LLM KV caches down to 3 bits without any loss in model accuracy. In benchmarks on Nvidia H100 GPUs, 4-bit TurboQuant delivered up to an eight-times performance increase in computing attention logits compared to unquantized 32-bit keys, while reducing KV cache memory by at least six times.
KV caches store previously computed attention data so that LLMs don't have to recompute it at each token-generation step. These caches are becoming a major memory bottleneck as context windows grow, and while traditional vector quantization methods can shrink them, they introduce a small overhead of a few extra bits per value: the quantization constants that must be stored alongside the compressed data. That sounds negligible, but the overhead compounds as context windows get longer.
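To see where those extra bits come from, here is a generic per-block quantizer of the kind the article describes. The block size, 3-bit payload, and fp16 constants below are illustrative assumptions, not details from Google's paper:

```python
import numpy as np

def blockwise_quantize(x, bits=3, block=32):
    """Conventional per-block quantization: every `block` values share a
    scale and zero-point that must be stored alongside the codes."""
    levels = 2 ** bits - 1
    x = x.reshape(-1, block)
    lo = x.min(axis=1, keepdims=True)                     # stored constant
    scale = (x.max(axis=1, keepdims=True) - lo) / levels  # stored constant
    scale[scale == 0] = 1.0                               # guard flat blocks
    codes = np.round((x - lo) / scale).astype(np.uint8)
    return codes, scale, lo

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 32))
codes, scale, lo = blockwise_quantize(x)
dequant = codes * scale + lo

# Two fp16 constants (scale, zero-point) per 32-value block:
overhead_bits_per_value = 2 * 16 / 32   # = 1 extra bit per value
```

With a 3-bit payload, that stored-constant overhead is a third of the budget, which is the cost TurboQuant is designed to eliminate.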
TurboQuant eliminates that overhead via a two-stage process. The first uses a technique called PolarQuant, which converts data vectors from standard Cartesian coordinates into polar coordinates. This separates each vector into a radius (representing magnitude) and a set of angles (representing direction). Because the angular distributions are predictable and concentrated, PolarQuant skips the expensive per-block normalization step that conventional quantizers require. This leads to high-quality compression with zero overhead from stored quantization constants.
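The polar idea can be sketched in a few lines of NumPy: split each vector into a radius and a set of angles, then quantize the angles on a fixed grid whose range is known in advance, so no per-block scale or zero-point needs to be stored. This is only an illustration of the concept, not Google's implementation; the radius is left uncompressed here for simplicity:

```python
import numpy as np

def to_polar(v):
    """Cartesian -> hyperspherical: one radius plus (d - 1) angles."""
    r = np.linalg.norm(v)
    tail = np.sqrt(np.cumsum(v[::-1] ** 2)[::-1])  # norms of suffixes
    angles = np.arctan2(tail[1:], v[:-1])          # each in [0, pi]
    angles[-1] = np.arctan2(v[-1], v[-2])          # last angle is signed
    return r, angles

def from_polar(r, angles):
    """Inverse transform: rebuild the Cartesian vector."""
    d = len(angles) + 1
    v = np.empty(d)
    sin_prod = 1.0
    for k in range(d - 1):
        v[k] = r * sin_prod * np.cos(angles[k])
        sin_prod *= np.sin(angles[k])
    v[-1] = r * sin_prod
    return v

def quantize_angles(angles, bits=3):
    """Fixed uniform grid over (-pi, pi]: the range is known a priori,
    so nothing data-dependent has to be stored (the zero-overhead idea)."""
    levels = 2 ** bits
    codes = np.floor((angles + np.pi) / (2 * np.pi) * levels)
    return np.clip(codes, 0, levels - 1).astype(np.uint8)

def dequantize_angles(codes, bits=3):
    levels = 2 ** bits
    return (codes + 0.5) * (2 * np.pi) / levels - np.pi
```

Because the grid is fixed, the decoder needs only the codes themselves, which is what lets this style of quantizer skip the per-block constants entirely.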
The second stage applies a 1-bit error correction layer using an algorithm called Quantized Johnson-Lindenstrauss (QJL). QJL projects the residual quantization error into a lower-dimensional space and reduces each value to a single sign bit, eliminating systematic bias in attention score calculations at negligible additional cost.
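The sign-bit step can be sketched with the textbook one-bit Johnson-Lindenstrauss construction: project with a shared random Gaussian matrix, keep only the signs, and rescale by sqrt(pi/2) when estimating inner products. The projection dimension and estimator constants here are generic assumptions, not Google's exact choices:

```python
import numpy as np

def qjl_encode(residual, proj_dim=64, seed=0):
    """Project the quantization residual with a random Gaussian matrix
    (a Johnson-Lindenstrauss map) and keep one sign bit per projected
    coordinate."""
    g = np.random.default_rng(seed).normal(size=(proj_dim, residual.size))
    return g @ residual >= 0.0   # boolean sign bits

def qjl_inner_estimate(bits, query, r, seed=0):
    """Estimate <residual, query> from the sign bits alone, using the
    standard sqrt(pi/2) debiasing factor for one-bit Gaussian sketches;
    `r` is the stored residual norm."""
    proj_dim = bits.size
    g = np.random.default_rng(seed).normal(size=(proj_dim, query.size))
    signs = np.where(bits, 1.0, -1.0)
    return r * np.sqrt(np.pi / 2) / proj_dim * (signs @ (g @ query))
```

The estimate is unbiased in expectation, which is how a single bit per projected coordinate can cancel the systematic error left over from the first stage.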
Google tested the algorithm across long-context benchmarks, including LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, and L-Eval, using the open-source models Gemma and Mistral. TurboQuant achieved perfect downstream scores on needle-in-a-haystack retrieval tasks while compressing KV memory by at least six times. On the LongBench suite, which covers question answering, code generation, and summarization, TurboQuant matched or outperformed the KIVI baseline across all tasks.
The algorithm also showed strong results in vector search. Evaluated against Product Quantization and RaBitQ on the GloVe dataset, TurboQuant achieved the highest 1@k recall ratios despite those baselines relying on larger codebooks and dataset-specific tuning. Google noted that TurboQuant requires no training or fine-tuning and incurs negligible runtime overhead, making it suitable for deployment in production inference and large-scale vector search systems.
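For context, 1@k recall measures how often a query's true nearest neighbor appears among the top-k candidates returned from the compressed index. A minimal sketch of the metric:

```python
import numpy as np

def recall_1_at_k(true_nn, retrieved, k):
    """Fraction of queries whose true nearest neighbor appears among the
    top-k candidates returned by the compressed-domain search."""
    return float(np.mean([t in r[:k] for t, r in zip(true_nn, retrieved)]))

# Two queries: the first query's true neighbor (id 0) is in its top-2,
# the second query's (id 1) is not.
score = recall_1_at_k([0, 1], [[0, 2, 3], [5, 6, 1]], k=2)  # -> 0.5
```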
The paper, co-authored by research scientist Amir Zandieh and VP Vahab Mirrokni, will be presented at ICLR 2026 next month.
Follow 3DTested on Google News, or add us as a preferred source, to get our latest news, analysis, & reviews in your feeds.
