TurboQuant: How Google Is Shrinking AI Models Without Breaking Them

What you'll learn

Why AI chatbots have a memory problem
What compression (quantization) does and why the current fix is clunky
How TurboQuant solves it with two clever tricks
What changes for you as someone who uses AI tools

Google Research recently published a paper called TurboQuant that shrinks the working memory AI models need during a conversation by 6x, with no loss of accuracy in the reported tests. Here’s what that actually means.

The memory problem

Every time you chat with an AI (ChatGPT, Gemini, Claude), the model keeps a running memory of your conversation so far. It needs this to give you relevant responses instead of starting from scratch with every message.

That running memory works like a scratch pad. It is called the KV cache (key-value cache), and it’s what lets the AI “remember” context without re-reading your entire conversation for every word it generates.

The bottleneck

The longer your conversation (or the bigger the document), the more memory the scratch pad consumes. At some point, the AI runs out of memory space before it runs out of intelligence. That's one reason AI tools have "context length" limits.
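To make the growth concrete, here is a rough back-of-the-envelope sketch. The model shape (32 layers, 8 key/value heads, 128 numbers per head) is a made-up example, not any specific product, and the helper name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bits_per_number):
    """Approximate KV cache size for one conversation.

    Each token stores one key vector and one value vector (hence the
    factor 2) for every layer and every key/value head.
    """
    numbers_stored = 2 * num_layers * num_kv_heads * head_dim * num_tokens
    return numbers_stored * bits_per_number / 8


# Hypothetical model shape, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

for tokens in (1_000, 32_000, 128_000):
    size = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, tokens, bits_per_number=16)
    print(f"{tokens:>7} tokens -> {size / 2**30:.2f} GiB at 16 bits per number")
```

The cache grows in a straight line with the number of tokens, so a conversation 100 times longer needs 100 times the scratch pad.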

The current fix: make the numbers smaller

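The standard solution is called quantization - storing each number with fewer bits so it takes up less space. The catch: to decode the numbers later, conventional methods also have to store a few extra constants for every small block of data. Think of those constants as a cheat sheet.

It's like packing your clothes into a smaller suitcase, but then needing a second bag just to hold the packing instructions. The savings get eaten up by the overhead.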
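A minimal sketch of where the "packing instructions" come from, assuming a simple uniform quantizer that works on blocks of numbers. The block size of 32 and the 16-bit constants are illustrative assumptions, and the function names are hypothetical.

```python
def quantize_block(values, bits):
    """Uniformly quantize one block. Returns integer codes plus the two
    constants (minimum and step size) needed to decode them later."""
    levels = 2**bits - 1
    lo, hi = min(values), max(values)
    step = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / step) for v in values]
    return codes, lo, step


def dequantize_block(codes, lo, step):
    """Rebuild approximate values from the codes and the stored constants."""
    return [lo + c * step for c in codes]


def effective_bits(bits, block_size, constant_bits=16, constants_per_block=2):
    """Bits per number once the per-block constants are counted."""
    return bits + constants_per_block * constant_bits / block_size


block = [0.12, -0.40, 0.33, 0.05, -0.21, 0.48, -0.02, 0.27]
codes, lo, step = quantize_block(block, bits=3)
restored = dequantize_block(codes, lo, step)
print(codes)                                   # eight small integers, each 0..7
print(effective_bits(bits=3, block_size=32))   # 4.0
```

Under these assumptions, a "3-bit" scheme really costs 4 bits per number, because `lo` and `step` have to travel along with every block.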

How TurboQuant fixes this

TurboQuant removes the need for that overhead (the “cheat sheet” of decoding constants) entirely. It does this in two steps.

Step 1

PolarQuant - describe the data differently

Normally, data is described using grid coordinates - "go 3 blocks east, then 4 blocks north." PolarQuant switches to a different system - "go 5 blocks at a bearing of about 37 degrees east of north." Same destination, different description.

Why this helps: When data is described as angles and distances, the angles cluster together in predictable patterns. Predictable patterns don't need a cheat sheet - the system already knows what to expect. No cheat sheet = no overhead.
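The change of coordinates can be sketched in a few lines. This only illustrates the idea of describing a point by distance and angle and then snapping the angle to a fixed grid. It is not the actual PolarQuant algorithm, and the 4-bit grid and function names are assumptions. Note that the 37 degrees in the street example is a compass bearing measured from north, while `math.atan2` measures from east.

```python
import math


def to_polar(x, y):
    """Convert grid coordinates to (distance, angle in radians from east)."""
    return math.hypot(x, y), math.atan2(y, x)


def quantize_angle(theta, bits):
    """Snap an angle to one of 2**bits evenly spaced directions.
    The grid is the same for every vector, so only the integer code
    is stored, with no per-block constants."""
    levels = 2**bits
    step = 2 * math.pi / levels
    return round(theta / step) % levels


def dequantize_angle(code, bits):
    """Turn an integer code back into an angle in radians."""
    return code * 2 * math.pi / 2**bits


radius, theta = to_polar(3, 4)           # 3 blocks east, 4 blocks north
bearing = 90 - math.degrees(theta)       # compass bearing, measured from north
print(radius, round(bearing))            # 5.0 37

code = quantize_angle(theta, bits=4)
approx = dequantize_angle(code, bits=4)
print(radius * math.cos(approx), radius * math.sin(approx))  # close to (3, 4)
```

With only 16 possible directions the reconstruction is coarse. The point is that the decoder needs nothing beyond the code itself, because the grid of angles is known in advance.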

Step 2

QJL - clean up the tiny errors left behind

Step 1 handles about 90% of the work. The small errors left over get cleaned up by a 1-bit algorithm called QJL (Quantized Johnson-Lindenstrauss). It reduces each leftover error to a simple thumbs-up (+1) or thumbs-down (-1). No extra storage needed.

Analogy: Like running spell-check after you've already edited a document. Fast, lightweight, catches what the main edit missed.
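For the curious, here is a hedged sketch of the 1-bit idea: project a vector onto random directions, keep only the sign of each projection, and still recover a useful estimate of a dot product. The sizes, the seed, and the function names are illustrative assumptions. The sketch also keeps the vector's length as a single number, and it is not the paper's exact procedure.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sign_sketch(vec, projections):
    """Keep only the sign (+1 or -1) of each random projection of vec."""
    return [1 if dot(row, vec) >= 0 else -1 for row in projections]


def estimate_dot(query, signs, vec_norm, projections):
    """Estimate dot(query, vec) from the 1-bit sketch of vec.

    For Gaussian projections s, the average of (s . q) * sign(s . v)
    tends to sqrt(2/pi) * dot(q, v) / |v|, so we rescale by
    sqrt(pi/2) * |v| to undo that factor.
    """
    total = sum(dot(row, query) * s for row, s in zip(projections, signs))
    return math.sqrt(math.pi / 2) * vec_norm * total / len(projections)


rng = random.Random(0)                  # fixed seed for repeatability
dim, num_projections = 16, 4000         # illustrative sizes
projections = [[rng.gauss(0, 1) for _ in range(dim)]
               for _ in range(num_projections)]

residual = [rng.uniform(-0.1, 0.1) for _ in range(dim)]  # stand-in leftover error
query = [rng.uniform(-1, 1) for _ in range(dim)]

signs = sign_sketch(residual, projections)
norm = math.sqrt(dot(residual, residual))
exact = dot(query, residual)
estimate = estimate_dot(query, signs, norm, projections)
print(exact, estimate)                  # the two values should be close
```

Each sign costs one bit, and the estimate gets steadier as the number of projections grows.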

The results

Google tested TurboQuant on open-source AI models (Gemma and Mistral) across multiple benchmarks, including “needle in a haystack” tests - finding one specific fact buried inside thousands of pages of text. It passed them all perfectly.

6x smaller memory
Zero accuracy lost
Faster attention
Zero retraining needed

It also speeds up search

TurboQuant isn’t just for chatbots. It also helps with vector search - the technology behind “find me something similar” in search engines and recommendation systems.

When a search engine understands the meaning of a search query (not just matching keywords), it compares that query against millions of stored data points called vectors. Compressing those vectors means faster searches and less hardware. TurboQuant outperformed existing methods here too.
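As a toy illustration of what "comparing a query against stored vectors" means, the sketch below ranks three made-up items by cosine similarity. The vectors, the item names, and the `search` helper are all invented for the example. Real systems use vectors with hundreds of numbers and far faster indexes.

```python
import math


def cosine_similarity(a, b):
    """Similarity of direction between two vectors (1.0 means identical)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query, stored, top_k=2):
    """Return the names of the top_k stored vectors most similar to query."""
    ranked = sorted(stored,
                    key=lambda name: cosine_similarity(query, stored[name]),
                    reverse=True)
    return ranked[:top_k]


# Made-up 3-number "meaning" vectors, for illustration only.
stored = {
    "running shoes": [0.9, 0.1, 0.0],
    "trail sneakers": [0.8, 0.3, 0.1],
    "pasta recipe": [0.0, 0.2, 0.9],
}
query = [0.85, 0.2, 0.05]   # imagined vector for "jogging footwear"
results = search(query, stored)
print(results)
```

Every comparison touches every number in every stored vector, which is why storing fewer bits per number translates into faster searches and less hardware.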

What changes for you

📄

Longer context windows

AI tools will be able to process much bigger documents without hitting limits

⚡

Faster responses

Less memory pressure means the AI can generate answers more quickly

💰

Cheaper AI services

More users per GPU means lower costs that get passed down to end users

🔍

Better search

Semantic search that understands meaning, not just keywords, becomes faster and cheaper at scale

The one-line takeaway

TurboQuant compresses an AI model's working memory down to about 3 bits per number, roughly 6x smaller, and loses nothing in Google's tests. That's like replacing a filing cabinet with a sticky note that somehow holds the same information.

Go deeper

Google Research blog post - the original announcement
TurboQuant paper (ICLR 2026) - full technical details
PolarQuant paper (AISTATS 2026) - the polar coordinate compression method