Do Transformers Really Need Three Projections?

This presentation examines a fundamental assumption in transformer architecture: that self-attention needs three separate learned projections for query, key, and value. Through systematic experiments across synthetic tasks, vision benchmarks, and language models of up to 1.2 billion parameters, the research finds that sharing the key and value projections (the Q-K=V variant, in which the query stays independent and the key equals the value) delivers a 50% reduction in inference cache with less than 3% perplexity degradation at the largest scale, challenging a long-standing architectural convention and offering immediate benefits for memory-constrained deployment.

Script

Every transformer you've used today, from GPT to your translation app, computes attention using three separate weight matrices: query, key, and value. But what if two of them are redundant?

The researchers tested three projection-sharing variants across synthetic reasoning, vision, and language tasks. Sharing key and value projections cuts the inference cache in half, while collapsing all three projections catastrophically degrades quality by over 25%.

The Q-K=V architecture, in which key and value share a single projection, preserves attention asymmetry by keeping the query projection independent. This directional bias turns out to be critical for causal language modeling, where the model must distinguish past context from future predictions.
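To make the shared key-value idea concrete, here is a minimal single-head sketch in NumPy. It is an illustration under stated assumptions, not the authors' implementation: the function name `shared_kv_attention`, the weight names `w_q` and `w_kv`, and the toy shapes are all hypothetical.

```python
import numpy as np

def shared_kv_attention(x, w_q, w_kv, causal=True):
    """Single-head attention where one projection serves as both key and value.

    Hypothetical sketch: names and shapes are illustrative assumptions.
    """
    q = x @ w_q      # independent query projection
    kv = x @ w_kv    # shared projection, used as K and as V
    d = q.shape[-1]
    scores = q @ kv.T / np.sqrt(d)
    if causal:
        # Mask positions j > i so each token attends only to itself and the past.
        seq_len = x.shape[0]
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ kv, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))      # 5 tokens, model width 8
w_q = rng.normal(size=(8, 4))
w_kv = rng.normal(size=(8, 4))
out, weights = shared_kv_attention(x, w_q, w_kv)
```

Because `kv` plays both roles, only one tensor per token has to be kept for later decoding steps, which is where the cache saving comes from.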

At 300 million parameters, Q-K=V incurs a 3.1% perplexity degradation. Scaling to 1.2 billion parameters, that gap shrinks to 2.48%. When combined with grouped query attention, cache reduction reaches 96.9% with less than 5% quality loss.
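The cache figures follow from simple counting, sketched below. The head counts are an assumption: the script does not state the grouping used, and 16 query heads sharing one key-value head is picked here only because it gives a figure that rounds to 96.9%. The helper `kv_cache_fraction` is hypothetical.

```python
def kv_cache_fraction(num_query_heads, num_kv_heads, share_kv):
    """Cache size relative to multi-head attention with separate K and V.

    Hypothetical helper; the head counts used below are assumptions.
    """
    tensors_cached = 1 if share_kv else 2   # one shared K=V tensor vs. K and V
    baseline = 2 * num_query_heads          # standard MHA caches K and V per head
    return tensors_cached * num_kv_heads / baseline

# Key-value sharing alone: half the cache.
sharing_only = kv_cache_fraction(16, 16, share_kv=True)

# Sharing plus an assumed grouping of 16 query heads per key-value head.
with_grouping = kv_cache_fraction(16, 1, share_kv=True)
reduction_pct = 100 * (1 - with_grouping)
```

Under these assumptions the two savings multiply: a factor of 2 from sharing and a factor of 16 from grouping leave 1/32 of the original cache.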

Why does key-value sharing work while query-key sharing fails? The authors measured high cosine similarity between trained key and value projections, revealing functional redundancy. Meanwhile, forcing query to equal key makes the raw attention scores symmetric, so token i scores token j exactly as token j scores token i, which breaks the directional structure that causal language models depend on.
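Both observations can be illustrated with a small NumPy sketch. The random weights stand in for trained ones, and `cosine_similarity` is only one plausible way to compare two projection matrices; the script does not say how the authors computed their measurement.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two weight matrices, treated as flat vectors."""
    return float(np.vdot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))      # 6 tokens, model width 8 (toy sizes)
w_q = rng.normal(size=(8, 4))
w_k = rng.normal(size=(8, 4))

# Tying query to key: scores[i, j] == scores[j, i] for every pair of tokens.
tied_scores = (x @ w_q) @ (x @ w_q).T
# Independent query and key: scores are generally asymmetric.
free_scores = (x @ w_q) @ (x @ w_k).T

tied_is_symmetric = bool(np.allclose(tied_scores, tied_scores.T))
free_is_symmetric = bool(np.allclose(free_scores, free_scores.T))
```

The tied case is symmetric by construction, since it is a matrix multiplied by its own transpose; an independent query projection is what lets token i weigh token j differently from how j weighs i.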

This work suggests that three projections are not fundamental to attention, just convenient. For edge deployment and long-context serving, Q-K=V is already practical. Explore the full paper and create your own video summaries at EmergentMind.com.