Lost in Backpropagation: The LM Head is a Gradient Bottleneck
This presentation examines a critical architectural flaw in modern language models: the LM head destroys up to 99% of the gradient signal during training. Through theoretical analysis and empirical evidence, we explore how this bottleneck slows convergence, impairs learning of even trivial patterns, and has profound implications for large-scale pretraining efficiency.

Script
Language models can discard up to 99% of their learning signal during training. The culprit isn't the architecture's depth or attention mechanism—it's the final output layer itself, systematically destroying gradient information as it flows backward through the network.
The standard language model head creates a severe dimensional mismatch. With hidden dimensions around 4,000 but vocabularies exceeding 50,000 tokens, the output layer forces the vocabulary-sized gradient back through a hidden-sized projection that can preserve only a small fraction of the supervision signal.
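The mismatch can be made concrete with a small numerical sketch (the dimensions here are illustrative, not taken from any specific model): the gradient at the logits lives in V-dimensional space, but backpropagating through the head matrix maps any component outside the head's d-dimensional column span to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
d, V = 64, 1000  # illustrative hidden size and vocabulary size

W = rng.normal(size=(V, d)) / np.sqrt(d)  # LM head: hidden state -> logits
g_logits = rng.normal(size=V)             # gradient of the loss w.r.t. the logits

# Backprop through the head computes W.T @ g_logits, so any component of
# g_logits orthogonal to the column span of W contributes nothing.
Q, _ = np.linalg.qr(W)                    # orthonormal basis of span(W), shape (V, d)
g_surviving = Q @ (Q.T @ g_logits)        # projection onto that d-dim subspace

frac = np.sum(g_surviving**2) / np.sum(g_logits**2)
print(f"fraction of gradient energy preserved: {frac:.3f} (d/V = {d / V:.3f})")
```

For a generic gradient direction the preserved fraction concentrates around d/V, so holding the hidden size near 4,000 while growing the vocabulary past 50,000 pushes the surviving share below 10%.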
This architectural choice has consequences that extend far beyond simple expressivity limits.
While previous work identified the expressivity bottleneck, this research reveals a second, more severe problem. The LM head doesn't just limit which distributions the model can represent—it actively destroys gradient information during backpropagation, creating an optimization bottleneck that slows learning regardless of the model's representational capacity.
The SpamLang experiment isolates the optimization effect with surgical precision. Models were trained on sequences where each sample simply repeats one symbol—a pattern so trivial it should be learnable immediately. Yet as vocabulary size grew beyond 100,000 tokens, models failed entirely, unable to learn even this elementary rule because the gradient bottleneck had become insurmountable.
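As described, a SpamLang sample is nothing more than one symbol repeated for the whole sequence, so a generator fits in a few lines (a sketch of the data as described, not the authors' released code):

```python
import numpy as np

def spamlang_sample(vocab_size: int, length: int, rng: np.random.Generator) -> np.ndarray:
    """One SpamLang sample: a single randomly chosen symbol, repeated."""
    symbol = rng.integers(vocab_size)
    return np.full(length, symbol, dtype=np.int64)

rng = np.random.default_rng(0)
seq = spamlang_sample(vocab_size=100_000, length=8, rng=rng)

# The correct next-token rule is the identity: always predict the current
# token. With a hidden size far below vocab_size, the gradient bottleneck
# at the head makes even this trivial rule hard to learn.
```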
This isn't an artifact of one architecture. Analysis of major model families reveals the same pattern: 95 to 99% gradient loss, with supervision signal scattered into uninformative directions. The effect persists whether embeddings are tied or untied, and downstream performance continues improving even at large output dimensions—evidence that we haven't yet hit the expressivity ceiling, only the optimization one.
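A back-of-envelope directional count lands in the same ballpark as those percentages. The GPT-2 hidden sizes and vocabulary below are the well-known public configurations; the 1 − d/V heuristic itself is an illustration, not the paper's measurement:

```python
# Heuristic: if only a d-dimensional slice of the V-dimensional logit
# gradient can reach the hidden state, roughly 1 - d/V of the gradient
# directions are lost at the head.
configs = {
    "GPT-2 small":  (768,  50_257),
    "GPT-2 medium": (1024, 50_257),
    "GPT-2 XL":     (1600, 50_257),
}
for name, (d, V) in configs.items():
    lost = 1 - d / V
    print(f"{name}: ~{100 * lost:.1f}% of gradient directions lost")
```

Every entry falls in the 96–99% range, matching the order of magnitude reported across model families.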
The LM head bottleneck suggests that modern language models are training with one hand tied behind their back, discarding most of their learning signal at every step. Addressing this architectural constraint could unlock substantial efficiency gains without requiring more data or compute—just better gradient flow. Visit EmergentMind.com to explore this research further and create your own AI-narrated presentations.