Learn from Your Own Latents: Why Predicting Internal Representations Beats Predicting Tokens
This presentation unpacks a theoretical result: self-supervised learning that predicts latent representations instead of raw tokens can be exponentially more sample-efficient. Using a formal hierarchical grammar model, the authors prove that the sample complexity of latent-based objectives is independent of hierarchy depth, while token-level pretraining pays a penalty that grows exponentially with depth. The talk walks through the key mechanisms, empirical validations, and implications for both artificial and biological learning.
Script
Modern deep networks need vastly more training data than humans to learn the same tasks. This paper proves, within a formal model of hierarchical data, that token-level pretraining is a major culprit, and that learning from your own latent representations instead can cut sample complexity exponentially.
The authors use a context-free grammar model with a latent hierarchy of depth L. The sample complexity of token-based self-supervision grows exponentially with L, on the order of m^(L+1) samples. Latent-based learning, by contrast, needs only on the order of m cubed samples, independent of depth.
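To make the gap concrete, the two scalings can be compared numerically. This is a minimal sketch that assumes only the leading-order forms m^(L+1) for token-level learning and m^3 for latent-level learning, drops all constants and other grammar parameters, and picks m = 4 purely for illustration. The function names are hypothetical.

```python
def token_level_samples(m: int, L: int) -> int:
    """Leading-order sample complexity for token prediction: m^(L+1)."""
    return m ** (L + 1)


def latent_level_samples(m: int, L: int) -> int:
    """Leading-order sample complexity for latent prediction: m^3, no dependence on L."""
    return m ** 3


m = 4  # illustrative value, not taken from the paper
rows = []
for L in range(2, 7):
    tok, lat = token_level_samples(m, L), latent_level_samples(m, L)
    rows.append((L, tok, lat, tok // lat))  # ratio equals m^(L-2)
    print(f"L={L}: token ~ {tok:>6}, latent ~ {lat}, ratio ~ {tok // lat}")
```

The ratio column is m^(L-2), so each extra level of hierarchy multiplies the token-level cost by another factor of m while the latent-level cost stays flat.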
The key mechanism is recursive latent clustering. At each level, the algorithm groups latent tuples by their empirical context vectors, recovering the hierarchical grammar structure layer by layer. This process propagates improved predictions upward through the network.
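The recursive clustering idea can be sketched in a few lines. This is not the authors' algorithm: it is a minimal illustration that assumes a binary hierarchy (every rule produces a pair of symbols), defines a pair's context as the sibling pair under the same grandparent, and swaps in a simple greedy L1-threshold clustering. The toy grammar `RULES` and all function names are hypothetical. Pairs generated by the same parent see the same distribution of siblings, so they end up with the same latent id, and the relabelled sequences feed the next level up.

```python
from collections import Counter, defaultdict
from itertools import product

# Hypothetical toy grammar with binary rules, two levels above the leaves.
RULES = [
    {0: [(0, 0)], 1: [(1, 1)]},                  # root -> pair of level-1 symbols
    {0: [(0, 0), (1, 1)], 1: [(0, 1), (1, 0)]},  # level-1 symbol -> two synonymous leaf pairs
]


def all_leaf_strings(rules, roots=(0, 1)):
    """Enumerate every leaf string the toy grammar can generate, each once."""
    seqs = [[r] for r in roots]
    for level in rules:
        expanded = []
        for seq in seqs:
            for choice in product(*(level[sym] for sym in seq)):
                expanded.append([x for pair in choice for x in pair])
        seqs = expanded
    return seqs


def context_vectors(seqs):
    """Count, for each pair, which pair sits beside it under the same grandparent."""
    ctx = defaultdict(Counter)
    for seq in seqs:
        pairs = [tuple(seq[i:i + 2]) for i in range(0, len(seq), 2)]
        for j, pair in enumerate(pairs):
            side = "L" if j % 2 == 0 else "R"  # is this pair the left or right child?
            ctx[pair][(side, pairs[j ^ 1])] += 1
    return ctx


def normalize(counter):
    total = sum(counter.values())
    return {k: v / total for k, v in counter.items()}


def l1(p, q):
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def cluster_pairs(ctx, threshold=0.2):
    """Greedily give pairs with (nearly) equal context distributions the same latent id."""
    reps, labels = [], {}
    for pair in sorted(ctx):
        vec = normalize(ctx[pair])
        for cid, rep in enumerate(reps):
            if l1(vec, rep) < threshold:
                labels[pair] = cid
                break
        else:
            labels[pair] = len(reps)
            reps.append(vec)
    return labels


def recover_hierarchy(seqs, threshold=0.2):
    """Cluster pairs into latents level by level until no sibling pairs remain."""
    levels = []
    while len(seqs[0]) >= 4:  # need at least two pairs for a sibling context
        labels = cluster_pairs(context_vectors(seqs), threshold)
        levels.append(labels)
        seqs = [[labels[tuple(s[i:i + 2])] for i in range(0, len(s), 2)] for s in seqs]
    return levels, seqs


levels, top = recover_hierarchy(all_leaf_strings(RULES))
print(levels[0])  # synonymous leaf pairs share a latent id
print(top[0], top[-1])
```

Here the leaf pairs (0, 0) and (1, 1) are merged into one latent, and (0, 1) and (1, 0) into another, which matches the grammar's level-1 symbols. With sampled rather than enumerated data, the threshold would have to absorb sampling noise in the context estimates.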
Experiments confirm the theory. Across hierarchy depths, recovery accuracy collapses onto a single curve when sample size is rescaled by m cubed. The same scaling emerges in data2vec, a real self-supervised architecture that predicts its own encoder representations.
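For readers unfamiliar with data2vec, the core of "predicting its own encoder representations" is how the regression targets are built. This is a minimal pure-Python sketch, assuming scalar features per position: a teacher whose weights are an exponential moving average of the student's, targets formed by averaging the teacher's top-k layer outputs, and a loss computed only at masked positions. It omits the per-layer normalization and the exact loss function data2vec uses, and the function names are hypothetical.

```python
def ema_update(teacher, student, tau=0.999):
    """Move each teacher weight a small step toward the matching student weight."""
    return [tau * t + (1.0 - tau) * s for t, s in zip(teacher, student)]


def latent_targets(teacher_layers, k):
    """Average the teacher's top-k layer outputs at every position."""
    top = teacher_layers[-k:]
    return [sum(layer[i] for layer in top) / k for i in range(len(top[0]))]


def masked_regression_loss(student_preds, targets, masked_positions):
    """Mean squared error, computed only at positions hidden from the student."""
    errs = [(student_preds[i] - targets[i]) ** 2 for i in masked_positions]
    return sum(errs) / len(errs)


# Toy usage: three teacher layers, two positions, one scalar feature each.
teacher_layers = [[1.0, 1.0], [2.0, 4.0], [4.0, 8.0]]
targets = latent_targets(teacher_layers, k=2)  # averages the last two layers
loss = masked_regression_loss([3.0, 0.0], targets, masked_positions=[1])
teacher = ema_update([1.0, 2.0], [0.0, 0.0], tau=0.9)
print(targets, loss, teacher)
```

Because the targets come from the model's own (slowly updated) latents rather than from raw tokens, this objective sits on the latent-prediction side of the comparison the talk describes.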
Ablations reveal that local, layer-wise learning rules achieve the same sample complexity as end-to-end gradient descent. This suggests a biologically plausible mechanism for the superior data efficiency observed in human learners.
This work formalizes the advantage of learning from latents over learning from tokens, with immediate implications for designing data-efficient generative models. To explore more theoretical breakthroughs like this one and create your own research videos, visit EmergentMind.com.