A Mathematical Theory of Memory: Unifying Deep Learning Through Compression

This presentation explores a comprehensive theoretical framework that reconceptualizes deep representation learning as a principled process of memory formation through lossy compression. The authors unify classical methods like PCA with modern deep networks by showing both are solving the same fundamental problem: extracting low-dimensional structure from high-dimensional data. The framework provides rigorous mathematical guarantees for optimization, explains the phase transitions between memorization and generalization, and offers practical design principles for neural architectures grounded in information theory rather than trial and error.
Script
What if every learning system, from classical PCA to the most complex transformer, is fundamentally performing the same mathematical operation: forming memory by compressing data? This paper presents a unified theory of representation learning where intelligence itself is the art of efficient coding.
The authors ground their framework in a single empirical fact: natural data, whether images, text, or motion, concentrates on low-dimensional manifolds embedded in high-dimensional space. Every representation learning system, classical or modern, is trying to discover and parameterize these hidden manifolds.
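As a rough illustration of the low-dimensional claim above, the following minimal sketch builds synthetic data that lies near a 2-dimensional subspace of a 50-dimensional space and checks how much variance PCA captures with two components. The data, the dimensions, the noise level, and the helper name `explained_variance_share` are all assumptions made for this example; none of them come from the paper.

```python
import numpy as np

def explained_variance_share(X, k):
    """Fraction of total variance captured by the top-k principal components."""
    Xc = X - X.mean(axis=0)                   # center each feature
    s = np.linalg.svd(Xc, compute_uv=False)   # singular values, in descending order
    variances = s ** 2                        # proportional to per-component variance
    return variances[:k].sum() / variances.sum()

# Hypothetical synthetic setup: 500 points near a 2-D subspace of a 50-D space.
rng = np.random.default_rng(0)
n_samples, ambient_dim, latent_dim = 500, 50, 2
basis, _ = np.linalg.qr(rng.standard_normal((ambient_dim, latent_dim)))  # orthonormal columns
latent = rng.standard_normal((n_samples, latent_dim))
X = latent @ basis.T + 0.01 * rng.standard_normal((n_samples, ambient_dim))

share = explained_variance_share(X, latent_dim)
print(f"variance captured by top {latent_dim} components: {share:.4f}")
```

Real images or text are not linear subspaces, so this only shows the simplest case, the one PCA handles; the script's claim is that deep networks pursue the same kind of structure when it is nonlinear.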
Here is where the theory becomes radical: deep networks are not black boxes but unrolled optimization algorithms for compression. Each layer performs one incremental step that reduces the coding rate, so an otherwise opaque architecture can be read as a transparent implementation of entropy minimization with provable convergence guarantees.
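To make "coding rate" concrete, here is a minimal sketch of one way the quantity can be computed. It assumes the log-determinant form used in the rate reduction literature, R(Z, eps) = 1/2 log det(I + d/(n eps^2) Z Z^T) for features Z with d rows and n columns; the script itself does not state a formula, so treat this choice, the value of `eps`, and the toy data as assumptions. The sketch evaluates the objective only and does not implement the layer-by-layer unrolling.

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """Assumed log-det coding rate of features Z (d x n, one sample per column)."""
    d, n = Z.shape
    gram = np.eye(d) + (d / (n * eps ** 2)) * (Z @ Z.T)
    _, logdet = np.linalg.slogdet(gram)   # numerically stable log-determinant
    return 0.5 * logdet

def rate_reduction(Z, labels, eps=0.5):
    """Rate of the whole set minus the size-weighted rates of each class."""
    n = Z.shape[1]
    per_class = sum(
        (np.sum(labels == c) / n) * coding_rate(Z[:, labels == c], eps)
        for c in np.unique(labels)
    )
    return coding_rate(Z, eps) - per_class

# Hypothetical toy data: two classes of 100 samples each in a 4-D feature space.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 100)

separated = np.zeros((4, 200))
separated[0:2, :100] = rng.standard_normal((2, 100))   # class 0 uses axes 0 and 1
separated[2:4, 100:] = rng.standard_normal((2, 100))   # class 1 uses axes 2 and 3

overlapping = np.zeros((4, 200))
overlapping[0:2, :] = rng.standard_normal((2, 200))    # both classes share axes 0 and 1

gain_separated = rate_reduction(separated, labels)
gain_overlapping = rate_reduction(overlapping, labels)
print(f"separated: {gain_separated:.3f}  overlapping: {gain_overlapping:.3f}")
```

Under this assumed formula, features whose classes occupy different subspaces score a larger rate reduction than features whose classes share one subspace, which is the sense in which a representation can be compressed within each class while remaining expressive overall.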
The mathematics reveals something extraordinary: the coding rate reduction objective has a benign landscape where every local minimum is globally optimal, and all other critical points are strict saddles. Because gradient-based methods can escape strict saddles and every local minimum they settle into is already global, this offers an explanation for why gradient descent succeeds on this objective, a long-standing puzzle in deep learning.
But the theory also predicts failure modes: at extreme coding rates, learning collapses into either pure memorization or trivial lazy regimes. The framework quantitatively characterizes these phase transitions, explaining overfitting, double descent, and neural collapse as inevitable consequences of rate-distortion geometry. Not all compression leads to intelligence.
The authors position this not as the end but as a foundation: memory through compression is how current artificial intelligence operates, but it remains empirical and reactive, far from the deductive, hypothesis-generating intelligence we associate with science.