Memory Mosaics (2405.06394v3)
Published 10 May 2024 in cs.LG, cs.AI, and cs.NE
Abstract: Memory Mosaics are networks of associative memories working in concert to achieve a prediction task of interest. Like transformers, memory mosaics possess compositional and in-context learning capabilities. Unlike transformers, memory mosaics achieve these capabilities in comparatively transparent ways ("predictive disentanglement"). We illustrate these capabilities on a toy example and also show that memory mosaics perform as well as or better than transformers on medium-scale language modeling tasks.
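To make the "associative memory" building block concrete, the sketch below shows one common formulation: a key-value store queried by Gaussian-kernel regression over the stored pairs, which is the flavor of memory the paper builds on. This is a minimal illustration, not the paper's implementation; the class name, the `beta` bandwidth parameter, and the NumPy layout are assumptions made for the example.

```python
import numpy as np

class AssociativeMemory:
    """Illustrative key-value associative memory (Gaussian-kernel regression)."""

    def __init__(self, beta: float = 1.0):
        self.beta = beta    # kernel bandwidth (assumed hyperparameter)
        self.keys = []      # stored key vectors
        self.values = []    # stored value vectors

    def store(self, key: np.ndarray, value: np.ndarray) -> None:
        """Append one (key, value) pair to the memory."""
        self.keys.append(key)
        self.values.append(value)

    def retrieve(self, query: np.ndarray) -> np.ndarray:
        """Return the Gaussian-kernel weighted average of stored values."""
        K = np.stack(self.keys)                          # (n, d)
        V = np.stack(self.values)                        # (n, d_v)
        logits = -self.beta * np.sum((K - query) ** 2, axis=1)
        weights = np.exp(logits - logits.max())          # numerically stable
        weights /= weights.sum()
        return weights @ V                               # (d_v,)

# Usage: store a few random pairs, then query near one of the stored keys.
mem = AssociativeMemory(beta=5.0)
rng = np.random.default_rng(0)
for _ in range(8):
    mem.store(rng.normal(size=4), rng.normal(size=4))
print(mem.retrieve(mem.keys[0] + 0.01 * rng.normal(size=4)))
```

A Memory Mosaic, as described in the abstract, combines many such units so that each memory specializes on a different, more predictable piece of the overall prediction task.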