Bidirectional Attention as a Mixture of Continuous Word Experts (2307.04057v2)
Abstract: Bidirectional attention, composed of self-attention with positional encodings and the masked language model (MLM) objective, has emerged as a key component of modern large language models (LLMs). Despite its empirical success, few studies have examined its statistical underpinnings: What statistical model is bidirectional attention implicitly fitting? What sets it apart from its non-attention predecessors? We explore these questions in this paper. The key observation is that fitting a single-layer single-head bidirectional attention, upon reparameterization, is equivalent to fitting a continuous bag of words (CBOW) model with mixture-of-experts (MoE) weights. Further, bidirectional attention with multiple heads and multiple layers is equivalent to stacked MoEs and a mixture of MoEs, respectively. This statistical viewpoint reveals the distinct use of MoE in bidirectional attention, which aligns with its practical effectiveness in handling heterogeneous data. It also suggests an immediate extension to categorical tabular data, if we view each word location in a sentence as a tabular feature. Across empirical studies, we find that this extension outperforms existing tabular extensions of transformers in out-of-distribution (OOD) generalization. Finally, this statistical perspective of bidirectional attention enables us to theoretically characterize when linear word analogies are present in its word embeddings. These analyses show that bidirectional attention can require much stronger assumptions to exhibit linear word analogies than its non-attention predecessors.
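To make the stated equivalence concrete, below is a minimal NumPy sketch of a single-layer, single-head bidirectional attention prediction at a masked position. It is not the paper's notation or parameterization: the dimensions, the parameter names (Wq, Wk, Wv, U), and the simplification of using only the positional encoding as the masked-slot query are illustrative assumptions. The point it demonstrates is purely algebraic: the masked-word logits can be read as a convex combination of per-context-word predictions, i.e., attention weights acting as MoE weights over CBOW-style "experts".

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical values, for illustration only).
vocab_size, seq_len, d_model = 50, 8, 16

# Parameters of a single-layer, single-head bidirectional attention model.
E = rng.normal(size=(vocab_size, d_model))   # word embeddings
P = rng.normal(size=(seq_len, d_model))      # positional encodings
Wq = rng.normal(size=(d_model, d_model))     # query projection
Wk = rng.normal(size=(d_model, d_model))     # key projection
Wv = rng.normal(size=(d_model, d_model))     # value projection
U = rng.normal(size=(d_model, vocab_size))   # output (unembedding) matrix

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# A toy sentence with one masked position.
tokens = rng.integers(vocab_size, size=seq_len)
masked_pos = 3
context = [i for i in range(seq_len) if i != masked_pos]

# View 1: attention computation for the MLM prediction at the masked slot.
H = E[tokens] + P                   # token + positional representations
q = P[masked_pos] @ Wq              # query from the masked slot (position only; a simplification)
keys = H[context] @ Wk
attn = softmax(keys @ q)            # attention weights over context positions
logits_attn = (attn @ (H[context] @ Wv)) @ U

# View 2: the same computation read as a mixture of experts.
# Each context position is an "expert" that predicts the masked word from its
# own representation; the attention weights are the input-dependent mixture weights.
expert_logits = (H[context] @ Wv) @ U   # one row of logits per context-word expert
logits_moe = attn @ expert_logits       # convex combination of the experts

assert np.allclose(logits_attn, logits_moe)
probs = softmax(logits_moe)             # MLM distribution over the masked word
```

If the attention weights were replaced by uniform weights over the context, View 2 would reduce to an ordinary CBOW-style average of context-word predictions; the attention mechanism is what turns that average into an input-dependent mixture.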