Probabilistic Transformer: A Probabilistic Dependency Model for Contextual Word Representation (arXiv:2311.15211v1)
Abstract: Syntactic structures used to play a vital role in NLP, but since the deep learning revolution, NLP has been gradually dominated by neural models that do not consider syntactic structures in their design. One vastly successful class of neural models is transformers. When used as an encoder, a transformer produces contextual representations of the words in the input sentence. In this work, we propose a new model of contextual word representation, not from a neural perspective, but from a purely syntactic and probabilistic perspective. Specifically, we design a conditional random field that models discrete latent representations of all words in a sentence as well as dependency arcs between them, and we use mean field variational inference for approximate inference. Strikingly, we find that the computation graph of our model resembles that of a transformer, with correspondences between dependencies and self-attention and between distributions over latent representations and contextual word embeddings. Experiments show that our model performs competitively with transformers on small to medium-sized datasets. We hope that our work can help bridge the gap between traditional syntactic and probabilistic approaches and cutting-edge neural approaches to NLP, and inspire more linguistically principled neural approaches in the future.
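To make the described correspondence concrete, the sketch below illustrates one way such a model could be set up: each word carries a discrete latent label with a variational posterior Q, each word selects a head with an arc posterior A, and a few rounds of mean field updates alternate between the two. The bilinear potential `W`, the head-selection factorization, the update schedule, and all names here are assumptions introduced for illustration, not the authors' released implementation; only the high-level resemblance (arc posteriors behaving like attention weights, label posteriors behaving like contextual embeddings) is taken from the abstract.

```python
# Minimal illustrative sketch (NOT the paper's official code) of mean field
# inference in a CRF over discrete latent word labels and dependency arcs.
# The potential form and update schedule are assumptions made for this example.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mean_field(unary, W, n_iters=3):
    """
    unary: (n, d)  per-word scores over d discrete latent labels.
    W:     (d, d)  assumed bilinear potential coupling a dependent's label
                   with its head's label when an arc connects them.
    Returns:
      Q: (n, d)  label posteriors (play the role of contextual embeddings).
      A: (n, n)  head-choice posteriors (play the role of attention weights).
    """
    n, d = unary.shape
    Q = softmax(unary)                  # initialize label posteriors from unaries
    A = np.full((n, n), 1.0 / max(n - 1, 1))
    for _ in range(n_iters):
        # Arc update: expected label compatibility of word i with candidate
        # head j, normalized over heads -- structurally like attention scores.
        S = Q @ W @ Q.T                 # (n, n)
        np.fill_diagonal(S, -np.inf)    # a word cannot head itself
        A = softmax(S, axis=-1)
        # Label update: unary scores plus messages from likely heads,
        # weighted by arc posteriors -- structurally like value aggregation.
        msg = A @ (Q @ W.T)             # (n, d)
        Q = softmax(unary + msg)
    return Q, A

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_words, d_labels = 5, 16
    unary = rng.normal(size=(n_words, d_labels))
    W = rng.normal(scale=0.1, size=(d_labels, d_labels))
    Q, A = mean_field(unary, W)
    print(Q.shape, A.shape)             # (5, 16) (5, 5); each row of A sums to 1
```

Each row of `A` sums to one, so iterating the two updates for several rounds plays a role loosely analogous to stacking self-attention layers in a transformer encoder, which is the resemblance the abstract highlights.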