Tree-Planted Transformers: Unidirectional Transformer Language Models with Implicit Syntactic Supervision (2402.12691v2)
Abstract: Syntactic Language Models (SLMs) can be trained efficiently to reach relatively high performance; however, they suffer from inference inefficiency due to the explicit generation of syntactic structures. In this paper, we propose a new method dubbed tree-planting: instead of explicitly generating syntactic structures, we "plant" trees into the attention weights of unidirectional Transformer LMs to implicitly reflect the syntactic structures of natural language. Specifically, unidirectional Transformer LMs trained with tree-planting are called Tree-Planted Transformers (TPTs), which inherit the training efficiency of SLMs without changing the inference efficiency of their underlying Transformer LMs. Targeted syntactic evaluations on the SyntaxGym benchmark demonstrated that TPTs, despite the lack of explicit generation of syntactic structures, significantly outperformed not only vanilla Transformer LMs but also various SLMs that generate hundreds of syntactic structures in parallel. This result suggests that TPTs can learn human-like syntactic knowledge as data-efficiently as SLMs while leaving the modeling space of Transformer LMs unchanged.
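To make the idea of tree-planting concrete, below is a minimal sketch of how an auxiliary supervision loss on attention weights could look in PyTorch: a single head's causal attention distribution is pulled toward a target distribution derived from a syntactic tree, and the resulting loss is added to the ordinary next-token cross-entropy. This is not the paper's exact formulation; the target construction (each token attends to its preceding dependency head, falling back to itself), the KL-divergence loss, and the loss weight are illustrative assumptions.

```python
# Illustrative sketch of tree-planting-style attention supervision (assumptions noted above).
import torch
import torch.nn.functional as F


def tree_target(head_ids: list[int], seq_len: int) -> torch.Tensor:
    """Build a causal target attention matrix from (hypothetical) dependency heads:
    token i attends to its syntactic head if it precedes i, otherwise to itself."""
    target = torch.zeros(seq_len, seq_len)
    for i, h in enumerate(head_ids):
        j = h if 0 <= h < i else i  # fall back to self for roots or not-yet-seen heads
        target[i, j] = 1.0
    return target


def tree_planting_loss(attn: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """KL divergence between the supervised head's attention rows and the
    tree-derived target rows (both are distributions over preceding tokens)."""
    eps = 1e-9
    return F.kl_div((attn + eps).log(), target + eps, reduction="batchmean")


# Usage: `attn` stands in for one head's causal attention weights, shape (seq_len, seq_len).
seq_len = 5
scores = torch.randn(seq_len, seq_len)
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
attn = torch.softmax(scores.masked_fill(causal_mask, float("-inf")), dim=-1)

heads = [1, 3, 3, -1, 3]  # hypothetical dependency heads; -1 marks the root
loss_tp = tree_planting_loss(attn, tree_target(heads, seq_len))

lm_loss = torch.tensor(0.0)          # placeholder for the standard next-token loss
total_loss = lm_loss + 0.5 * loss_tp  # 0.5 is an arbitrary example weight
```

At inference time nothing changes: the auxiliary loss only shapes the attention weights during training, so decoding proceeds exactly as in a vanilla unidirectional Transformer LM, which is the property the abstract highlights.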
- Syntax-BERT: Improving Pre-trained Transformers with Syntax Trees. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 3011–3020, Online. Association for Computational Linguistics.
- Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc.
- Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.
- Emanuele Bugliarello and Naoaki Okazaki. 2020. Enhancing Machine Translation with Dependency-Aware Self-Attention. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1618–1627, Online. Association for Computational Linguistics.
- BLLIP 1987-89 WSJ Corpus Release 1.
- Sudden Drops in the Loss: Syntax Acquisition, Phase Transitions, and Simplicity Bias in MLMs.
- Noam Chomsky. 1957. Syntactic Structures. Mouton, The Hague.
- Variable beam search for generative neural parsing and its relevance for the analysis of neuro-imaging signal. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1150–1160, Hong Kong, China. Association for Computational Linguistics.
- BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
- Exploiting Syntactic Structure for Better Language Modeling: A Syntactic Distance Approach. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6611–6628, Online. Association for Computational Linguistics.
- Recurrent Neural Network Grammars. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 199–209, San Diego, California. Association for Computational Linguistics.
- SyntaxGym: An Online Platform for Targeted Evaluation of Language Models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 70–76, Online. Association for Computational Linguistics.
- Training Compute-Optimal Large Language Models.
- A Systematic Assessment of Syntactic Generalization in Neural Language Models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1725–1744, Online. Association for Computational Linguistics.
- Nikita Kitaev and Dan Klein. 2018. Constituency Parsing with a Self-Attentive Encoder. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2676–2686, Melbourne, Australia. Association for Computational Linguistics.
- What Do Recurrent Neural Network Grammars Learn About Syntax? In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1249–1258, Valencia, Spain. Association for Computational Linguistics.
- Henry W. Lin and Max Tegmark. 2017. Critical Behavior in Physics and Probabilistic Formal Languages. Entropy, 19(7):299.
- DREEAM: Guiding Attention with Evidence for Improving Document-Level Relation Extraction. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 1971–1983, Dubrovnik, Croatia. Association for Computational Linguistics.
- Explosion/spaCy: V3.7.2: Fixes for APIs and requirements. Zenodo.
- Aaron Mueller and Tal Linzen. 2023. How to plant trees in language models: Data and architectural effects on the emergence of syntactic inductive biases. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11237–11252, Toronto, Canada. Association for Computational Linguistics.
- Pushdown Layers: Encoding Recursive Structure in Transformer Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3233–3247, Singapore. Association for Computational Linguistics.
- Tree-structured attention with hierarchical accumulation. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
- Hiroshi Noji and Yohei Oseki. 2021. Effective Batching for Recurrent Neural Network Grammars. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4340–4352, Online. Association for Computational Linguistics.
- PaLM: A Hybrid Parser and Language Model. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3644–3651, Hong Kong, China. Association for Computational Linguistics.
- Structural Guidance for Transformer Language Models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3735–3745, Online. Association for Computational Linguistics.
- Language models are unsupervised multitask learners.
- Scaling Language Models: Methods, Analysis & Insights from Training Gopher.
- Do Syntax Trees Help Pre-trained Transformers Extract Information? In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2647–2661, Online. Association for Computational Linguistics.
- Transformer Grammars: Augmenting Transformer Language Models with Syntactic Inductive Biases at Scale.
- Neural language modeling by jointly learning syntax and lexicon. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net.
- Ordered neurons: Integrating tree structures into recurrent neural networks. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.
- Semantics-aware Attention Improves Neural Machine Translation. In Proceedings of the 11th Joint Conference on Lexical and Computational Semantics, pages 28–43, Seattle, Washington. Association for Computational Linguistics.
- Effective Inference for Generative Neural Parsing. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1695–1700, Copenhagen, Denmark. Association for Computational Linguistics.
- Linguistically-Informed Self-Attention for Semantic Role Labeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5027–5038, Brussels, Belgium. Association for Computational Linguistics.
- Attention is All you Need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
- Tree Transformer: Integrating Tree Structures into Self-Attention. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1061–1070, Hong Kong, China. Association for Computational Linguistics.
- Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora. In Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, pages 1–34, Singapore. Association for Computational Linguistics.
- Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
- Phrase-level Self-Attention Networks for Universal Sentence Encoding. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3729–3738, Brussels, Belgium. Association for Computational Linguistics.
- Ryo Yoshida and Yohei Oseki. 2022. Composition, Attention, or Both? In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 5822–5834, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Ryo Yoshida
- Taiga Someya
- Yohei Oseki