
Forming Trees with Treeformers (2207.06960v2)

Published 14 Jul 2022 in cs.CL, cs.AI, and cs.LG

Abstract: Human language is known to exhibit a nested, hierarchical structure, allowing us to form complex sentences out of smaller pieces. However, many state-of-the-art neural network models, such as Transformers, have no explicit hierarchical structure in their architecture -- that is, they lack an inductive bias toward hierarchical structure. Additionally, Transformers are known to perform poorly on compositional generalization tasks, which require such structures. In this paper, we introduce Treeformer, a general-purpose encoder module inspired by the CKY algorithm that learns a composition operator and a pooling function to construct hierarchical encodings for phrases and sentences. Our extensive experiments demonstrate the benefits of incorporating hierarchical structure into the Transformer and show significant improvements in compositional generalization as well as in downstream tasks such as machine translation, abstractive summarization, and various natural language understanding tasks.
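The abstract describes a CKY-inspired chart encoder that applies a learned composition operator to every binary split of a span and then pools over the candidate splits. The snippet below is a minimal sketch of that general idea, not the paper's actual implementation: the class name, the MLP used as the composition operator, and the attention-style pooling over split points are all assumptions made for illustration.

```python
# A minimal sketch of a CKY-style chart encoder, based only on the abstract's
# description (learned composition operator + pooling over split points).
# All names and design details here are illustrative assumptions.
import torch
import torch.nn as nn


class ChartEncoderSketch(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        # Learned composition operator: maps a (left, right) pair of span
        # encodings to a single encoding for the merged span.
        self.compose = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.Tanh(),
        )
        # Scorer used for attention-style pooling over candidate split points.
        self.split_scorer = nn.Linear(d_model, 1)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (seq_len, d_model) for a single sentence.
        n, _ = token_embeddings.shape
        # chart[i][j] holds the encoding of the span covering tokens i..j.
        chart = [[None] * n for _ in range(n)]
        for i in range(n):
            chart[i][i] = token_embeddings[i]
        # Fill the chart bottom-up over increasing span lengths (CKY order).
        for length in range(2, n + 1):
            for i in range(0, n - length + 1):
                j = i + length - 1
                # Compose every binary split of the span, then pool over splits.
                candidates = []
                for k in range(i, j):
                    pair = torch.cat([chart[i][k], chart[k + 1][j]], dim=-1)
                    candidates.append(self.compose(pair))
                candidates = torch.stack(candidates)  # (num_splits, d_model)
                weights = torch.softmax(
                    self.split_scorer(candidates).squeeze(-1), dim=0
                )
                chart[i][j] = (weights.unsqueeze(-1) * candidates).sum(dim=0)
        # The root cell encodes the whole sentence.
        return chart[0][n - 1]


# Usage: encode a toy 5-token "sentence" of random embeddings.
encoder = ChartEncoderSketch(d_model=16)
sentence = torch.randn(5, 16)
print(encoder(sentence).shape)  # torch.Size([16])
```

As in CKY parsing, the chart fill is cubic in sentence length, which is the price this kind of encoder pays for considering every possible binary bracketing rather than committing to a single tree.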
