Structural Optimization Ambiguity and Simplicity Bias in Unsupervised Neural Grammar Induction (2407.16181v1)
Abstract: Neural parameterization has significantly advanced unsupervised grammar induction. However, training these models with the traditional likelihood loss over all possible parses exacerbates two issues: 1) $\textit{structural optimization ambiguity}$, which arbitrarily selects one among structurally ambiguous optimal grammars despite the specific preferences expressed by gold parses, and 2) $\textit{structural simplicity bias}$, which leads a model to underuse rules when composing parse trees. These challenges subject unsupervised neural grammar induction (UNGI) to inevitable prediction errors, high variance, and a need for extensive grammars to achieve accurate predictions. This paper tackles both issues and offers a comprehensive analysis of their origins. As a solution, we introduce $\textit{sentence-wise parse-focusing}$, which narrows the parse pool per sentence for loss evaluation, using structural biases from parsers pre-trained on the same dataset. On unsupervised parsing benchmarks, our method significantly improves performance while effectively reducing variance and the bias toward overly simplistic parses. Our research promotes learning more compact, accurate, and consistent explicit grammars, facilitating better interpretability.
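To make the core idea concrete, here is a minimal sketch of a parse-focused loss, assuming a toy PCFG interface in PyTorch; the names (`rule_logp`, `tree_logp`, `focused_loss`, `focused_trees`) are illustrative placeholders, not the paper's actual implementation. Instead of marginalizing over all possible parses (e.g., with the inside algorithm), the loss marginalizes only over a small per-sentence pool of parses supplied by pre-trained parsers.

```python
# A hypothetical sketch of sentence-wise parse-focusing on a toy grammar:
# each parse is represented as a list of rule indices, and the focused
# parse pool stands in for trees produced by pre-trained parsers.
import torch

def tree_logp(tree, rule_logp):
    """Log-probability of one parse: the sum of its rule log-probabilities."""
    return sum(rule_logp[r] for r in tree)

def focused_loss(focused_trees, rule_logp):
    """Negative log-likelihood over a small per-sentence parse pool,
    rather than over all possible parses of the sentence."""
    scores = torch.stack([tree_logp(t, rule_logp) for t in focused_trees])
    return -torch.logsumexp(scores, dim=0)

# Toy usage: four rules with learnable scores, and two focused parses
# (rule-index sequences) standing in for pre-trained parser outputs.
logits = torch.randn(4, requires_grad=True)   # hypothetical rule scores
rule_logp = torch.log_softmax(logits, dim=0)  # normalize to log-probabilities
focused_trees = [[0, 1, 3], [0, 2, 3]]        # hypothetical focused parse pool
loss = focused_loss(focused_trees, rule_logp)
loss.backward()  # gradient flows only through the focused parses
```

Because the gradient flows only through the focused parses, the structural preferences of the pre-trained parsers constrain which of the structurally ambiguous optima the induced grammar converges to.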