Unsupervised Parsing by Searching for Frequent Word Sequences among Sentences with Equivalent Predicate-Argument Structures (2404.12059v2)
Abstract: Unsupervised constituency parsing aims to identify the word sequences that form syntactic units (i.e., constituents) in a target sentence. Linguists identify a constituent by examining a set of sentences with equivalent Predicate-Argument Structure (PAS), in which the constituent appears more frequently than non-constituents (i.e., the constituent corresponds to a frequent word sequence within the sentence set). However, such frequency information is unavailable to previous parsing methods, which identify constituents by observing sentences with diverse PAS. In this study, we empirically show that constituents correspond to frequent word sequences in the PAS-equivalent sentence set. We propose a frequency-based parser, span-overlap, that (1) computes the span-overlap score of a word sequence as its frequency in the PAS-equivalent sentence set and (2) identifies the constituent structure by finding the constituent tree with the maximum total span-overlap score. The parser achieves state-of-the-art parsing accuracy, outperforming existing unsupervised parsers in eight out of ten languages. Additionally, we discover a multilingual phenomenon: participant-denoting constituents tend to have higher span-overlap scores than event-denoting constituents of the same length, meaning that the former tend to appear more frequently in the PAS-equivalent sentence set than the latter. This phenomenon indicates a statistical difference between the two constituent types, laying the foundation for future labeled unsupervised parsing research.
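The following is a minimal sketch of the two steps described in the abstract, under assumed details: it scores a span by counting how many PAS-equivalent sentences contain it as a contiguous word sequence, and searches for the binary tree whose spans maximize the summed score with a CKY-style dynamic program. The function names (`span_overlap_score`, `best_tree`) and the exact scoring and decoding choices are illustrative assumptions, not the authors' implementation.

```python
from functools import lru_cache

def span_overlap_score(span, pas_equivalent_sentences):
    """Count how many PAS-equivalent sentences contain `span` as a
    contiguous word sequence (the span's frequency in the sentence set)."""
    span = list(span)
    n = len(span)
    count = 0
    for sent in pas_equivalent_sentences:
        if any(sent[i:i + n] == span for i in range(len(sent) - n + 1)):
            count += 1
    return count

def best_tree(words, pas_equivalent_sentences):
    """CKY-style search for the binary constituent tree whose spans have
    the maximum total span-overlap score (an assumed decoding objective)."""

    @lru_cache(maxsize=None)
    def parse(i, j):
        # Returns (total_score, tree) for the span words[i:j].
        score = span_overlap_score(tuple(words[i:j]), pas_equivalent_sentences)
        if j - i == 1:
            return score, words[i]
        best = None
        for k in range(i + 1, j):  # try every binary split point
            left_score, left_tree = parse(i, k)
            right_score, right_tree = parse(k, j)
            candidate = (score + left_score + right_score, (left_tree, right_tree))
            if best is None or candidate[0] > best[0]:
                best = candidate
        return best

    return parse(0, len(words))

# Hypothetical usage with a toy PAS-equivalent sentence set.
pas_set = [
    "the cat chased the dog".split(),
    "the dog was chased by the cat".split(),
    "it was the cat that chased the dog".split(),
]
score, tree = best_tree("the cat chased the dog".split(), pas_set)
```

In this sketch, participant-denoting spans such as "the cat" recur across the reformulated sentences and therefore accumulate higher span-overlap scores than event-denoting spans of the same length, mirroring the statistical difference reported in the abstract.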