
MRL Parsing Without Tears: The Case of Hebrew (2403.06970v1)

Published 11 Mar 2024 in cs.CL

Abstract: Syntactic parsing remains a critical tool for relation extraction and information extraction, especially in resource-scarce languages where LLMs are lacking. Yet in morphologically rich languages (MRLs), where parsers need to identify multiple lexical units in each token, existing systems suffer in latency and setup complexity. Some use a pipeline to peel away the layers: first segmentation, then morphology tagging, and then syntax parsing; however, errors in earlier layers are then propagated forward. Others use a joint architecture to evaluate all permutations at once; while this improves accuracy, it is notoriously slow. In contrast, and taking Hebrew as a test case, we present a new "flipped pipeline": decisions are made directly on the whole-token units by expert classifiers, each one dedicated to one specific task. The classifiers are independent of one another, and only at the end do we synthesize their predictions. This blazingly fast approach sets a new SOTA in Hebrew POS tagging and dependency parsing, while also reaching near-SOTA performance on other Hebrew NLP tasks. Because our architecture does not rely on any language-specific resources, it can serve as a model to develop similar parsers for other MRLs.
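The "flipped pipeline" described above can be sketched as follows: each expert classifier runs independently on the whole-token input (no expert consumes another's output, so earlier errors cannot propagate), and a final synthesis step aligns the per-task predictions. This is a minimal illustration only; all function and class names here are hypothetical stand-ins, not the paper's actual implementation.

```python
# Sketch of a "flipped pipeline": independent per-task experts over
# whole tokens, synthesized only at the end. Hypothetical names.
from dataclasses import dataclass, field

@dataclass
class TokenAnalysis:
    token: str
    predictions: dict = field(default_factory=dict)

def pos_expert(tokens):
    # stand-in for a fine-tuned whole-token POS classifier
    return ["NOUN" for _ in tokens]

def seg_expert(tokens):
    # stand-in for a token-segmentation classifier
    return [[t] for t in tokens]

def dep_expert(tokens):
    # stand-in for a dependency-head classifier (0 = root)
    return [0 for _ in tokens]

EXPERTS = {"pos": pos_expert, "seg": seg_expert, "dep": dep_expert}

def flipped_pipeline(tokens):
    # Every expert sees the same raw tokens; no stage feeds another,
    # which is what avoids pipeline error propagation.
    results = {name: expert(tokens) for name, expert in EXPERTS.items()}
    # Synthesis: align the independent per-task predictions per token.
    return [
        TokenAnalysis(tok, {name: results[name][i] for name in EXPERTS})
        for i, tok in enumerate(tokens)
    ]
```

Because the experts are independent, they can run in parallel, which is consistent with the latency advantage the abstract claims over joint lattice-based architectures.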

Authors (4)
  1. Shaltiel Shmidman (10 papers)
  2. Avi Shmidman (13 papers)
  3. Moshe Koppel (16 papers)
  4. Reut Tsarfaty (54 papers)
Citations (1)