
Zero Resource Cross-Lingual Part Of Speech Tagging (2401.05727v1)

Published 11 Jan 2024 in cs.CL

Abstract: Part-of-speech (POS) tagging in zero-resource settings can be an effective approach for low-resource languages when no labeled training data is available. Existing systems use two main techniques for POS tagging: pretrained multilingual large language models (LLMs), or projecting the source-language labels into the zero-resource target language and training a sequence labeling model on the projected data. We explore the latter approach, using an off-the-shelf alignment module and training a hidden Markov model (HMM) to predict the POS tags. We evaluate a transfer learning setup with English as the source language and French, German, and Spanish as target languages. We conclude that projected alignment data in a zero-resource language can be beneficial for predicting POS tags.
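The projection-then-HMM idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the helper names (`project_tags`, `train_hmm`, `viterbi`), the hand-written word alignments, the toy sentences, and the add-one smoothing are all assumptions made for illustration. In practice the alignments would come from an off-the-shelf aligner rather than being typed in by hand.

```python
import math
from collections import Counter, defaultdict

def project_tags(src_tags, alignment, tgt_len):
    """Copy each source tag to its aligned target token; unaligned tokens stay None."""
    tgt_tags = [None] * tgt_len
    for s, t in alignment:
        if tgt_tags[t] is None:  # keep the first projected tag on conflicts
            tgt_tags[t] = src_tags[s]
    return tgt_tags

def train_hmm(sentences):
    """Count transitions and emissions from [(word, tag), ...] sentences."""
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    for sent in sentences:
        prev = "<s>"
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word.lower()] += 1
            prev = tag
    tags = sorted(emit)
    vocab = {w for counts in emit.values() for w in counts}
    return trans, emit, tags, vocab

def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability (smoothing choice is an assumption)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))

def viterbi(words, model):
    """Return the most likely tag sequence under the smoothed HMM."""
    trans, emit, tags, vocab = model
    n_tags, n_words = len(tags), len(vocab) + 1  # +1 slot for unknown words
    scores = {t: log_prob(trans["<s>"], t, n_tags)
                 + log_prob(emit[t], words[0].lower(), n_words) for t in tags}
    back = []
    for w in words[1:]:
        new_scores, pointers = {}, {}
        for t in tags:
            best = max(tags, key=lambda p: scores[p] + log_prob(trans[p], t, n_tags))
            new_scores[t] = (scores[best] + log_prob(trans[best], t, n_tags)
                             + log_prob(emit[t], w.lower(), n_words))
            pointers[t] = best
        scores = new_scores
        back.append(pointers)
    path = [max(tags, key=lambda t: scores[t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Toy parallel data: (English tags, French tokens, hypothetical alignment pairs).
parallel = [
    (["DET", "NOUN", "VERB"], ["le", "chat", "dort"], [(0, 0), (1, 1), (2, 2)]),
    (["DET", "NOUN", "VERB"], ["un", "chien", "mange"], [(0, 0), (1, 1), (2, 2)]),
    (["DET", "NOUN", "VERB"], ["le", "chien", "dort"], [(0, 0), (1, 1), (2, 2)]),
]
projected = []
for src_tags, tgt_words, alignment in parallel:
    tgt_tags = project_tags(src_tags, alignment, len(tgt_words))
    if None not in tgt_tags:  # this sketch simply skips partially tagged sentences
        projected.append(list(zip(tgt_words, tgt_tags)))

model = train_hmm(projected)
predicted = viterbi(["un", "chat", "mange"], model)
print(predicted)
```

The HMM here never sees gold target-language labels: it is trained only on tags copied across the alignment, which is the sense in which the target language is zero-resource. Skipping sentences with unaligned tokens is one simple choice; other treatments of unaligned tokens are possible.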

Authors (1)
  1. Sahil Chopra (2 papers)