BenCoref: A Multi-Domain Dataset of Nominal Phrases and Pronominal Reference Annotations (2304.03682v3)

Published 7 Apr 2023 in cs.CL

Abstract: Coreference resolution is a well-studied problem in NLP. While widely studied for English and other resource-rich languages, coreference resolution in Bengali remains largely unexplored due to the absence of relevant datasets. Bengali, a low-resource language, exhibits greater morphological richness than English. In this article, we introduce a new dataset, BenCoref, comprising coreference annotations for Bengali texts gathered from four distinct domains. This relatively small dataset contains 5,200 mention annotations forming 502 mention clusters within 48,569 tokens. We describe the process of creating the dataset and report the performance of multiple models trained on BenCoref. We expect that our work provides valuable insights into the variation of coreference phenomena across several domains in Bengali and encourages the development of additional resources for the language. Furthermore, we found poor cross-lingual performance in a zero-shot setting from English, highlighting the need for more language-specific resources for this task.
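
The abstract characterizes BenCoref by token, mention, and cluster counts. As a rough illustration only, the Python sketch below shows one way such cluster annotations could be represented and tallied; the JSON schema, field names, and file name are assumptions for illustration, not the published BenCoref format.

```python
# Minimal sketch: representing and summarizing coreference cluster annotations.
# The schema below is hypothetical -- the paper does not specify a release format.
import json
from collections import Counter

def load_documents(path):
    """Load documents, each with a token list and mention clusters.

    Assumed (hypothetical) schema per document:
        {"tokens": [...], "clusters": [[[start, end], ...], ...]}
    where each cluster is a list of [start, end] token spans referring
    to the same entity.
    """
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def corpus_stats(documents):
    """Compute corpus-level counts comparable to those reported in the abstract."""
    n_tokens = sum(len(doc["tokens"]) for doc in documents)
    n_clusters = sum(len(doc["clusters"]) for doc in documents)
    n_mentions = sum(len(c) for doc in documents for c in doc["clusters"])
    cluster_sizes = Counter(len(c) for doc in documents for c in doc["clusters"])
    return {
        "tokens": n_tokens,
        "clusters": n_clusters,
        "mentions": n_mentions,
        "cluster_size_histogram": dict(sorted(cluster_sizes.items())),
    }

if __name__ == "__main__":
    docs = load_documents("bencoref.json")  # hypothetical file name
    print(corpus_stats(docs))
```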

Authors (4)
  1. Shadman Rohan (2 papers)
  2. Mojammel Hossain (2 papers)
  3. Mohammad Mamun Or Rashid (6 papers)
  4. Nabeel Mohammed (27 papers)
Citations (1)
