NLU-STR at SemEval-2024 Task 1: Generative-based Augmentation and Encoder-based Scoring for Semantic Textual Relatedness (2405.00659v1)
Abstract: Semantic textual relatedness is a broader concept than semantic similarity. It measures the extent to which two chunks of text convey similar meaning or topics, or share related concepts or contexts. This notion of relatedness is useful in applications such as document clustering and summarization. SemRel-2024, a shared task in SemEval-2024, aims to reduce the gap in the semantic relatedness task by providing datasets for fourteen languages and dialects, including Arabic. This paper reports on our participation in Track A (Algerian and Moroccan dialects) and Track B (Modern Standard Arabic). A BERT-based model is augmented and fine-tuned for regression scoring in the supervised track (A), while BERT-based cosine similarity is employed for the unsupervised track (B). Our system ranked 1st in SemRel-2024 for MSA with a Spearman correlation score of 0.49. We ranked 5th for Moroccan and 12th for Algerian with scores of 0.83 and 0.53, respectively.
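The unsupervised scoring and the evaluation metric mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy token vectors stand in for BERT token embeddings, mean pooling is an assumed pooling strategy (the abstract does not name one), and the helper names `mean_pool`, `cosine`, `ranks`, and `spearman` are hypothetical.

```python
import math


def mean_pool(token_vectors):
    """Average equal-length token vectors into one sentence vector (assumed pooling)."""
    n = len(token_vectors)
    return [sum(col) / n for col in zip(*token_vectors)]


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy) if sx and sy else 0.0


# Toy sentence pairs: each sentence is a list of made-up 2-d "token embeddings".
pairs = [
    ([[1.0, 0.0], [0.8, 0.2]], [[0.9, 0.1], [1.0, 0.0]]),  # closely related
    ([[1.0, 0.0], [0.6, 0.4]], [[0.5, 0.5], [0.4, 0.6]]),  # somewhat related
    ([[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0], [0.1, 0.9]]),  # weakly related
]
gold = [0.9, 0.5, 0.1]  # illustrative human relatedness scores, not task data

predicted = [cosine(mean_pool(a), mean_pool(b)) for a, b in pairs]
score = spearman(predicted, gold)
```

Because Spearman correlation compares rankings only, the predicted cosine values do not need to be on the same scale as the gold scores. In a real setting the toy vectors would be replaced by the encoder's token outputs.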
Authors: Sanad Malaysha, Mustafa Jarrar, Mohammed Khalilia