SemRel2024: A Collection of Semantic Textual Relatedness Datasets for 13 Languages (2402.08638v5)

Published 13 Feb 2024 in cs.CL

Abstract: Exploring and quantifying semantic relatedness is central to representing language and holds significant implications across various NLP tasks. While earlier NLP research primarily focused on semantic similarity, often within the English language context, we instead investigate the broader phenomenon of semantic relatedness. In this paper, we present SemRel, a new semantic relatedness dataset collection annotated by native speakers across 13 languages: Afrikaans, Algerian Arabic, Amharic, English, Hausa, Hindi, Indonesian, Kinyarwanda, Marathi, Moroccan Arabic, Modern Standard Arabic, Spanish, and Telugu. These languages originate from five distinct language families and are predominantly spoken in Africa and Asia -- regions characterised by a relatively limited availability of NLP resources. Each instance in the SemRel datasets is a sentence pair associated with a score that represents the degree of semantic textual relatedness between the two sentences. The scores are obtained using a comparative annotation framework. We describe the data collection and annotation processes, challenges when building the datasets, baseline experiments, and their impact and utility in NLP.

Overview of "SemRel2024: A Collection of Semantic Textual Relatedness Datasets for 14 Languages"

The paper "SemRel2024: A Collection of Semantic Textual Relatedness Datasets for 13 Languages" introduces a new dataset collection for evaluating Semantic Textual Relatedness (STR) across 13 languages. These languages are predominantly spoken in Africa and Asia and include both high-resource and low-resource languages: Afrikaans, Algerian Arabic, Amharic, English, Hausa, Hindi, Indonesian, Kinyarwanda, Marathi, Moroccan Arabic, Modern Standard Arabic, Spanish, and Telugu. The collection aims to advance NLP resources for low-resource languages by broadening the study of semantic relatedness beyond mere similarity to encompass relations such as entailment, topical alignment, and temporal congruence.

Dataset Collection and Composition

The SemRel dataset stands out because it is curated specifically for monolingual STR tasks across five distinct language families. It includes a variety of sentence pairings, each annotated with a relatedness score ranging from 0 to 1. The annotation was conducted using a Best-Worst Scaling (BWS) approach among native speakers, ensuring that the derived scores typically align with human intuition about semantic relatedness within each language.
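As a rough illustration of how BWS annotations are typically aggregated into per-item scores, consider the sketch below: each item's score is the fraction of times it was chosen as most related minus the fraction it was chosen as least related, rescaled to [0, 1]. The record format (`items`, `best`, `worst`) and the 4-item tuple size are illustrative assumptions, not the authors' exact pipeline.

```python
from collections import defaultdict

def bws_scores(annotations):
    """Aggregate Best-Worst Scaling annotations into relatedness scores.

    Each annotation is a dict with:
      "items": IDs of the (typically 4) sentence pairs shown together,
      "best":  the ID judged most semantically related,
      "worst": the ID judged least semantically related.
    Returns {item_id: score in [0, 1]}.
    """
    n_best = defaultdict(int)
    n_worst = defaultdict(int)
    n_seen = defaultdict(int)
    for ann in annotations:
        for item in ann["items"]:
            n_seen[item] += 1
        n_best[ann["best"]] += 1
        n_worst[ann["worst"]] += 1
    # The raw BWS score lies in [-1, 1]; rescale it linearly to [0, 1]
    # to match the dataset's relatedness score range.
    return {
        item: ((n_best[item] - n_worst[item]) / n_seen[item] + 1) / 2
        for item in n_seen
    }
```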

The datasets were compiled from diverse publicly available text sources spanning domains such as news articles, conversational threads, and other formal and informal writing. Each language dataset was carefully curated to capture diverse linguistic idiosyncrasies, using a combination of lexical overlap, entailment, paraphrase generation, and random sampling methods to ensure rich variance in relatedness scores.
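A minimal sketch of one of these sampling strategies follows: selecting candidate pairs by lexical overlap (likely related) and mixing in random pairs (likely unrelated) so that annotation covers the full relatedness scale. The Jaccard measure, the threshold, and the counts are illustrative assumptions, not the authors' exact procedure.

```python
import itertools
import random

def token_jaccard(s1: str, s2: str) -> float:
    """Jaccard overlap between the lowercase token sets of two sentences."""
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    return len(t1 & t2) / len(t1 | t2) if (t1 or t2) else 0.0

def sample_candidate_pairs(sentences, overlap_threshold=0.3,
                           n_random=200, seed=0):
    """Mix lexically overlapping pairs with random pairs so the
    resulting collection spans high and low relatedness."""
    rng = random.Random(seed)
    overlapping = [
        (a, b) for a, b in itertools.combinations(sentences, 2)
        if token_jaccard(a, b) >= overlap_threshold
    ]
    random_pairs = [tuple(rng.sample(sentences, 2)) for _ in range(n_random)]
    return overlapping + random_pairs
```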

Methodology and Challenges

The paper discusses the linguistic and technical challenges intrinsic to preparing datasets for low-resource languages. These include the limited availability of textual data and the difficulty of ensuring balanced representation across relatedness levels without relying disproportionately on high-resource contextual models, which may not adequately represent less-resourced languages.

Annotation reliability was strengthened by using BWS instead of traditional Likert-style rating scales, which mitigates the consistency issues associated with direct scoring. High split-half reliability scores indicate that annotators consistently agreed on instance rankings.
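A common way to compute split-half reliability, broadly consistent with what the paper describes (the exact protocol may differ), is to split the annotations into two random halves, score each half independently, and correlate the resulting rankings. The sketch below reuses the hypothetical `bws_scores` helper from above.

```python
import random
from scipy.stats import spearmanr

def split_half_reliability(annotations, n_trials=100, seed=0):
    """Average Spearman correlation between scores derived from two
    random halves of the annotations, over repeated random splits."""
    rng = random.Random(seed)
    correlations = []
    for _ in range(n_trials):
        shuffled = list(annotations)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        # Score each half independently with the bws_scores sketch above.
        scores_a = bws_scores(shuffled[:half])
        scores_b = bws_scores(shuffled[half:])
        common = sorted(set(scores_a) & set(scores_b))
        rho, _ = spearmanr([scores_a[i] for i in common],
                           [scores_b[i] for i in common])
        correlations.append(rho)
    return sum(correlations) / len(correlations)
```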

Evaluation and Results

The paper reports a series of STR model experiments in both supervised and unsupervised settings, using multilingual models such as mBERT and XLM-RoBERTa alongside language-specific models. The results reveal that while multilingual BERT-based models achieve reasonable STR performance, their efficacy varies notably across languages, likely due to biases in the pre-training data toward high-resource languages.
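For the unsupervised setting, a typical baseline of this kind (a sketch under assumptions, not necessarily the authors' exact setup) embeds each sentence with a multilingual encoder, scores a pair by cosine similarity, and evaluates with Spearman correlation against the gold relatedness scores. The choice of LaBSE here is illustrative.

```python
# pip install sentence-transformers scipy
from scipy.spatial.distance import cosine
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer

def unsupervised_str_baseline(pairs, gold_scores,
                              model_name="sentence-transformers/LaBSE"):
    """Score sentence pairs by embedding cosine similarity and evaluate
    against gold relatedness scores with Spearman correlation."""
    model = SentenceTransformer(model_name)
    emb1 = model.encode([a for a, _ in pairs])
    emb2 = model.encode([b for _, b in pairs])
    # Cosine similarity = 1 - cosine distance.
    predictions = [1 - cosine(u, v) for u, v in zip(emb1, emb2)]
    rho, _ = spearmanr(predictions, gold_scores)
    return rho
```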

Language-specific models, where employed, generally outperformed their multilingual counterparts, signaling the continued importance of focused language-specific model development. The paper also acknowledges variability in correlation scores across languages, highlighting in particular the challenges posed by languages such as Punjabi (included in earlier versions of the dataset) and indicating room for further research.

Implications and Future Directions

The SemRel datasets serve as a cornerstone for expanded multilingual and cross-lingual NLP research, particularly in developing effective STR models that can enhance various downstream tasks such as sentiment analysis, machine translation, and dialogue systems. The initiative underscores the need for NLP advancements suited to low-resource environments and provides foundational data necessary to train models capable of accurately interpreting diverse semantic nuances across languages.

The authors suggest future dataset expansions and encourage the NLP community to build upon the SemRel datasets for broader STR research and applications. Future investigations might also leverage emerging architectures that adapt dynamically to low-resource scenarios, further improving model efficacy in capturing cross-lingual semantics.

Authors (27)
  1. Nedjma Ousidhoum (17 papers)
  2. Shamsuddeen Hassan Muhammad (42 papers)
  3. Mohamed Abdalla (12 papers)
  4. Idris Abdulmumin (39 papers)
  5. Ibrahim Said Ahmad (28 papers)
  6. Sanchit Ahuja (7 papers)
  7. Alham Fikri Aji (94 papers)
  8. Vladimir Araujo (25 papers)
  9. Abinew Ali Ayele (17 papers)
  10. Pavan Baswani (4 papers)
  11. Meriem Beloucif (11 papers)
  12. Chris Biemann (78 papers)
  13. Sofia Bourhim (2 papers)
  14. Christine De Kock (11 papers)
  15. Genet Shanko Dekebo (1 paper)
  16. Oumaima Hourrane (6 papers)
  17. Gopichand Kanumolu (6 papers)
  18. Lokesh Madasu (4 papers)
  19. Samuel Rutunda (4 papers)
  20. Manish Shrivastava (62 papers)
Citations (39)