FarsTail: A Persian Natural Language Inference Dataset (2009.08820v2)

Published 18 Sep 2020 in cs.CL

Abstract: Natural language inference (NLI) is known as one of the central tasks in NLP which encapsulates many fundamental aspects of language understanding. With the considerable achievements of data-hungry deep learning methods in NLP tasks, a great amount of effort has been devoted to develop more diverse datasets for different languages. In this paper, we present a new dataset for the NLI task in the Persian language, also known as Farsi, which is one of the dominant languages in the Middle East. This dataset, named FarsTail, includes 10,367 samples which are provided in both the Persian language as well as the indexed format to be useful for non-Persian researchers. The samples are generated from 3,539 multiple-choice questions with the least amount of annotator interventions in a way similar to the SciTail dataset. A carefully designed multi-step process is adopted to ensure the quality of the dataset. We also present the results of traditional and state-of-the-art methods on FarsTail including different embedding methods such as word2vec, fastText, ELMo, BERT, and LASER, as well as different modeling approaches such as DecompAtt, ESIM, HBMP, and ULMFiT to provide a solid baseline for the future research. The best obtained test accuracy is 83.38% which shows that there is a big room for improving the current methods to be useful for real-world NLP applications in different languages. We also investigate the extent to which the models exploit superficial clues, also known as dataset biases, in FarsTail, and partition the test set into easy and hard subsets according to the success of biased models. The dataset is available at https://github.com/dml-qom/FarsTail

An Analysis of FarsTail: A Persian Natural Language Inference Dataset

The paper introduces FarsTail, a Persian-language natural language inference (NLI) dataset that adds to the limited pool of NLI resources for languages other than English. FarsTail targets the growing interest in applying deep learning models to language understanding in Persian, a language spoken by millions but underrepresented in NLP datasets.

Dataset Composition and Methodology

FarsTail comprises 10,367 samples, each consisting of a premise and a hypothesis sentence whose relationship is annotated as entailment, contradiction, or neutral. The samples are provided both in Persian and in an indexed format so that non-Persian researchers can use them. The data collection process drew inspiration from the SciTail dataset, focusing on naturally occurring data with minimal annotator intervention, and a multi-step process was used to ensure quality. The dataset was generated from 3,539 multiple-choice questions sourced from Iranian university exams. Hypotheses were formed by substituting answers into the questions, premises were extracted from web content to correspond to those hypotheses, and the resulting premise-hypothesis pairs cover the different logical relationships.
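
As a rough illustration of the sample structure described above, the sketch below parses a few premise/hypothesis/label rows from tab-separated text. The column names, the single-letter label codes, and the English toy rows are assumptions made for illustration only; the actual file layout should be checked in the dataset repository.

```python
import csv
import io
from collections import Counter
from typing import List, NamedTuple, TextIO


class NLISample(NamedTuple):
    premise: str
    hypothesis: str
    label: str  # assumed codes: "e" entailment, "c" contradiction, "n" neutral


# Invented toy rows (English for readability); not taken from FarsTail.
TOY_TSV = (
    "premise\thypothesis\tlabel\n"
    "Tehran is the capital of Iran.\tThe capital of Iran is Tehran.\te\n"
    "Tehran is the capital of Iran.\tThe capital of Iran is Shiraz.\tc\n"
    "Tehran is the capital of Iran.\tTehran has a large university.\tn\n"
)


def read_samples(handle: TextIO) -> List[NLISample]:
    """Read tab-separated rows with a header into NLISample records."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        NLISample(row["premise"], row["hypothesis"], row["label"])
        for row in reader
    ]


samples = read_samples(io.StringIO(TOY_TSV))
label_counts = Counter(sample.label for sample in samples)
print(label_counts)
```

With a real file, the same function could be called on an open file handle instead of the in-memory string, provided the header names match.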

Evaluation and Baseline Models

The paper evaluates a range of baselines on FarsTail, from traditional machine learning approaches to state-of-the-art deep learning models. Representation methods such as TF-IDF, word2vec, fastText, ELMo, BERT, and LASER were tested, along with classifiers such as SVM, LSTM, and GRU and NLI architectures such as DecompAtt, ESIM, HBMP, and ULMFiT. The BERT-based models, specifically ParsBERT and mBERT, performed best, with a highest test accuracy of 83.38%. This shows their effectiveness for Persian NLI while leaving, as the authors note, considerable room for improvement.
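
The headline figure above is plain classification accuracy. A minimal sketch of how accuracy and a per-class breakdown could be computed from gold labels and model predictions follows; the label codes and the toy predictions are invented for illustration and do not reproduce any result from the paper.

```python
from collections import Counter
from typing import Sequence


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of positions where the prediction equals the gold label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(g == p for g, p in zip(gold, pred))
    return correct / len(gold)


def confusion_counts(gold: Sequence[str], pred: Sequence[str]) -> Counter:
    """Count (gold, predicted) label pairs; a dictionary-style confusion matrix."""
    return Counter(zip(gold, pred))


# Toy labels: "e" entailment, "c" contradiction, "n" neutral (assumed codes).
gold = ["e", "c", "n", "e", "c", "n"]
pred = ["e", "c", "e", "e", "n", "n"]

acc = accuracy(gold, pred)
confusion = confusion_counts(gold, pred)
print(f"accuracy = {acc:.2%}")
print(confusion)
```

The pair counts make it easy to see which classes a model confuses, which a single accuracy number hides.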

Implications and Future Work

The development of FarsTail has several practical implications. First, it lays a foundation for developing and benchmarking NLP tools and models tailored to Persian, similar to initiatives in English and other widely used languages. Second, beyond the primary NLI task, the authors suggest uses for FarsTail in auxiliary NLP tasks such as question answering, summarization, and machine translation. The relatively large scale of the dataset and its availability in both Persian text and an indexed format make it a versatile resource for researchers.

Consideration of Dataset Bias

The paper also examines dataset biases, a known concern in NLI datasets, where models can exploit superficial patterns that do not generalize. By computing point-wise mutual information (PMI) between words and class labels, the authors identify word-class associations that may act as such clues in FarsTail. They also partition the test set into easy and hard subsets according to whether biased models succeed on each sample, which makes it possible to check how much of a model's accuracy depends on these clues.
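
The two ideas in this section can be sketched as follows. PMI is estimated here from simple counts over hypotheses, as log p(w, c) / (p(w) p(c)), and the easy/hard split marks a sample as easy when a biased model predicts it correctly. The count-based estimator without smoothing, the use of hypothesis text only, and the toy data are all assumptions; the paper's exact procedure may differ.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def word_class_pmi(samples: Sequence[Tuple[str, str]]) -> Dict[Tuple[str, str], float]:
    """PMI(word, label) over (text, label) pairs, counting each word once per text."""
    n = len(samples)
    word_counts: Counter = Counter()
    class_counts: Counter = Counter()
    joint_counts: Counter = Counter()
    for text, label in samples:
        class_counts[label] += 1
        for word in set(text.split()):
            word_counts[word] += 1
            joint_counts[(word, label)] += 1
    pmi = {}
    for (word, label), joint in joint_counts.items():
        p_joint = joint / n
        p_word = word_counts[word] / n
        p_class = class_counts[label] / n
        pmi[(word, label)] = math.log(p_joint / (p_word * p_class))
    return pmi


def split_easy_hard(gold: Sequence[str], biased_pred: Sequence[str]) -> Tuple[List[int], List[int]]:
    """Indices the biased model gets right (easy) and wrong (hard)."""
    easy: List[int] = []
    hard: List[int] = []
    for index, (g, p) in enumerate(zip(gold, biased_pred)):
        (easy if g == p else hard).append(index)
    return easy, hard


# Invented toy hypotheses with assumed label codes "e", "c", "n".
toy = [
    ("the cat is not sleeping", "c"),
    ("the dog is not outside", "c"),
    ("the cat is sleeping", "e"),
    ("the dog is outside", "e"),
    ("the bird is singing", "n"),
    ("the bird is not singing", "n"),
]
pmi = word_class_pmi(toy)

gold = [label for _, label in toy]
biased_pred = ["c", "c", "e", "e", "e", "c"]  # hypothetical biased-model output
easy, hard = split_easy_hard(gold, biased_pred)
print(pmi[("not", "c")], easy, hard)
```

A high PMI between a word such as a negation marker and one class is the kind of association a biased model can exploit; the hard subset then measures performance where that shortcut fails.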

Conclusion

FarsTail represents a significant step forward in NLP for the Persian language, offering a substantial resource for future research and development. As with any new dataset, there are ample opportunities to improve on the baseline performance with more advanced modeling techniques and to explore cross-linguistic applications. The availability of FarsTail addresses the need for comprehensive linguistic datasets, supporting better computational language models and more inclusive AI research. Future directions may include the creation of out-of-distribution challenge sets and further investigation of cross-lingual transfer.

Authors

Hossein Amirkhani
Mohammad AzariJafari
Zohreh Pourjafari
Soroush Faridan-Jahromi
Zeinab Kouhkan
Azadeh Amirak
