PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification (1908.11828v1)

Published 30 Aug 2019 in cs.CL

Abstract: Most existing work on adversarial data generation focuses on English. For example, PAWS (Paraphrase Adversaries from Word Scrambling) consists of challenging English paraphrase identification pairs from Wikipedia and Quora. We remedy this gap with PAWS-X, a new dataset of 23,659 human translated PAWS evaluation pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean. We provide baseline numbers for three models with different capacity to capture non-local context and sentence structure, and using different multilingual training and evaluation regimes. Multilingual BERT fine-tuned on PAWS English plus machine-translated data performs the best, with a range of 83.1-90.8 accuracy across the non-English languages and an average accuracy gain of 23% over the next best model. PAWS-X shows the effectiveness of deep, multilingual pre-training while also leaving considerable headroom as a new challenge to drive multilingual research that better captures structure and contextual information.

Authors (4)
  1. Yinfei Yang (73 papers)
  2. Yuan Zhang (331 papers)
  3. Chris Tar (8 papers)
  4. Jason Baldridge (45 papers)
Citations (341)

Summary

PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification

The paper introduces PAWS-X, a multilingual extension of the PAWS (Paraphrase Adversaries from Word Scrambling) dataset, specifically designed to confront the challenges of paraphrase identification in multiple languages. Addressing the limitation of existing adversarial datasets that were predominantly focused on English, PAWS-X represents a significant stride by incorporating six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean.

Overview of PAWS-X

PAWS-X comprises 23,659 human-translated evaluation sentence pairs, derived from English PAWS examples: pairs of sentences with high lexical overlap that may or may not be paraphrases of one another. This characteristic is pivotal for testing a model's ability to discern paraphrases through nuanced comprehension of sentence structure and contextual cues. The dataset challenges models to judge semantic equivalence despite high superficial similarity, providing a robust resource for evaluating multilingual text understanding.
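To make the adversarial property concrete, the sketch below measures token overlap on a hypothetical PAWS-style pair (the sentences are illustrative, not actual dataset entries): a word swap yields a non-paraphrase whose token set is identical to the original, so any purely lexical signal is useless.

```python
def token_overlap(s1: str, s2: str) -> float:
    """Jaccard overlap between the token sets of two sentences."""
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    return len(t1 & t2) / len(t1 | t2)

# Word-swapped non-paraphrase: same words, different meaning.
a = "Flights from New York to Florida"
b = "Flights from Florida to New York"

print(token_overlap(a, b))  # identical token sets -> 1.0
```

Because lexical overlap is maximal while the labels differ, only models that encode word order and structure can separate such pairs.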

Model Performance

The authors evaluated three baseline models with varying capacities to model sentence structure and capture context: BOW (Bag-of-Words with cosine similarity), ESIM (Enhanced Sequential Inference Model), and the pre-trained Multilingual BERT model. Multilingual BERT, fine-tuned on English PAWS plus machine-translated training data, performs best, achieving accuracies of 83.1-90.8 across the non-English languages. Notably, this translates to a 23% average accuracy gain over the next best model, highlighting the efficacy of deep, context-aware multilingual pre-training over simpler architectures.
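The weakest baseline's failure mode is easy to sketch. Below is a minimal bag-of-words cosine-similarity classifier (the paper's BOW model uses learned embeddings; raw token counts are used here purely for illustration, and the threshold is an assumption):

```python
import math
from collections import Counter

def cosine_bow(s1: str, s2: str) -> float:
    """Cosine similarity between bag-of-words count vectors."""
    c1, c2 = Counter(s1.lower().split()), Counter(s2.lower().split())
    dot = sum(c1[w] * c2[w] for w in c1)
    norm = (math.sqrt(sum(v * v for v in c1.values()))
            * math.sqrt(sum(v * v for v in c2.values())))
    return dot / norm if norm else 0.0

def predict_paraphrase(s1: str, s2: str, threshold: float = 0.8) -> bool:
    # A BOW model calls a pair a paraphrase when similarity exceeds a
    # tuned threshold; on PAWS-style word-swapped pairs this fails,
    # because non-paraphrases score near 1.0.
    return cosine_bow(s1, s2) >= threshold
```

On a word-swapped pair such as the flights example, `predict_paraphrase` returns `True` regardless of the true label, which is exactly why structure-aware models like ESIM and BERT pull ahead on this dataset.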

Cross-lingual Approaches

The paper investigates multiple cross-lingual training strategies, such as zero-shot learning and machine translation-assisted training. The "Merged" approach, where the model is trained on both English and machine-translated data, yields the highest performance, providing an 8.6% accuracy improvement over zero-shot learning. This result underscores the utility of incorporating multilingual data in training to enhance model generalization across languages.
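The "Merged" regime described above can be sketched as follows: translate the English training pairs into each target language, carry the labels over unchanged, and fine-tune one multilingual model on the union. The function and the stand-in translator below are illustrative assumptions, not the authors' actual pipeline.

```python
def build_merged_training_set(english_pairs, translate_fn, languages):
    """english_pairs: list of (sent1, sent2, label) tuples.

    Returns the English pairs plus one machine-translated copy per
    target language; labels transfer unchanged, only text is translated.
    """
    merged = list(english_pairs)
    for lang in languages:
        for s1, s2, label in english_pairs:
            merged.append((translate_fn(s1, lang), translate_fn(s2, lang), label))
    return merged

langs = ["fr", "es", "de", "zh", "ja", "ko"]  # the six PAWS-X languages
data = [("A flew to B", "B flew to A", 0)]
fake_translate = lambda s, lang: f"[{lang}] {s}"  # stand-in for an MT system

merged = build_merged_training_set(data, fake_translate, langs)
print(len(merged))  # 1 original + 6 translated copies = 7
```

The zero-shot regime corresponds to training on `data` alone and evaluating on the target languages; the reported 8.6% gap shows how much the translated copies help.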

Language-Specific Findings and Errors

Performance discrepancies were observed across languages. Models generally performed better on the Indo-European languages (German, French, Spanish) than on the CJK languages (Chinese, Japanese, Korean). This can be attributed both to machine translation quality and to the typological divergence of the CJK languages from English. Error analysis indicated that examples which failed consistently across all languages were often due to labeling errors in the dataset or to sentence-structure complexities that the models handled poorly.

Practical and Theoretical Implications

PAWS-X establishes a new benchmark not only for paraphrase identification but also for evaluating general multilingual natural language processing capabilities. From a practical standpoint, it challenges the current limitations of NLP models, encouraging developments that improve the robustness and cross-lingual adaptability of these technologies. The dataset provides a valuable resource for researchers striving to augment models with better structural understanding and context sensitivity, essential for advanced multilingual processing. Theoretical advancements based on PAWS-X can lead to models that better understand language universals, yielding broader applications and improved accuracy in multilingual tasks.

Future Directions

Future research could explore advanced domain adaptation techniques or the integration of more sophisticated linguistic features to improve model performance, particularly on typologically diverse languages. Additionally, improving machine translation quality could mitigate entity-translation inconsistencies and enhance cross-lingual effectiveness.

In conclusion, PAWS-X significantly contributes to multi-language adversarial NLP datasets, serving as a catalyst for advancing research in paraphrase detection and broader multilingual model applications.