
PAWS: Paraphrase Adversaries from Word Scrambling (1904.01130v1)

Published 1 Apr 2019 in cs.CL

Abstract: Existing paraphrase identification datasets lack sentence pairs that have high lexical overlap without being paraphrases. Models trained on such data fail to distinguish pairs like "flights from New York to Florida" and "flights from Florida to New York". This paper introduces PAWS (Paraphrase Adversaries from Word Scrambling), a new dataset with 108,463 well-formed paraphrase and non-paraphrase pairs with high lexical overlap. Challenging pairs are generated by controlled word swapping and back translation, followed by fluency and paraphrase judgments by human raters. State-of-the-art models trained on existing datasets have dismal performance on PAWS (<40% accuracy); however, including PAWS training data for these models improves their accuracy to 85% while maintaining performance on existing tasks. In contrast, models that do not capture non-local contextual information fail even with PAWS training examples. As such, PAWS provides an effective instrument for driving further progress on models that better exploit structure, context, and pairwise comparisons.

Citations (508)

Summary

  • The paper introduces PAWS, a novel dataset presenting adversarial examples with high lexical overlap to challenge paraphrase identification models.
  • It uses controlled word swapping and back translation to generate over 108,000 sentence pairs that expose weaknesses in models like BERT.
  • Integrating PAWS into training significantly boosts performance, underlining the importance of capturing complex word interactions and structure.

An Overview of PAWS: Paraphrase Adversaries from Word Scrambling

The paper introduces PAWS (Paraphrase Adversaries from Word Scrambling), a novel dataset designed to challenge and improve paraphrase identification models. The key contribution lies in addressing a significant gap in existing datasets, which lack sentence pairs with high lexical overlap that are not paraphrases, such as "flights from New York to Florida" versus "flights from Florida to New York".

Dataset Construction and Methodology

PAWS comprises 108,463 well-formed paraphrase and non-paraphrase pairs, characterized by high lexical overlap and non-trivial reordering of words. The dataset is built with a method that combines controlled word swapping and back translation, after which human raters assess each pair's fluency and paraphrase status. Controlled word swapping generates adversarial examples by permuting word order, with a language model used to ensure the results remain grammatical. Back translation supplies paraphrases that retain high lexical overlap while varying word order.
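The swap-based generation step can be illustrated with a minimal sketch (the helper name `swap_spans` is hypothetical; the paper's actual pipeline additionally scores candidate swaps with a language model and combines them with back translation):

```python
def swap_spans(tokens, span_a, span_b):
    """Swap two non-overlapping (start, end) token spans to produce a
    high-lexical-overlap candidate with a different meaning."""
    (a0, a1), (b0, b1) = sorted([span_a, span_b])
    return (tokens[:a0] + tokens[b0:b1] +
            tokens[a1:b0] + tokens[a0:a1] + tokens[b1:])

orig = "flights from New York to Florida".split()
# Swap the two location spans: "New York" (tokens 2-3) and "Florida" (token 5).
adv = swap_spans(orig, (2, 4), (5, 6))
print(" ".join(adv))                 # flights from Florida to New York
print(sorted(orig) == sorted(adv))   # True: identical bag of words, different meaning
```

The resulting pair shares every token yet is not a paraphrase, which is exactly the kind of example existing datasets lack.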

Two datasets are constructed: one using Quora Question Pairs (QQP) and the other using Wikipedia text. The inclusion of examples with balanced class labels ensures a robust testing ground for models, highlighting their ability to understand and utilize word order and structure.

Experimental Insights

The paper demonstrates that state-of-the-art models such as BERT, when trained exclusively on existing resources, perform poorly on PAWS, achieving below 40% accuracy. Incorporating PAWS into the training regimen significantly improves results, raising BERT's accuracy to 85% on the challenging PAWS pairs while maintaining high performance on traditional datasets.

Another notable finding is the differential performance across model complexities. Models like DIIN show remarkable improvements compared to simpler models like Bag-of-Words (BOW). The research highlights that models capturing non-local contextual information and complex word interactions perform best when trained with PAWS data.
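The failure of bag-of-words models can be seen directly: a swapped pair has an identical token multiset, so any representation built purely from unordered token counts assigns both sentences the same vector. A brief illustrative sketch:

```python
from collections import Counter
import math

def bow_cosine(s1, s2):
    """Cosine similarity between unordered token-count (bag-of-words) vectors."""
    c1, c2 = Counter(s1.split()), Counter(s2.split())
    dot = sum(c1[w] * c2[w] for w in c1)
    norm = (math.sqrt(sum(v * v for v in c1.values())) *
            math.sqrt(sum(v * v for v in c2.values())))
    return dot / norm

a = "flights from New York to Florida"
b = "flights from Florida to New York"   # high overlap, NOT a paraphrase
print(bow_cosine(a, b))  # 1.0 -- the pair is indistinguishable to a BOW model
```

Since the two sentences are maximally similar under any order-insensitive feature, only models that encode word order and non-local context (such as DIIN or BERT) can benefit from PAWS training signal.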

Practical and Theoretical Implications

Practically, PAWS serves as a valuable tool for augmenting and testing paraphrase identification models, ensuring they can accurately discern meaning despite high lexical overlap. This is particularly relevant for applications in machine translation, search engines, and natural language understanding systems where nuanced comprehension is critical.

Theoretically, the dataset fosters research into models that better exploit syntactic structure, and it demonstrates how adversarial datasets can expose and address model limitations. The findings emphasize the need for continued development of representation learning that is sensitive to both lexical semantics and sentence structure.

Future Directions

Looking ahead, the research community could integrate PAWS into larger benchmark suites to test model robustness across diverse linguistic phenomena. Extending the approach to other languages and cultural contexts could also strengthen the generalizability of these findings. The creation of similar datasets across different linguistic constructs and domains remains an open area of research, offering opportunities to further refine language models.

In conclusion, PAWS presents a significant advancement in the quest for enhanced paraphrase identification, offering a challenging and insightful resource for the improvement of current and future models.