
A Continuously Growing Dataset of Sentential Paraphrases (1708.00391v1)

Published 1 Aug 2017 in cs.CL

Abstract: A major challenge in paraphrase research is the lack of parallel corpora. In this paper, we present a new method to collect large-scale sentential paraphrases from Twitter by linking tweets through shared URLs. The main advantage of our method is its simplicity, as it gets rid of the classifier or human in the loop needed to select data before annotation and subsequent application of paraphrase identification algorithms in the previous work. We present the largest human-labeled paraphrase corpus to date of 51,524 sentence pairs and the first cross-domain benchmarking for automatic paraphrase identification. In addition, we show that more than 30,000 new sentential paraphrases can be easily and continuously captured every month at ~70% precision, and demonstrate their utility for downstream NLP tasks through phrasal paraphrase extraction. We make our code and data freely available.

Citations (161)

Summary

  • The paper introduces a novel methodology to build a large, continuously growing dataset of sentential paraphrases by linking tweets that share the same URLs.
  • The corpus comprises 51,524 human-labeled sentence pairs, the largest such resource to date, and the collection method can capture roughly 30,000 new paraphrases per month at about 70% precision.
  • This scalable dataset supports progress in paraphrase identification algorithms, a range of NLP applications, and future research directions, including multilingual analysis.

A Continuously Growing Dataset of Sentential Paraphrases

The paper, authored by Lan et al., addresses a significant challenge in paraphrase research: the scarcity of parallel corpora of sentential paraphrases. The researchers introduce a method for collecting large-scale sentential paraphrases from Twitter, substantially expanding the available paraphrase resources. Both paraphrase identification algorithms and downstream NLP applications stand to benefit from this abundant and continuously updated corpus.

Methodology and Dataset Construction

The primary advance of this paper is its method of harvesting paraphrases from Twitter by linking tweets through shared URLs: tweets that point to the same article are treated as paraphrase candidates. This removes the classifier-based or human-in-the-loop data selection, and the bias it introduces, that characterized previous paraphrase corpora such as MSRP and PIT-2015. Pairing tweets that share URLs allowed the researchers to gather a corpus of 51,524 sentence pairs, the largest human-labeled paraphrase corpus to date.
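The core grouping step can be sketched in a few lines. This is an illustrative reconstruction, not the authors' released code; the tweet records and URLs below are placeholders:

```python
from collections import defaultdict
from itertools import combinations

# Placeholder tweet records as (text, url) pairs; in the real pipeline
# these would come from the Twitter stream, with URLs resolved to a
# canonical form.
tweets = [
    ("Scientists report water vapor on a distant exoplanet", "http://example.com/a"),
    ("Water vapor found on faraway exoplanet, scientists say", "http://example.com/a"),
    ("Local team wins the championship", "http://example.com/b"),
]

def candidate_pairs(tweets):
    """Group tweets by the URL they share, then emit every pair of
    tweets pointing at the same article as a paraphrase candidate."""
    by_url = defaultdict(list)
    for text, url in tweets:
        by_url[url].append(text)
    for texts in by_url.values():
        yield from combinations(texts, 2)

for a, b in candidate_pairs(tweets):
    print(a, "<->", b)
```

Because grouping by URL needs no trained classifier, the same loop keeps producing candidates as new tweets arrive, which is what makes the corpus continuously growing.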

The paper details the corpus construction process, including filtering for quality and reducing redundancy, in particular discarding automatic and manual retweets. A critical property of the corpus is its capacity for continuous growth: approximately 30,000 new sentential paraphrases can be identified every month at roughly 70% precision, yielding an ever-expanding paraphrase resource.
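One simple way to approximate the redundancy filtering is to normalize away retweet markers, mentions, and URLs, then drop near-identical pairs. The normalization rules and the 0.9 overlap threshold below are illustrative assumptions, not the paper's exact filters:

```python
import re

def normalize(text):
    """Lowercase and strip RT markers, @-mentions, and URLs so that
    automatic retweets collapse onto their source tweet."""
    text = re.sub(r"\bRT\b|@\w+|https?://\S+", " ", text, flags=re.I)
    return re.sub(r"\s+", " ", text).strip().lower()

def jaccard(a_words, b_words):
    union = a_words | b_words
    return len(a_words & b_words) / len(union) if union else 1.0

def keep_pair(a, b, max_overlap=0.9):
    """Discard pairs that are (near-)identical after normalization;
    such pairs are retweets or trivial copies, not useful paraphrases."""
    na, nb = normalize(a), normalize(b)
    return na != nb and jaccard(set(na.split()), set(nb.split())) <= max_overlap
```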

Empirical Evaluation

The paper provides a comprehensive empirical evaluation comparing the new corpus against the existing MSRP and PIT-2015 datasets along several dimensions. The comparison reveals notable differences in paraphrase phenomena: elaboration, phrasal paraphrasing, and anaphora are prevalent in the Twitter URL corpus but far less pronounced in MSRP and PIT-2015.

Moreover, the authors benchmark several automatic paraphrase identification models across these datasets, constituting the first cross-domain comparison of its kind. Models such as DeepPairwiseWord perform robustly because their fine-grained word interaction mechanisms handle the lexically divergent paraphrases prevalent in the Twitter data.
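Cross-domain benchmarking itself follows a simple train-on-one, test-on-another pattern. The sketch below uses a logistic-regression baseline over two overlap features as a stand-in for the far stronger neural models benchmarked in the paper; loading of the labeled sentence pairs is assumed:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def features(a, b):
    """Two toy features: Jaccard word overlap and relative length gap."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = len(wa | wb)
    return [len(wa & wb) / union if union else 0.0,
            abs(len(wa) - len(wb)) / max(union, 1)]

def cross_domain_f1(train_pairs, train_labels, test_pairs, test_labels):
    """Train on one corpus (e.g. Twitter URL) and evaluate F1 on
    another (e.g. MSRP) to measure cross-domain transfer."""
    clf = LogisticRegression().fit(
        [features(a, b) for a, b in train_pairs], train_labels)
    preds = clf.predict([features(a, b) for a, b in test_pairs])
    return f1_score(test_labels, preds)
```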

Implications and Future Directions

The implications of this research are significant for the NLP community. The ability to continuously harvest large quantities of sentential paraphrases opens pathways for advancing paraphrase identification algorithms and any application that depends on semantic similarity assessment. Furthermore, Lan et al. extract phrasal paraphrases from the sentence pairs via word alignment, with promising results indicating that high-quality phrasal paraphrases can be obtained from the corpus.
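Given a word alignment between two paraphrase sentences, phrase pairs can be read off with the standard consistency criterion from phrase-based translation. The following is a simplified sketch of that general technique, not the paper's exact pipeline, which relies on a trained monolingual word aligner:

```python
def extract_phrase_pairs(src, tgt, alignment, max_len=4):
    """Extract phrase pairs consistent with a word alignment.

    src, tgt: tokenized sentences (lists of words).
    alignment: list of (i, j) pairs linking src[i] to tgt[j].
    """
    pairs = set()
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions aligned to the source span [i1, i2].
            tgt_pos = [j for i, j in alignment if i1 <= i <= i2]
            if not tgt_pos:
                continue
            j1, j2 = min(tgt_pos), max(tgt_pos)
            # Consistency: no word inside the target span may align
            # to a word outside the source span.
            if all(i1 <= i <= i2 for i, j in alignment if j1 <= j <= j2):
                pairs.add((" ".join(src[i1:i2 + 1]),
                           " ".join(tgt[j1:j2 + 1])))
    return pairs

# Toy example with a hypothetical alignment:
src = "water vapor found on exoplanet".split()
tgt = "scientists detect water vapor on exoplanet".split()
alignment = [(0, 2), (1, 3), (2, 1), (3, 4), (4, 5)]
print(extract_phrase_pairs(src, tgt, alignment))
```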

The paper sets the stage for future research, including expansion into languages beyond English and the exploration of language-independent paraphrase identification frameworks. Because the URL-linking method requires no language-specific tooling, it applies to any language present on social media platforms, which could substantially broaden multilingual paraphrase research and applications.

Conclusion

Lan et al.'s work overcomes long-standing obstacles in paraphrase research by presenting a scalable and effective strategy for dataset construction. Their methodology contributes not only a larger corpus but also a mechanism for its sustained growth. This foundational work paves the way for improved paraphrase identification methods and more nuanced applications across the broader NLP field.