ParaAMR: A Large-Scale Syntactically Diverse Paraphrase Dataset by AMR Back-Translation (2305.16585v1)

Published 26 May 2023 in cs.CL

Abstract: Paraphrase generation is a long-standing task in NLP. Supervised paraphrase generation models, which rely on human-annotated paraphrase pairs, are cost-inefficient and hard to scale up. On the other hand, automatically annotated paraphrase pairs (e.g., by machine back-translation) usually suffer from a lack of syntactic diversity: the generated paraphrase sentences are very similar to the source sentences in terms of syntax. In this work, we present ParaAMR, a large-scale syntactically diverse paraphrase dataset created by abstract meaning representation back-translation. Our quantitative analysis, qualitative examples, and human evaluation demonstrate that the paraphrases of ParaAMR are syntactically more diverse than those of existing large-scale paraphrase datasets while preserving good semantic similarity. In addition, we show that ParaAMR can be used to improve three NLP tasks: learning sentence embeddings, syntactically controlled paraphrase generation, and data augmentation for few-shot learning. Our results thus showcase the potential of ParaAMR for improving various NLP applications.
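
The back-translation idea in the abstract is a parse-then-generate round trip: a sentence is parsed into an abstract meaning representation (AMR) graph, and text is then generated back from that graph, so meaning is preserved while the surface syntax is free to vary. Below is a minimal sketch of that round trip using the open-source amrlib library; the library choice, model defaults, and example sentence are illustrative assumptions, not the authors' exact pipeline (the paper's dataset construction is described in the full text).

    # Minimal sketch of AMR back-translation, assuming the `amrlib`
    # library (pip install amrlib) with its pretrained parse and
    # generation models installed. Not the authors' exact pipeline.
    import amrlib

    # Load sentence-to-graph (parsing) and graph-to-sentence
    # (generation) models.
    stog = amrlib.load_stog_model()   # text -> AMR graphs
    gtos = amrlib.load_gtos_model()   # AMR graphs -> text

    source = ["The committee approved the proposal after a long debate."]

    # Parse each sentence into a PENMAN-notation AMR graph string.
    graphs = stog.parse_sents(source)

    # Generate text back from the graphs; the output's syntax can
    # differ from the source while the meaning is largely preserved.
    paraphrases, _clipped = gtos.generate(graphs)
    for src, para in zip(source, paraphrases):
        print(f"source:     {src}")
        print(f"paraphrase: {para}")

Because the AMR graph abstracts away word order and surface syntax, generating from it tends to produce paraphrases that are structurally further from the source than those produced by machine-translation back-translation, which is the syntactic diversity the dataset targets.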

Authors (6)
  1. Kuan-Hao Huang (33 papers)
  2. Varun Iyer (5 papers)
  3. I-Hung Hsu (21 papers)
  4. Anoop Kumar (15 papers)
  5. Kai-Wei Chang (292 papers)
  6. Aram Galstyan (142 papers)
Citations (7)