
ParaFusion: A Large-Scale LLM-Driven English Paraphrase Dataset Infused with High-Quality Lexical and Syntactic Diversity (2404.12010v1)

Published 18 Apr 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Paraphrase generation is a pivotal task in NLP. Existing datasets in the domain lack syntactic and lexical diversity, resulting in paraphrases that closely resemble the source sentences. Moreover, these datasets often contain hate speech and noise, and may unintentionally include non-English sentences. This research introduces ParaFusion, a large-scale, high-quality English paraphrase dataset developed using Large Language Models (LLMs) to address these challenges. ParaFusion augments existing datasets with high-quality data, significantly enhancing both lexical and syntactic diversity while maintaining close semantic similarity. It also mitigates the presence of hate speech and reduces noise, ensuring a cleaner and more focused English dataset. Results show that ParaFusion offers at least a 25% improvement in both syntactic and lexical diversity, measured across several metrics for each data source. The paper also aims to set a gold standard for paraphrase evaluation, as it contains one of the most comprehensive evaluation strategies to date. The results underscore the potential of ParaFusion as a valuable resource for improving NLP applications.
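The abstract reports lexical-diversity gains measured across several metrics. As an illustrative sketch only (the paper's exact metrics are not specified here), one simple way to quantify how much a paraphrase's vocabulary departs from its source is the Jaccard distance over token sets:

```python
# Hedged sketch: a simple lexical-diversity measure between a source
# sentence and a paraphrase. This Jaccard-distance formulation is an
# assumption for illustration, not the authors' exact metric.

def lexical_diversity(source: str, paraphrase: str) -> float:
    """Return 1 minus the Jaccard similarity of lowercase token sets.

    0.0 means identical vocabulary; values near 1.0 mean the paraphrase
    shares almost no tokens with the source (high lexical diversity).
    """
    src = set(source.lower().split())
    par = set(paraphrase.lower().split())
    if not src and not par:
        return 0.0
    return 1.0 - len(src & par) / len(src | par)

# A near-copy paraphrase scores low; a heavily reworded one scores high.
low = lexical_diversity("the cat sat on the mat",
                        "the cat sat on a mat")
high = lexical_diversity("the cat sat on the mat",
                         "a feline rested atop the rug")
print(round(low, 2), round(high, 2))  # prints "0.17 0.9"
```

A dataset whose paraphrases cluster near 0.0 on such a measure would "closely resemble the source sentences," which is exactly the shortcoming ParaFusion targets.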

