ParaFusion: A Large-Scale LLM-Driven English Paraphrase Dataset Infused with High-Quality Lexical and Syntactic Diversity (2404.12010v1)
Abstract: Paraphrase generation is a pivotal task in NLP. Existing datasets in the domain lack syntactic and lexical diversity, resulting in paraphrases that closely resemble the source sentences. Moreover, these datasets often contain hate speech and noise, and may unintentionally include non-English sentences. This research introduces ParaFusion, a large-scale, high-quality English paraphrase dataset developed using Large Language Models (LLMs) to address these challenges. ParaFusion augments existing datasets with high-quality data, significantly enhancing both lexical and syntactic diversity while maintaining close semantic similarity. It also mitigates the presence of hate speech and reduces noise, ensuring a cleaner and more focused English dataset. Results show that ParaFusion offers at least a 25% improvement in both syntactic and lexical diversity, measured across several metrics for each data source. The paper also aims to set a gold standard for paraphrase evaluation, as it contains one of the most comprehensive evaluation strategies to date. The results underscore the potential of ParaFusion as a valuable resource for improving NLP applications.
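As a rough illustration of what "lexical diversity" between a source sentence and its paraphrase can mean, the sketch below computes token-level Jaccard distance: 0 when both sentences use exactly the same words, 1 when they share none. This is a minimal assumed proxy for exposition only; the paper itself evaluates with a broader suite of lexical and syntactic metrics (e.g. BLEU-style scores and tree edit distance), not this particular function.

```python
def lexical_diversity(source: str, paraphrase: str) -> float:
    """Hypothetical lexical-diversity proxy: 1 minus the Jaccard
    similarity of the lowercased token sets of the two sentences.

    Returns 0.0 for identical vocabularies and 1.0 for disjoint ones.
    """
    src = set(source.lower().split())
    par = set(paraphrase.lower().split())
    if not src and not par:
        # Two empty strings are trivially identical.
        return 0.0
    # Jaccard distance = 1 - |intersection| / |union|
    return 1.0 - len(src & par) / len(src | par)


# A paraphrase that reuses the source's words scores low;
# one that rewords aggressively scores high.
print(lexical_diversity("the cat sat", "the cat sat"))        # 0.0
print(lexical_diversity("the cat sat", "the feline rested"))  # 0.8
```

A higher average score over a dataset would indicate paraphrases that diverge more from their sources at the word level, which is the property the abstract claims ParaFusion improves.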