Text Style Transfer Evaluation Using Large Language Models (2308.13577v2)

Published 25 Aug 2023 in cs.CL and cs.LG

Abstract: Evaluating Text Style Transfer (TST) is a complex task due to its multifaceted nature. The quality of the generated text is measured based on challenging factors, such as style transfer accuracy, content preservation, and overall fluency. While human evaluation is considered to be the gold standard in TST assessment, it is costly and often hard to reproduce. Therefore, automated metrics are prevalent in these domains. Nevertheless, it remains unclear whether these automated metrics correlate with human evaluations. Recent strides in LLMs have showcased their capacity to match and even exceed average human performance across diverse, unseen tasks. This suggests that LLMs could be a feasible alternative to human evaluation and other automated metrics in TST evaluation. We compare the results of different LLMs in TST evaluation using multiple input prompts. Our findings highlight a strong correlation between LLM judgments (even with zero-shot prompting) and human evaluation, showing that LLMs often outperform traditional automated metrics. Furthermore, we introduce the concept of prompt ensembling, demonstrating its ability to enhance the robustness of TST evaluation. This research contributes to the ongoing evaluation of LLMs in diverse tasks, offering insights into successful outcomes and areas of limitation.
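The prompt-ensembling idea described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual method: the prompt templates are invented for this sketch, and `score_with_llm` stands in for whatever LLM call returns a numeric rating for a single prompt.

```python
from statistics import mean
from typing import Callable, List

# Illustrative prompt templates (assumptions, not the paper's prompts).
# Each phrases the TST evaluation task differently; ensembling averages
# the LLM's ratings across these phrasings for robustness.
PROMPT_TEMPLATES = [
    "Rate from 1 to 5 how well the target style was achieved.\n"
    "Source: {src}\nOutput: {out}",
    "On a scale of 1 to 5, how much of the source content is preserved?\n"
    "Source: {src}\nOutput: {out}",
    "Score the fluency of the following text from 1 to 5.\nOutput: {out}",
]


def ensemble_score(
    src: str,
    out: str,
    score_with_llm: Callable[[str], float],
    templates: List[str] = PROMPT_TEMPLATES,
) -> float:
    """Average an LLM's numeric ratings over several prompt phrasings.

    `score_with_llm` is a hypothetical callable wrapping any LLM API;
    it takes a filled-in prompt and returns a numeric score.
    """
    scores = [score_with_llm(t.format(src=src, out=out)) for t in templates]
    return mean(scores)
```

In practice `score_with_llm` would wrap an API call and parse the model's reply into a number; averaging over prompt variants smooths out the sensitivity of any single prompt wording.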

Authors (4)
  1. Phil Ostheimer
  2. Mayank Nagda
  3. Marius Kloft
  4. Sophie Fellenz