
LMStyle Benchmark: Evaluating Text Style Transfer for Chatbots (2403.08943v1)

Published 13 Mar 2024 in cs.CL

Abstract: Since the breakthrough of ChatGPT, large language models (LLMs) have garnered significant attention in the research community. With the development of LLMs, the question of text style transfer for conversational models has emerged as a natural extension, where chatbots may possess their own styles or even characters. However, standard evaluation metrics have not yet been established for this new setting. This paper aims to address this issue by proposing the LMStyle Benchmark, a novel evaluation framework applicable to chat-style text style transfer (C-TST) that can measure the quality of style transfer for LLMs in an automated and scalable manner. In addition to conventional style strength metrics, LMStyle Benchmark further considers a novel class of metrics called appropriateness, a high-level metric that takes into account coherence, fluency, and other implicit factors without the aid of reference samples. Our experiments demonstrate that the new evaluation methods introduced by LMStyle Benchmark have a higher correlation with human judgments in terms of appropriateness. Based on LMStyle Benchmark, we present a comprehensive list of evaluation results for popular LLMs, including LLaMA, Alpaca, and Vicuna, reflecting their stylistic properties, such as formality and sentiment strength, along with their appropriateness.
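
As an illustration of what "correlation with human judgments" can mean in practice, the following minimal sketch compares two automated metrics against human ratings using Kendall's tau rank correlation. This is an assumption for illustration only: the abstract does not state which correlation measure the paper uses, and the function name `kendall_tau`, the variable names, and all scores below are hypothetical, not taken from the paper.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical data: human appropriateness ratings for five responses,
# and the scores two made-up automated metrics assign to the same responses.
human = [1, 2, 3, 4, 5]
metric_a = [0.10, 0.40, 0.35, 0.80, 0.90]
metric_b = [0.50, 0.20, 0.90, 0.10, 0.70]

tau_a = kendall_tau(human, metric_a)
tau_b = kendall_tau(human, metric_b)
print(f"metric_a tau = {tau_a:.2f}, metric_b tau = {tau_b:.2f}")
```

Under this sketch, the metric with the higher tau ranks responses more consistently with the human raters, which is the sense in which one evaluation method would be preferred over another.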

Authors (1)
  1. Jianlin Chen (4 papers)
