
Self-Improving Customer Review Response Generation Based on LLMs (2405.03845v1)

Published 6 May 2024 in cs.CL and cs.AI

Abstract: Previous studies have demonstrated that proactively responding to user reviews positively influences app users' perceptions and encourages them to submit revised ratings. Nevertheless, developers struggle to manage a high volume of reviews, particularly for popular apps that receive a substantial influx of reviews daily. Consequently, there is a demand for automated solutions that streamline the process of responding to user reviews. To address this, we have developed a new system for generating automatic responses by leveraging user-contributed documents with the help of retrieval-augmented generation (RAG) and advanced LLMs. Our solution, named SCRABLE, is an adaptive customer review response automation system that improves itself through self-optimizing prompts and an LLM-based judging mechanism. Additionally, we introduce an automatic scoring mechanism that mimics the role of a human evaluator to assess the quality of responses generated in customer review domains. Extensive experiments and analyses conducted on real-world datasets reveal that our method is effective in producing high-quality responses, yielding an improvement of more than 8.5% over the baseline. Further validation through manual examination of the generated responses underscores the efficacy of our proposed system.
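The abstract describes three cooperating pieces: RAG over user-contributed documents, an LLM-based judge, and self-optimizing prompts. The sketch below shows one plausible way such a loop could be wired together; it is a minimal illustration under stated assumptions, not the paper's implementation. The function names (call_llm, retrieve, judge_response, generate_response), the keyword-overlap retriever, the 1-10 judging rubric, the score threshold, and the refinement wording are all hypothetical stand-ins.

```python
# Illustrative sketch of a SCRABLE-style self-improving response loop.
# All names and parameters here are assumptions, not taken from the paper.
from typing import Callable, List


def retrieve(review: str, documents: List[str], k: int = 3) -> List[str]:
    """Toy keyword-overlap retrieval standing in for a real RAG retriever."""
    review_terms = set(review.lower().split())
    ranked = sorted(
        documents,
        key=lambda d: len(review_terms & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def judge_response(review: str, response: str, call_llm: Callable[[str], str]) -> float:
    """LLM-as-judge: ask the model for a 1-10 quality score (assumed rubric)."""
    verdict = call_llm(
        "Rate this developer reply to an app review from 1 to 10.\n"
        f"Review: {review}\nReply: {response}\nScore:"
    )
    try:
        return float(verdict.strip().split()[0])
    except (ValueError, IndexError):
        return 0.0


def generate_response(
    review: str,
    documents: List[str],
    call_llm: Callable[[str], str],
    max_rounds: int = 3,
    threshold: float = 8.0,
) -> str:
    """Generate, judge, and refine the prompt until the judge is satisfied."""
    context = "\n".join(retrieve(review, documents))
    prompt = (
        "Using the context below, write a polite, helpful developer reply.\n"
        f"Context:\n{context}\nReview: {review}\nReply:"
    )
    best_response, best_score = "", -1.0
    for _ in range(max_rounds):
        response = call_llm(prompt)
        score = judge_response(review, response, call_llm)
        if score > best_score:
            best_response, best_score = response, score
        if score >= threshold:
            break
        # Self-optimizing prompt step: fold the judge's verdict back into the prompt.
        prompt += (
            f"\nThe previous reply scored {score}/10. "
            "Improve specificity and empathy.\nReply:"
        )
    return best_response
```

In practice, call_llm would be wired to whatever chat-completion client is available; the loop structure (generate, judge, refine prompt, repeat) is the part the abstract implies, while the retrieval and scoring details above are placeholders.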

