Self-Improving Customer Review Response Generation Based on LLMs (2405.03845v1)
Abstract: Previous studies have demonstrated that proactive interaction with user reviews has a positive impact on how app users perceive an app and encourages them to submit revised ratings. Nevertheless, developers face challenges in managing a high volume of reviews, particularly for popular apps that receive a substantial influx of reviews daily. Consequently, there is a demand for automated solutions that streamline the process of responding to user reviews. To address this, we have developed a new system for generating automatic responses by leveraging user-contributed documents with the help of retrieval-augmented generation (RAG) and advanced LLMs. Our solution, named SCRABLE, is an adaptive customer review response automation system that improves itself through self-optimizing prompts and an LLM-based judging mechanism. Additionally, we introduce an automatic scoring mechanism that mimics the role of a human evaluator to assess the quality of responses generated in customer review domains. Extensive experiments and analyses conducted on real-world datasets reveal that our method is effective in producing high-quality responses, yielding an improvement of more than 8.5% over the baseline. Further validation through manual examination of the generated responses underscores the efficacy of our proposed system.
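The abstract describes a retrieve-generate-judge-refine loop. Below is a minimal, illustrative Python sketch of such a pipeline, written under assumptions: the function names (`call_llm`, `retrieve_docs`), the 1-10 judging rubric, and the stopping thresholds are hypothetical placeholders and are not taken from the paper's actual implementation.

```python
# Illustrative sketch of a RAG + self-optimizing-prompt loop in the spirit of SCRABLE.
# All names (call_llm, retrieve_docs, score thresholds) are assumptions for illustration,
# not the authors' implementation.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    response: str
    score: float
    feedback: str


def generate_response(review: str,
                      prompt_template: str,
                      retrieve_docs: Callable[[str], List[str]],
                      call_llm: Callable[[str], str]) -> str:
    """RAG step: retrieve user-contributed documents and generate a reply."""
    context = "\n".join(retrieve_docs(review))
    prompt = prompt_template.format(context=context, review=review)
    return call_llm(prompt)


def judge_response(review: str, response: str,
                   call_llm: Callable[[str], str]) -> Candidate:
    """LLM-as-judge: return a 1-10 score plus textual feedback (hypothetical rubric)."""
    verdict = call_llm(
        "Rate the reply to this app review on a 1-10 scale and explain briefly.\n"
        f"Review: {review}\nReply: {response}\n"
        "Format: <score>|<feedback>"
    )
    score_str, _, feedback = verdict.partition("|")
    try:
        score = float(score_str.strip())
    except ValueError:
        score = 0.0
    return Candidate(response=response, score=score, feedback=feedback.strip())


def self_improve(review: str,
                 prompt_template: str,
                 retrieve_docs: Callable[[str], List[str]],
                 call_llm: Callable[[str], str],
                 max_rounds: int = 3,
                 target_score: float = 8.0) -> Candidate:
    """Generate, judge, and rewrite the prompt until the judge is satisfied."""
    best = Candidate(response="", score=-1.0, feedback="")
    for _ in range(max_rounds):
        response = generate_response(review, prompt_template, retrieve_docs, call_llm)
        candidate = judge_response(review, response, call_llm)
        if candidate.score > best.score:
            best = candidate
        if candidate.score >= target_score:
            break
        # Prompt self-optimization: ask the LLM to revise the prompt using judge feedback.
        prompt_template = call_llm(
            "Improve this prompt so future replies address the feedback. "
            "Keep the {context} and {review} placeholders.\n"
            + f"Prompt: {prompt_template}\nFeedback: {candidate.feedback}"
        )
    return best
```

A caller would supply its own `retrieve_docs` (e.g., a vector-store lookup over user-contributed documents) and `call_llm` (any chat-completion wrapper); the loop then keeps the best-scoring response while iteratively rewriting the generation prompt, which is the self-improving behavior the abstract attributes to SCRABLE.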