Synth-Empathy: Towards High-Quality Synthetic Empathy Data (2407.21669v2)
Abstract: With the rapid advancement of LLMs in recent years, strong empathetic response capability has become a crucial requirement. Consequently, managing and understanding empathetic datasets has gained increasing significance. However, empathetic data are typically human-labeled, which leaves datasets too small and consumes substantial human labor. In this work, we present Synth-Empathy, an LLM-based pipeline for data generation followed by quality and diversity selection, which automatically generates high-quality empathetic data while discarding low-quality data. Using data generated by a model with low empathetic ability, we further improve empathetic response performance and achieve state-of-the-art (SoTA) results across multiple benchmarks. Moreover, our model achieves SoTA performance on various human evaluation benchmarks, demonstrating its effectiveness and robustness in real-world applications. Finally, we show the trade-off between data quantity and quality, providing insights into empathetic data generation and selection.
Authors:
- Hao Liang
- Linzhuang Sun
- Jingxuan Wei
- Xijie Huang
- Linkun Sun
- Bihui Yu
- Conghui He
- Wentao Zhang