
Synth-Empathy: Towards High-Quality Synthetic Empathy Data (2407.21669v2)

Published 31 Jul 2024 in cs.CL and cs.LG

Abstract: In recent years, with the rapid advancements in LLMs, achieving excellent empathetic response capabilities has become a crucial prerequisite. Consequently, managing and understanding empathetic datasets have gained increasing significance. However, empathetic data are typically human-labeled, leading to insufficient datasets and wasted human labor. In this work, we present Synth-Empathy, an LLM-based data generation and quality and diversity selection pipeline that automatically generates high-quality empathetic data while discarding low-quality data. With the data generated from a low empathetic model, we are able to further improve empathetic response performance and achieve state-of-the-art (SoTA) results across multiple benchmarks. Moreover, our model achieves SoTA performance on various human evaluation benchmarks, demonstrating its effectiveness and robustness in real-world applications. Furthermore, we show the trade-off between data quantity and quality, providing insights into empathetic data generation and selection.

Authors (8)
  1. Hao Liang
  2. Linzhuang Sun
  3. Jingxuan Wei
  4. Xijie Huang
  5. Linkun Sun
  6. Bihui Yu
  7. Conghui He
  8. Wentao Zhang
