STICKERCONV: Generating Multimodal Empathetic Responses from Scratch (2402.01679v2)
Abstract: Stickers, while widely recognized for enhancing empathetic communication in online interactions, remain underexplored in current empathetic dialogue research, largely because comprehensive datasets are lacking. In this paper, we introduce the Agent for STICKERCONV (Agent4SC), which uses collaborative agent interactions to realistically simulate human sticker usage, thereby enhancing multimodal empathetic communication. Building on this foundation, we develop a multimodal empathetic dialogue dataset, STICKERCONV, comprising 12.9K dialogue sessions, 5.8K unique stickers, and 2K diverse conversational scenarios. This dataset serves as a benchmark for multimodal empathetic generation. We further propose PErceive and Generate Stickers (PEGS), a multimodal empathetic response generation framework, complemented by a comprehensive set of LLM-based empathy evaluation metrics. Our experiments demonstrate PEGS's effectiveness in generating contextually relevant and emotionally resonant multimodal empathetic responses, contributing to the advancement of more nuanced and engaging empathetic dialogue systems.