Synthesizing Sentiment-Controlled Feedback For Multimodal Text and Image Data (2402.07640v3)
Abstract: The ability to generate sentiment-controlled feedback in response to multimodal inputs comprising text and images addresses a critical gap in human-computer interaction. This capability allows systems to provide empathetic, accurate, and engaging responses, with useful applications in education, healthcare, marketing, and customer service. To this end, we have constructed a large-scale Controllable Multimodal Feedback Synthesis (CMFeed) dataset and propose a controllable feedback synthesis system. The system comprises an encoder, a decoder, and a controllability block for textual and visual inputs. It extracts textual features with a transformer network and visual features with Faster R-CNN, and combines them to generate feedback. The CMFeed dataset includes images, texts, reactions to the posts, human comments with relevance scores, and reactions to these comments. These reactions train the model to produce feedback with specified sentiments, achieving a sentiment classification accuracy of 77.23%, which is 18.82% higher than the accuracy without controllability. The system also incorporates a similarity module that assesses feedback relevance through rank-based metrics and an interpretability technique that analyzes the contributions of textual and visual features during feedback generation. Access to the CMFeed dataset and the system's code is available at https://github.com/MIntelligence-Group/CMFeed.
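To make the described pipeline concrete, below is a minimal, hypothetical PyTorch sketch of how transformer-encoded text features, pre-extracted Faster R-CNN region features, and a sentiment control signal could be fused and decoded into feedback tokens. All names, dimensions, and hyperparameters here (e.g. `ControllableFeedbackModel`, `D_MODEL`, `NUM_SENTIMENTS`) are illustrative assumptions, not the authors' implementation; the actual model and training code are in the linked repository.

```python
# Hypothetical sketch (not the authors' implementation) of fusing transformer
# text features, Faster R-CNN region features, and a sentiment control signal
# to generate feedback tokens.

import torch
import torch.nn as nn

VOCAB_SIZE = 8000       # illustrative vocabulary size
D_MODEL = 256           # shared embedding / feature dimension
REGION_FEAT_DIM = 1024  # dimension of pre-extracted Faster R-CNN region features
NUM_SENTIMENTS = 3      # e.g. negative / neutral / positive control signal


class ControllableFeedbackModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Textual branch: token embeddings + transformer encoder.
        self.token_emb = nn.Embedding(VOCAB_SIZE, D_MODEL)
        enc_layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Visual branch: project Faster R-CNN region features into the shared space.
        self.region_proj = nn.Linear(REGION_FEAT_DIM, D_MODEL)
        # Controllability block: one learned embedding per target sentiment.
        self.sentiment_emb = nn.Embedding(NUM_SENTIMENTS, D_MODEL)
        # Decoder attends over the fused text + image + sentiment memory.
        dec_layer = nn.TransformerDecoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.out_proj = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, text_ids, region_feats, sentiment_id, target_ids):
        # text_ids:     (B, T_text)               token ids of the post text
        # region_feats: (B, R, REGION_FEAT_DIM)   pre-extracted region features
        # sentiment_id: (B,)                      desired sentiment of the feedback
        # target_ids:   (B, T_out)                feedback tokens (teacher forcing)
        text_mem = self.text_encoder(self.token_emb(text_ids))   # (B, T_text, D)
        img_mem = self.region_proj(region_feats)                 # (B, R, D)
        ctrl = self.sentiment_emb(sentiment_id).unsqueeze(1)     # (B, 1, D)
        memory = torch.cat([text_mem, img_mem, ctrl], dim=1)     # fused memory
        t = target_ids.size(1)
        causal_mask = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        hidden = self.decoder(self.token_emb(target_ids), memory, tgt_mask=causal_mask)
        return self.out_proj(hidden)                             # (B, T_out, VOCAB_SIZE)


if __name__ == "__main__":
    model = ControllableFeedbackModel()
    logits = model(
        text_ids=torch.randint(0, VOCAB_SIZE, (2, 20)),
        region_feats=torch.randn(2, 10, REGION_FEAT_DIM),
        sentiment_id=torch.tensor([0, 2]),   # request different sentiments per sample
        target_ids=torch.randint(0, VOCAB_SIZE, (2, 15)),
    )
    print(logits.shape)  # torch.Size([2, 15, 8000])
```

Appending the sentiment embedding as an extra memory token is only one of several ways to condition the decoder; the CMFeed system's controllability block may differ from this sketch.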
- P. Kumar et al., “Affective Feedback Synthesis Towards Multimodal Text and Image Data,” ACM Transactions on Multimedia Computing, Communications and Applications, vol. 19, no. 6, pp. 1–23, 2023.
- M. R. Makiuchi, K. Uto et al., “Multimodal Emotion Recognition with High-level Speech and Text Features,” in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2021.
- F. R. Gallo, G. I. Simari et al., “Predicting User Reactions to Twitter Feed Content based on Personality Type and Social Cues,” Future Generation Computer Systems, vol. 110, pp. 918–930, 2020.
- M. Muszynski et al., “Recognizing Induced Emotions of Movie Audiences from Multimodal Information,” IEEE Transactions on Affective Computing, vol. 12, no. 1, pp. 36–52, 2019.
- P. Blikstein and M. Worsley, “Multimodal Learning analytics and Education Data Mining: Using Computational Technologies to Measure Complex Learning Tasks,” Journal of Learning Analytics, vol. 3, no. 2, pp. 220–238, 2016.
- A. Vaswani, N. Shazeer et al., “Attention Is All You Need,” in Advances in Neural Information Processing Systems (NeurIPS), 2017, pp. 5998–6008.
- S. Ren, K. He et al., “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” Advances in Neural Information Processing Systems (NeurIPS), vol. 28, pp. 91–99, 2015.
- S. Gao, X. Chen et al., “From Standard Summarization to New Tasks and Beyond: Summarization With Manifold Information,” in The 29th International Joint Conference on Artificial Intelligence (IJCAI), 2020.
- X. Gu, K. Cho et al., “DialogWAE: Multimodal Response Generation With Conditional Wasserstein Auto-Encoder,” in The International Conference on Learning Representations (ICLR), 2019.
- S. Antol, A. Agrawal et al., “VQA: Visual Question Answering,” in The IEEE International Conference on Computer Vision (ICCV), 2015, pp. 2425–2433.
- A. Das, S. Kottur et al., “Learning Cooperative Visual Dialog Agents With Deep Reinforcement Learning,” in The IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2951–2960.
- H. Zhou, M. Huang et al., “Emotional Chatting Machine: Emotional Conversation Generation With Internal And External Memory,” in AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.
- Y. Wu, F. Wei et al., “Response Generation by Context-Aware Prototype Editing,” in The 33rd AAAI Conference on Artificial Intelligence (AAAI), vol. 33, 2019, pp. 7281–7288.
- A. Radford, K. Narasimhan et al., “Improving Language Understanding by Generative Pre-training,” OpenAI, 2018.
- Y. Zhu, W. Zhao et al., “Topic-Aware Video Summarization Using Multimodal Transformer,” Pattern Recognition, vol. 140, p. 109578, 2023.
- J. Chen and H. Zhuge, “Abstractive Text-Image Summarization Using Multimodal Attention Hierarchical RNN,” in The Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018, pp. 4046–4056.
- J. Zhu, H. Li et al., “MSMO: Multimodal Summarization With Multimodal Output,” in The Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018, pp. 4154–4164.
- M. Li, X. Chen et al., “VMSMO: Learning to Generate Multimodal Summary for Video-based News Articles,” in The Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 9360–9369.
- X. Shang, Z. Yuan et al., “Multimodal Video Summarization Via Time-Aware Transformers,” in 29th ACM International Conference on Multimedia, 2021, pp. 1756–1765.
- B. Zhao, M. Gong, and X. Li, “Audio-Visual Video Summarization,” IEEE Transactions on Neural Networks and Learning Systems, 2021.
- J. Xie, X. Chen et al., “Multimodal-Based And Aesthetic-Guided Narrative Video Summarization,” IEEE Transactions on Multimedia, 2022.
- M. Page Fortin and B. Chaib-draa, “Multimodal Multitask Emotion Recognition Using Images, Texts and Tags,” in The ACM International Conference on Multimedia Retrieval (ICMR), 2019, pp. 3–10.
- J. Zhu, Y. Zhou et al., “Multimodal Summarization With Guidance of Multimodal Reference,” in The 34th AAAI Conference on Artificial Intelligence (AAAI), vol. 34, no. 05, 2020, pp. 9749–9756.
- F. Chen, J. Xie et al., “Graph Convolutional Network For Difficulty-Controllable Visual Question Generation,” World Wide Web, pp. 1–23, 2023.
- P. Wang, Q. Wu et al., “FVQA: Fact-Based Visual Question Answering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 10, pp. 2413–2427, 2017.
- P. Cascante-Bonilla, H. Wu et al., “SimVQA: Exploring Simulated Environments for Visual Question Answering,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 5056–5066.
- Z.-X. Jin, H. Wu et al., “RUArt: A Novel Text-Centered Solution for Text-Based Visual Question Answering,” IEEE Transactions on Multimedia, 2021.
- Z. Guo, J. Zhao et al., “A Universal Quaternion Hypergraph For Multimodal VQA,” IEEE Transactions on Multimedia, 2021.
- J. Wu, T. Mu et al., “Memory-Aware Attentive Control For Community Question Answering With Knowledge-Based Dual Refinement,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2023.
- J. Lehmann et al., “Language Models As Controlled Natural Language Semantic Parsers For Knowledge Graph Question Answering,” in European Conference on Artificial Intelligence (ECAI), vol. 372. IOS Press, 2023, pp. 1348–1356.
- G.-C. Kang, J. Lim et al., “Dual Attention Networks for Visual Reference Resolution in Visual Dialog,” in The Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019, pp. 2024–2033.
- X. Jiang et al., “DualVD: An Adaptive Dual Encoding Model for Deep Visual Understanding in Visual Dialogue,” in The 34th AAAI Conference on Artificial Intelligence (AAAI), vol. 34, no. 07, 2020, pp. 11125–11132.
- X. Xu, O. Dušek et al., “Better Conversations By Modeling, Filtering, And Optimizing For Coherence And Diversity,” arXiv preprint arXiv:1809.06873, 2018, Accessed 2023-01-31.
- T. Zhao et al., “Learning Discourse-Level Diversity For Neural Dialog Models Using Conditional Variational Autoencoders,” in The 55th Annual Meeting of the Association for Computational Linguistics (ACL), 2017, pp. 654–664.
- F. Chen, X. Chen et al., “Improving Cross-Modal Understanding In Visual Dialog Via Contrastive Learning,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 7937–7941.
- Z. Wang, J. Wang et al., “Unified Multimodal Model With Unlikelihood Training For Visual Dialog,” in The 30th ACM International Conference on Multimedia, 2022, pp. 4625–4634.
- G.-C. Kang, S. Kim et al., “The Dialog Must Go On: Improving Visual Dialog via Generative Self-Training,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6746–6756.
- A.-A. Liu, G. Zhang et al., “Closed-Loop Reasoning With Graph-Aware Dense Interaction For Visual Dialog,” Multimedia Systems, vol. 28, no. 5, pp. 1823–1832, 2022.
- A.-A. Liu, C. Huang et al., “Counterfactual Visual Dialog: Robust Commonsense Knowledge Learning From Unbiased Training,” IEEE Transactions on Multimedia, 2023.
- W. Shi and Z. Yu, “Sentiment Adaptive End-to-End Dialog Systems,” in 56th Annual Meeting of the Association for Computational Linguistics (ACL), 2018, pp. 1509–1519.
- X. Kong, B. Li et al., “An Adversarial Approach To Sentiment-Controlled Neural Dialogue Generation,” arXiv preprint arXiv:1901.07129, 2019, Accessed 2023-01-31.
- M. Firdaus, H. Chauhan et al., “EmoSen: Generating Sentiment And Emotion Controlled Responses In A Multimodal Dialogue System,” IEEE Transactions on Affective Computing, vol. 13, no. 3, pp. 1555–1566, 2020.
- M. Firdaus, U. Jain et al., “SEPRG: Sentiment Aware Emotion Controlled Personalized Response Generation,” in 14th International Conference on Natural Language Generation, 2021, pp. 353–363.
- J. Hu, Y. Huang et al., “The Acoustically Emotion-Aware Conversational Agent With Speech Emotion And Empathetic Responses,” IEEE Transactions on Affective Computing, vol. 14, no. 1, pp. 17–30, 2022.
- T. Saha, S. Saha et al., “Towards Sentiment-Aware Multi-Modal Dialogue Policy Learning,” Cognitive Computation, pp. 1–15, 2022.
- K. Wang and X. Wan, “SentiGAN: Generating Sentimental Texts Via Mixture Adversarial Networks,” in International Joint Conference on Artificial Intelligence (IJCAI), 2018, pp. 4446–4452.
- C. Huang, O. R. Zaiane et al., “Automatic Dialogue Generation With Expressed Emotions,” in The Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2018, pp. 49–54.
- P. Zhong et al., “An Affect-Rich Neural Conversational Model With Biased Attention And Weighted Cross-Entropy Loss,” in AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 7492–7500.
- A. Rosebrock, “Intersection Over Union (IoU) for Object Detection,” PyImageSearch.com, 2016, Accessed 2023-01-31.
- N. Reimers and I. Gurevych, “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks,” in Conference on Empirical Methods in Natural Language Processing and International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 3982–3992.
- P. Kumar, S. Malik et al., “Hybrid Fusion Based Interpretable Multimodal Emotion Recognition With Insufficient Labelled Data,” arXiv preprint arXiv:2208.11450, 2022, Accessed 2023-01-31.
- L. Shapley, “A Value for n-Person Games,” in Contributions to the Theory of Games II, 1953.
- A. Akbik, T. Bergmann et al., “FLAIR: An Easy-to-Use Framework For State-Of-The-Art NLP,” in The Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2019, pp. 54–59.
- T. Rinker. (2017) Sentimentr Package for R Language. https://github.com/trinker/sentimentr. Accessed 2023-01-31.
- V. Sanh, L. Debut et al., “DistilBERT, A Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter,” arXiv preprint arXiv:1910.01108, 2019, Accessed 2023-01-31.
- Y. Liu, M. Ott et al., “RoBERTa: A Robustly Optimized BERT Pretraining Approach,” arXiv preprint arXiv:1907.11692, 2019, Accessed 2023-01-31.
- K. Papineni, S. Roukos et al., “BLEU: A Method for Automatic Evaluation of Machine Translation,” in The 40th Annual Meeting of the Association for Computational Linguistics (ACL), 2002, pp. 311–318.
- R. Vedantam, C. Lawrence Zitnick et al., “CIDEr: Consensus-based Image Description Evaluation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 4566–4575.
- C.-Y. Lin, “ROUGE: A Package for Automatic Evaluation of Summaries,” in Text Summarization Branches Out, 2004, pp. 74–81.
- P. Anderson, B. Fernando et al., “SPICE: Semantic Propositional Image Caption Evaluation,” in The European Conference on Computer Vision (ECCV), 2016, pp. 382–398.
- A. Lavie and M. J. Denkowski, “The METEOR Metric for Automatic Evaluation of Machine Translation,” Machine Translation, vol. 23, no. 2–3, pp. 105–115, 2009.
- P. Runeson, M. Alexandersson, and O. Nyholm, “Detection of Duplicate Defect Reports using Natural Language Processing,” in IEEE International Conference on Software Engineering, 2007, pp. 499–510.
- N. Craswell, “Mean Reciprocal Rank,” in Encyclopedia of Database Systems, Springer, 2009, p. 1703.
- OpenAI. (2019) GPT2. huggingface.co/docs/transformers/model_doc/gpt2. Accessed 2023-01-31.
- R. Deng, C. Shen et al., “Learning To Predict Crisp Boundaries,” in The European Conference on Computer Vision (ECCV), 2018, pp. 562–578.