RNG: Reducing Multi-level Noise and Multi-grained Semantic Gap for Joint Multimodal Aspect-Sentiment Analysis (2405.13059v1)
Abstract: As an important multimodal sentiment analysis task, Joint Multimodal Aspect-Sentiment Analysis (JMASA), which aims to jointly extract aspect terms and their associated sentiment polarities from given text-image pairs, has attracted increasing attention. Existing works face two limitations: (1) multi-level modality noise, i.e., instance- and feature-level noise; and (2) multi-grained semantic gap, i.e., coarse- and fine-grained gaps. Both issues can interfere with the accurate identification of aspect-sentiment pairs. To address these limitations, we propose a novel framework named RNG for JMASA. Specifically, to simultaneously reduce multi-level modality noise and the multi-grained semantic gap, we design three constraints: (1) Global Relevance Constraint (GR-Con), based on text-image similarity, for instance-level noise reduction; (2) Information Bottleneck Constraint (IB-Con), based on the Information Bottleneck (IB) principle, for feature-level noise reduction; and (3) Semantic Consistency Constraint (SC-Con), based on mutual information maximization via contrastive learning, for multi-grained semantic gap reduction. Extensive experiments on two datasets validate the new state-of-the-art performance of RNG.
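The three constraints can be read as three auxiliary training signals attached to the main extraction objective. The sketch below is a minimal, hedged PyTorch illustration of one plausible formulation, assuming RoBERTa-style text features and ViT-style image features; the function names (`gr_con`, `ib_con`, `sc_con`), the relevance-gate reading of GR-Con, the variational KL form of IB-Con, the symmetric InfoNCE form of SC-Con, and the weights `beta` and `gamma` are all assumptions for illustration, not the authors' released implementation.

```python
# Hedged sketch of the three constraints described in the abstract.
# All tensor shapes, names, and loss weights are illustrative assumptions.
import torch
import torch.nn.functional as F

def gr_con(text_cls, img_cls):
    """Global Relevance Constraint (assumed form): gate the image modality
    by the overall text-image similarity to suppress instance-level noise."""
    rel = F.cosine_similarity(text_cls, img_cls, dim=-1)   # (B,)
    return rel.clamp(min=0.0).unsqueeze(-1)                 # relevance gate in [0, 1]

def ib_con(mu, logvar):
    """Information Bottleneck Constraint (assumed variational form):
    KL(q(z|x) || N(0, I)) regularizer for feature-level noise reduction."""
    return -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

def sc_con(text_feat, img_feat, tau=0.07):
    """Semantic Consistency Constraint (assumed InfoNCE form): maximize
    mutual information between paired text and image representations."""
    t = F.normalize(text_feat, dim=-1)
    v = F.normalize(img_feat, dim=-1)
    logits = t @ v.t() / tau                                 # (B, B) similarity matrix
    labels = torch.arange(t.size(0), device=t.device)        # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

# Combined objective (weights are hypothetical):
# loss = task_loss + beta * ib_con(mu, logvar) + gamma * sc_con(text_feat, img_feat)
```

Under these assumptions, GR-Con acts before fusion (scaling how much the image contributes per instance), while IB-Con and SC-Con act as regularizers added to the sequence-labeling loss during training.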