Similarity Guided Multimodal Fusion Transformer for Semantic Location Prediction in Social Media (2405.05760v2)
Abstract: Semantic location prediction aims to derive meaningful location insights from multimodal social media posts, offering a more contextual understanding of daily activities than using GPS coordinates. This task faces significant challenges due to the noise and modality heterogeneity in "text-image" posts. Existing methods are generally constrained by inadequate feature representations and modal interaction, struggling to effectively reduce noise and modality heterogeneity. To address these challenges, we propose a Similarity-Guided Multimodal Fusion Transformer (SG-MFT) for predicting the semantic locations of users from their multimodal posts. First, we incorporate high-quality text and image representations by utilizing a pre-trained large vision-LLM. Then, we devise a Similarity-Guided Interaction Module (SIM) to alleviate modality heterogeneity and noise interference by incorporating both coarse-grained and fine-grained similarity guidance for improving modality interactions. Specifically, we propose a novel similarity-aware feature interpolation attention mechanism at the coarse-grained level, leveraging modality-wise similarity to mitigate heterogeneity and reduce noise within each modality. At the fine-grained level, we utilize a similarity-aware feed-forward block and element-wise similarity to further address the issue of modality heterogeneity. Finally, building upon pre-processed features with minimal noise and modal interference, we devise a Similarity-aware Fusion Module (SFM) to fuse two modalities with a cross-attention mechanism. Comprehensive experimental results clearly demonstrate the superior performance of our proposed method.
- Beit: BERT pre-training of image transformers. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.
- Hybrid transformer with multi-level fusion for multimodal knowledge graph completion. In SIGIR ’22: The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, July 11 - 15, 2022, pages 904–915. ACM, 2022.
- UNITER: universal image-text representation learning. In Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XXX, pages 104–120. Springer, 2020.
- ELECTRA: pre-training text encoders as discriminators rather than generators. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020.
- BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics, 2019.
- An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
- An empirical study of training end-to-end vision-and-language transformers. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 18145–18155. IEEE, 2022.
- Balanced multimodal learning: An integrated framework for multi-task learning in audio-visual fusion, 2024.
- Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 5484–5495. Association for Computational Linguistics, 2021.
- Deberta: decoding-enhanced bert with disentangled attention. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
- Decoupling the role of data, attention, and losses in multimodal transformers. Trans. Assoc. Comput. Linguistics, 9:570–585, 2021.
- Pixel-bert: Aligning image pixels with text by deep multi-modal transformers. CoRR, abs/2004.00849, 2020.
- Vilt: Vision-and-language transformer without convolution or region supervision. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, pages 5583–5594. PMLR, 2021.
- ALBERT: A lite BERT for self-supervised learning of language representations. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020.
- mplug: Effective and efficient vision-language learning by cross-modal skip-connections. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pages 7241–7259. Association for Computational Linguistics, 2022.
- Align before fuse: Vision and language representation learning with momentum distillation. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 9694–9705, 2021.
- Visualbert: A simple and performant baseline for vision and language. CoRR, abs/1908.03557, 2019.
- A transformer-based framework for poi-level social post geolocation. In European Conference on Information Retrieval, pages 588–604. Springer, 2023.
- Oscar: Object-semantics aligned pre-training for vision-language tasks. In Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XXX, pages 121–137. Springer, 2020.
- Roberta: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692, 2019.
- Swin transformer: Hierarchical vision transformer using shifted windows. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pages 9992–10002. IEEE, 2021.
- Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 13–23, 2019.
- A deep multi-modal fusion approach for semantic place prediction in social media. In Proceedings of the Workshop on Multimodal Understanding of Social, Affective and Subjective Attributes, MUSA2@MM 2017, Mountain View, CA, USA, October 27, 2017, pages 31–37. ACM, 2017.
- Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 3111–3119, 2013.
- Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1532–1543. ACL, 2014.
- Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, pages 8748–8763. PMLR, 2021.
- Faster R-CNN: towards real-time object detection with region proposal networks. CoRR, abs/1506.01497, 2015.
- How much can CLIP benefit vision-and-language tasks? In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.
- VL-BERT: pre-training of generic visual-linguistic representations. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020.
- LXMERT: learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 5099–5110. Association for Computational Linguistics, 2019.
- Training data-efficient image transformers & distillation through attention. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, pages 10347–10357. PMLR, 2021a.
- Going deeper with image transformers. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pages 32–42. IEEE, 2021b.
- Semantic place prediction with user attribute in social media. IEEE Multim., pages 29–37, 2021a.
- Semantic place prediction with user attribute in social media. IEEE MultiMedia, 28(4):29–37, 2021b.
- Bridgetower: Building bridges between encoders in vision-language representation learning. In Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI 2023, Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence, IAAI 2023, Thirteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2023, Washington, DC, USA, February 7-14, 2023, pages 10637–10647. AAAI Press, 2023.
- Probing inter-modality: Visual parsing with self-attention for vision-and-language pre-training. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 4514–4528, 2021.
- Chinese clip: Contrastive vision-language pretraining in chinese. arXiv preprint arXiv:2211.01335, 2022.
- VOLO: vision outlooker for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell., 45(5):6575–6586, 2023.
- Unis-mmc: Multimodal classification via unimodality-supervised multimodal contrastive learning. In Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, pages 659–672. Association for Computational Linguistics, 2023.
Collections
Sign up for free to add this paper to one or more collections.