Similarity Guided Multimodal Fusion Transformer for Semantic Location Prediction in Social Media (2405.05760v2)

Published 9 May 2024 in cs.CV and cs.CL

Abstract: Semantic location prediction aims to derive meaningful location insights from multimodal social media posts, offering a more contextual understanding of daily activities than using GPS coordinates. This task faces significant challenges due to the noise and modality heterogeneity in "text-image" posts. Existing methods are generally constrained by inadequate feature representations and modal interaction, struggling to effectively reduce noise and modality heterogeneity. To address these challenges, we propose a Similarity-Guided Multimodal Fusion Transformer (SG-MFT) for predicting the semantic locations of users from their multimodal posts. First, we incorporate high-quality text and image representations by utilizing a pre-trained large vision-language model. Then, we devise a Similarity-Guided Interaction Module (SIM) to alleviate modality heterogeneity and noise interference by incorporating both coarse-grained and fine-grained similarity guidance for improving modality interactions. Specifically, we propose a novel similarity-aware feature interpolation attention mechanism at the coarse-grained level, leveraging modality-wise similarity to mitigate heterogeneity and reduce noise within each modality. At the fine-grained level, we utilize a similarity-aware feed-forward block and element-wise similarity to further address the issue of modality heterogeneity. Finally, building upon pre-processed features with minimal noise and modal interference, we devise a Similarity-aware Fusion Module (SFM) to fuse two modalities with a cross-attention mechanism. Comprehensive experimental results clearly demonstrate the superior performance of our proposed method.
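
The abstract does not give formulas for the similarity guidance, so the following is only a rough illustrative sketch, not the authors' SIM or SFM. It assumes that "modality-wise similarity" can be approximated by the cosine similarity of mean-pooled text and image features, and that "feature interpolation" means blending cross-attended features with the original ones using that similarity as the weight. All function names are hypothetical, and the attention has no learned projections.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, context):
    # Single-head scaled dot-product attention (no learned weights):
    # each query token becomes a weighted average of context tokens.
    d = query.shape[-1]
    weights = softmax(query @ context.T / np.sqrt(d), axis=-1)
    return weights @ context

def modality_similarity(text, image):
    # ASSUMPTION: coarse, modality-wise similarity is the cosine similarity
    # of mean-pooled features, rescaled from [-1, 1] to [0, 1].
    t = text.mean(axis=0)
    v = image.mean(axis=0)
    cos = t @ v / (np.linalg.norm(t) * np.linalg.norm(v) + 1e-8)
    return 0.5 * (cos + 1.0)

def similarity_guided_interpolation(text, image):
    # ASSUMPTION: when the modalities agree (s near 1), rely on the
    # image-attended features; when they disagree (s near 0), keep the
    # original text features so a noisy image contributes less.
    s = modality_similarity(text, image)
    attended = cross_attention(text, image)
    return s * attended + (1.0 - s) * text, s

rng = np.random.default_rng(0)
text_tokens = rng.normal(size=(5, 16))    # 5 text tokens, 16-dim features
image_patches = rng.normal(size=(9, 16))  # 9 image patches, 16-dim features
fused, s = similarity_guided_interpolation(text_tokens, image_patches)
```

Under these assumptions, `fused` keeps the shape of the text features, and the scalar `s` controls how much image information is mixed in. The paper's actual module additionally uses fine-grained, element-wise similarity and a similarity-aware feed-forward block, which this sketch does not attempt to reproduce.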

References (39)

Authors (4)