DWE+: Dual-Way Matching Enhanced Framework for Multimodal Entity Linking (2404.04818v1)
Abstract: Multimodal entity linking (MEL) aims to use multimodal information (usually textual and visual) to link ambiguous mentions to unambiguous entities in a knowledge base. Current methods face three main issues: (1) treating the entire image as input may introduce redundant information; (2) entity-related information, such as attributes in images, is insufficiently utilized; and (3) the entity in the knowledge base may be semantically inconsistent with its representation. To address these issues, we propose DWE+ for multimodal entity linking. DWE+ can capture finer-grained semantics and dynamically maintain semantic consistency with entities. This is achieved in three ways: (a) we introduce a method for extracting fine-grained image features by partitioning the image into multiple local objects, and then use hierarchical contrastive learning to further align semantics between coarse-grained information (text and image) and fine-grained information (mention and visual objects); (b) we explore ways to extract visual attributes from images, such as facial features and identity, to enhance the fused features; (c) we leverage Wikipedia and ChatGPT to capture the entity representation, achieving semantic enrichment from both static and dynamic perspectives, which better reflects real-world entity semantics. Experiments on the Wikimel, Richpedia, and Wikidiverse datasets demonstrate the effectiveness of DWE+ in improving MEL performance. Specifically, we optimize these datasets and achieve state-of-the-art performance on the enhanced datasets. The code and enhanced datasets are released at https://github.com/season1blue/DWET
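To make the idea of hierarchical contrastive learning in point (a) more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not give the loss, so this assumes a generic symmetric InfoNCE objective applied once at the coarse level (text and image) and once at the fine level (mention and visual objects). The function names, the `temperature` value, the `weight` parameter, and the assumption that each mention is paired with a single pooled object feature are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(queries, keys, temperature=0.07):
    """Mean InfoNCE loss: queries[i] is the positive for keys[i],
    and every other key in the batch acts as a negative."""
    total = 0.0
    for i, query in enumerate(queries):
        logits = [cosine(query, key) / temperature for key in keys]
        # Numerically stable log-sum-exp over all keys.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


def hierarchical_contrastive_loss(text, image, mention, objects, weight=0.5):
    """Combine a coarse-level loss (text vs. image) with a fine-level loss
    (mention vs. visual objects). `weight` is a hypothetical mixing factor."""
    coarse = 0.5 * (info_nce(text, image) + info_nce(image, text))
    fine = 0.5 * (info_nce(mention, objects) + info_nce(objects, mention))
    return weight * coarse + (1.0 - weight) * fine


# Toy batch of two samples with 2-dimensional embeddings (illustrative values only).
text_emb = [[1.0, 0.0], [0.0, 1.0]]
image_emb = [[0.9, 0.1], [0.1, 0.9]]
mention_emb = [[1.0, 0.0], [0.0, 1.0]]
object_emb = [[0.8, 0.2], [0.2, 0.8]]

aligned = hierarchical_contrastive_loss(text_emb, image_emb, mention_emb, object_emb)
# Reversing the second modality breaks the pairing, so the loss should grow.
mismatched = hierarchical_contrastive_loss(
    text_emb, image_emb[::-1], mention_emb, object_emb[::-1]
)
```

With correctly paired embeddings the loss is close to zero, while breaking the pairing raises it, which is the behaviour a contrastive alignment objective is meant to have. In practice the embeddings would come from trained text and image encoders rather than hand-written lists.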
Authors:
- Shezheng Song
- Shasha Li
- Shan Zhao
- Xiaopeng Li
- Chengyu Wang
- Jie Yu
- Jun Ma
- Tianwei Yan
- Bin Ji
- Xiaoguang Mao