Matching Images and Text with Multi-modal Tensor Fusion and Re-ranking (1908.04011v2)

Published 12 Aug 2019 in cs.CV

Abstract: A major challenge in matching images and text is that they have intrinsically different data distributions and feature representations. Most existing approaches are based either on embedding or classification, the first one mapping image and text instances into a common embedding space for distance measuring, and the second one regarding image-text matching as a binary classification problem. Neither of these approaches can, however, balance the matching accuracy and model complexity well. We propose a novel framework that achieves remarkable matching performance with acceptable model complexity. Specifically, in the training stage, we propose a novel Multi-modal Tensor Fusion Network (MTFN) to explicitly learn an accurate image-text similarity function with rank-based tensor fusion rather than seeking a common embedding space for each image-text instance. Then, during testing, we deploy a generic Cross-modal Re-ranking (RR) scheme for refinement without requiring additional training procedure. Extensive experiments on two datasets demonstrate that our MTFN-RR consistently achieves the state-of-the-art matching performance with much less time complexity. The implementation code is available at https://github.com/Wangt-CN/MTFN-RR-PyTorch-Code.

Matching Images and Text with Multi-modal Tensor Fusion and Re-ranking

The challenge of effectively matching images and text arises from their intrinsically different data distributions and feature representations. This paper introduces a novel framework that addresses the limitations of previous approaches, primarily embedding and classification-based methods, which often fail to balance matching accuracy with model complexity. The proposed framework innovatively leverages Multi-modal Tensor Fusion and a Cross-modal Re-ranking scheme to achieve state-of-the-art matching performance with reduced time complexity.

Approach and Methodology

The paper introduces the Multi-modal Tensor Fusion Network (MTFN) to learn an accurate image-text similarity function. Rather than relying on a shared embedding space, the MTFN employs rank-based tensor fusion to directly measure similarity during the training phase. This approach bypasses the significant computational overhead associated with constructing high-dimensional common spaces typical of embedding-based methods.
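
To make the idea of rank-based tensor fusion concrete, the sketch below shows one common low-rank bilinear formulation in PyTorch: each modality is projected by a small set of rank-1 factors, fused by element-wise products, and mapped to a scalar similarity. The dimensions, the number of factors, and the class name `RankedTensorFusionScorer` are illustrative assumptions, not the paper's exact architecture; see the released MTFN-RR code for the actual layers.

```python
import torch
import torch.nn as nn

class RankedTensorFusionScorer(nn.Module):
    """Illustrative low-rank tensor-fusion similarity head (not the exact MTFN)."""

    def __init__(self, img_dim=2048, txt_dim=1024, joint_dim=512, rank=5):
        super().__init__()
        # Project each modality into `rank` low-rank factors of the joint space.
        self.img_factors = nn.ModuleList(
            [nn.Linear(img_dim, joint_dim) for _ in range(rank)]
        )
        self.txt_factors = nn.ModuleList(
            [nn.Linear(txt_dim, joint_dim) for _ in range(rank)]
        )
        # Map the fused representation to a scalar similarity score in [0, 1].
        self.score = nn.Sequential(nn.Linear(joint_dim, 1), nn.Sigmoid())

    def forward(self, img_feat, txt_feat):
        # Sum of rank-1 bilinear interactions: element-wise products of factor projections.
        fused = sum(fi(img_feat) * ft(txt_feat)
                    for fi, ft in zip(self.img_factors, self.txt_factors))
        return self.score(fused).squeeze(-1)  # (batch,) image-text similarity
```

Because the network outputs a similarity score directly, it can be trained with a ranking loss over matched and mismatched image-text pairs, without ever embedding both modalities into a single shared space.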

The testing phase incorporates a Cross-modal Re-ranking (RR) mechanism that refines initial retrieval results without additional training. This process significantly enhances the retrieval accuracy by considering both image-to-text (I2T) and text-to-image (T2I) retrieval tasks together, thus mitigating the discrepancy between training and testing procedures.
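
As one way to picture this step, the sketch below refines an image-to-text ranking by also checking how highly the query image ranks when each candidate caption is used as a text-to-image query, then fusing the two rank lists. The function name, the top-k restriction, and the weight `alpha` are illustrative assumptions rather than the paper's exact RR formulation.

```python
import numpy as np

def rerank_i2t(sim, k=10, alpha=0.5):
    """Illustrative cross-modal re-ranking for image-to-text retrieval.

    sim: (num_images, num_texts) similarity matrix from the trained model.
    For each image query, the top-k texts from the initial I2T ranking are
    re-scored by the query image's rank when each candidate text is used as
    a T2I query. `k` and `alpha` are example hyper-parameters.
    """
    num_images, _ = sim.shape
    # t2i_rank[i, t] = rank of image i among all images for text query t (0 = best).
    t2i_rank = np.argsort(np.argsort(-sim, axis=0), axis=0)

    reranked = []
    for i in range(num_images):
        top_k = np.argsort(-sim[i])[:k]              # initial I2T candidates
        i2t_rank = np.arange(len(top_k))             # their initial ranks 0..k-1
        reciprocal = t2i_rank[i, top_k]              # T2I rank of image i per candidate
        combined = alpha * i2t_rank + (1 - alpha) * reciprocal
        reranked.append(top_k[np.argsort(combined)]) # refined order of the top-k texts
    return reranked
```

Since the re-ranking operates only on the similarity matrix produced at test time, it adds no training cost and can be applied symmetrically to T2I retrieval.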

Results and Evaluation

The framework was evaluated on two standard datasets: Flickr30k and MSCOCO. The experiments show that MTFN-RR consistently achieves state-of-the-art performance with significantly lower computational cost than many existing models. Notably:

  • On the Flickr30k dataset, MTFN-RR yields notable gains, particularly in R@1, with text-to-image (T2I) retrieval clearly surpassing prior methods.
  • On the MSCOCO dataset, both 1k and 5k test sets showed superior performance of MTFN-RR compared to recent models like SCAN and VSE++.

The effectiveness of the MTFN-RR framework is largely attributed to its innovative tensor fusion mechanism and the cross-modal re-ranking strategy, which collectively enhance the retrieval capabilities by leveraging the strengths of both embedding-based and classification-based methods.

Practical and Theoretical Implications

Practically, the MTFN-RR framework offers a robust solution for applications requiring efficient and accurate image-text retrieval, such as digital libraries and online content recommendation systems. The reduction in computational resources and time compared to existing methods makes it a valuable approach for large-scale or real-time systems.

Theoretically, this paper contributes to the understanding of multi-modal data fusion, emphasizing the importance of tensor interactions and the potential of model refinement through cross-modal perspectives. Future research could explore extending this framework to other multi-modal applications or integrating it with more advanced machine learning architectures to enhance flexibility and scalability.

Future Directions

Further developments could explore optimizing the tensor fusion network to handle even more complex data representations and interactions. Additionally, the cross-modal re-ranking approach could be adapted for unsupervised or semi-supervised settings, expanding its applicability to scenarios with limited labeled data. Moreover, integrating advances in deep learning, such as transformers, could potentially improve the model's ability to capture more nuanced semantic relationships between images and text.

Authors (6)
  1. Tan Wang (18 papers)
  2. Xing Xu (48 papers)
  3. Yang Yang (884 papers)
  4. Alan Hanjalic (28 papers)
  5. Heng Tao Shen (117 papers)
  6. Jingkuan Song (115 papers)
Citations (136)