- The paper proposes a self-supervised adversarial hashing method that bridges the modality gap by enforcing semantic consistency across image and text data.
- It employs dual adversarial networks and a self-supervised semantic module to enhance feature learning in both semantic and Hamming spaces.
- Experimental results on MIRFLICKR-25K, NUS-WIDE, and MS COCO benchmarks show up to a 10% increase in MAP, outperforming state-of-the-art models.
Self-Supervised Adversarial Hashing Networks for Cross-Modal Retrieval
The paper presents Self-Supervised Adversarial Hashing (SSAH), a method for cross-modal retrieval. The framework addresses the modality gap by incorporating adversarial learning into hashing, guided by self-supervised mechanisms.
The authors employ two adversarial networks to strengthen semantic correlation between modalities, specifically images and text. By enforcing semantic consistency through adversarial learning, the networks improve retrieval accuracy. In addition, a self-supervised semantic network exploits high-level semantic information, particularly multi-label annotations, to guide feature learning in both the common semantic space and the Hamming space.
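The adversarial idea can be illustrated with a minimal sketch. This is not the authors' architecture: the linear discriminator, feature dimensions, and names below are hypothetical, chosen only to show the two opposing objectives (the discriminator tries to tell image features from text features; the feature networks try to make the two modalities indistinguishable).

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical feature batches from the image and text networks.
img_feat = rng.normal(size=(8, 16))
txt_feat = rng.normal(size=(8, 16))

# A toy linear modality discriminator: predicts 1 for "image", 0 for "text".
w = rng.normal(size=16) * 0.1

def disc_loss(w, img, txt):
    """Binary cross-entropy the discriminator minimizes."""
    p_img = sigmoid(img @ w)   # pushed toward 1
    p_txt = sigmoid(txt @ w)   # pushed toward 0
    eps = 1e-9
    return -(np.log(p_img + eps).mean() + np.log(1 - p_txt + eps).mean())

def gen_loss(w, img, txt):
    """The feature networks are trained adversarially: they maximize the
    discriminator's loss, driving the two modalities toward a common,
    modality-indistinguishable representation."""
    return -disc_loss(w, img, txt)
```

In a full implementation both objectives would be optimized alternately with gradient updates; here only the loss structure is shown.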
The robustness of SSAH is demonstrated through comprehensive experiments on three benchmark datasets: MIRFLICKR-25K, NUS-WIDE, and MS COCO. The results indicate a significant performance gain over existing state-of-the-art methods, particularly on MIRFLICKR-25K, with notable improvements in mean average precision (MAP) for both image-to-text (I → T) and text-to-image (T → I) retrieval.
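MAP, the metric reported throughout, is worth making concrete. The sketch below is a generic, self-contained definition of average precision over ranked retrieval results; function names are illustrative and not taken from the paper.

```python
def average_precision(relevant, ranking):
    """AP for one query: `ranking` lists database indices sorted by
    predicted similarity; `relevant` is the set of ground-truth matches."""
    hits, score = 0, 0.0
    for rank, idx in enumerate(ranking, start=1):
        if idx in relevant:
            hits += 1
            score += hits / rank  # precision at each relevant hit
    return score / max(len(relevant), 1)

def mean_average_precision(queries):
    """MAP over (relevant_set, ranking) pairs, as used for I -> T and
    T -> I retrieval evaluation."""
    return sum(average_precision(r, rk) for r, rk in queries) / len(queries)
```

For example, a ranking that places every relevant item first achieves an AP of 1.0, while a single relevant item buried at rank 3 yields 1/3.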
Key Contributions and Methods
- Adversarial Networks: The adversarial networks in SSAH refine semantic relevance and representation consistency between modalities, a configuration well suited to the heterogeneous nature of cross-modal data.
- Self-Supervised Semantic Network: The self-supervised semantic framework ensures that feature learning is guided not only by modality-specific characteristics but also by high-level semantic information drawn from multi-label data. This contrasts with conventional cross-modal hashing, which predominantly relies on single-class labels and thereby limits the richness of the available semantic supervision.
- High-Quality Hash Representation: The hash codes SSAH learns in the binary Hamming space capture cross-modal consistency more effectively than previously reported methods. This efficiency is crucial for real-world applications that require rapid retrieval over large datasets.
- Experimentation and Comparison: The thorough experimental analysis highlights the superiority of SSAH over both traditional and deep-learning-based hashing approaches. Notably, the SSAH framework consistently surpasses the compared baselines across diverse dataset configurations, achieving up to a 10% increase in MAP scores.
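The speed advantage of binary codes comes from ranking by Hamming distance, which reduces to XOR plus popcount. The helpers below are a hypothetical sketch, not the paper's code, but they show the standard retrieval mechanism that hashing methods such as SSAH rely on:

```python
def to_code(bits):
    """Pack a list of 0/1 bits into an int so codes compare via bitwise ops."""
    code = 0
    for b in bits:
        code = (code << 1) | b
    return code

def hamming(a, b):
    """Hamming distance between two packed binary codes: XOR, then popcount."""
    return bin(a ^ b).count("1")

def retrieve(query, database):
    """Rank database entries by ascending Hamming distance to the query."""
    return sorted(range(len(database)), key=lambda i: hamming(query, database[i]))
```

For instance, with 4-bit codes, querying `0011` against the database `[0011, 1111, 0010]` ranks the exact match first, the one-bit neighbor second, and the two-bit neighbor last.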
Implications and Future Directions
The implications of SSAH are significant for cross-modal retrieval, promising improved performance in multimedia applications where fast, semantically accurate retrieval is crucial. The methodology is robust and also scales well, combining deep-learning representations with the efficiency of hashing.
Looking forward, more intricate adversarial setups that further minimize modality discrepancies could be pursued to enhance performance. Alternative self-supervised mechanisms that accommodate different data types and formats are another avenue for exploration. A shift from supervised learning toward more autonomous models, such as reinforcement learning or unsupervised learning for similar tasks, could also have profound implications for the field, potentially reducing computation time and improving precision.
Conclusion
The SSAH approach offers a significant leap forward in the field of cross-modal retrieval by elegantly integrating elements of self-supervised learning and adversarial training within the cross-modal hashing paradigm. The paper convincingly argues for and demonstrates the need for such integration, providing a pathway for further advancements in the retrieval domain through cohesive semantic acquisition and representation learning.