An Overview of Cross-media Retrieval: Concepts, Methodologies, Benchmarks, and Challenges
The reviewed paper provides a comprehensive examination of cross-media retrieval, which seeks to bridge the "media gap" by enabling the retrieval of diverse media types using a single query. This process is essential for managing the vast amounts of multimedia data generated across text, images, video, audio, and 3D models. The authors identify cross-media retrieval as an emerging research area and aim to clarify its underlying concepts, methodologies, and benchmarks, while also delineating the challenges that must be addressed.
The survey organizes existing methodologies into two main categories: common space learning and cross-media similarity measurement. Common space learning techniques transform heterogeneous media into a unified representation space. These include traditional statistical correlation methods such as Canonical Correlation Analysis (CCA) and its nonlinear extensions, as well as deep neural network (DNN) approaches, which leverage non-linear mappings to improve cross-media retrieval accuracy.
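To make the common space idea concrete, the following is a minimal sketch of linear CCA written with NumPy. It is an illustration under stated assumptions, not the procedure used in the reviewed paper: the function names `fit_cca` and `cross_media_rank`, the regularization constant `reg`, the cosine-similarity ranking, and the synthetic "image" and "text" features are all hypothetical choices made for this example.

```python
import numpy as np

def fit_cca(X, Y, dim, reg=1e-3):
    """Fit regularized linear CCA on paired samples X (n x dx) and Y (n x dy)."""
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    n = X.shape[0]
    # Covariance estimates; reg keeps the within-view covariances invertible.
    Cxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(C):
        # Inverse matrix square root of a symmetric positive definite matrix.
        w, V = np.linalg.eigh(C)
        return V @ np.diag(w ** -0.5) @ V.T

    Kx, Ky = inv_sqrt(Cxx), inv_sqrt(Cyy)
    # Singular values of the whitened cross-covariance are the canonical correlations.
    U, s, Vt = np.linalg.svd(Kx @ Cxy @ Ky)
    Wx = Kx @ U[:, :dim]
    Wy = Ky @ Vt.T[:, :dim]
    return mx, my, Wx, Wy, s[:dim]

def cross_media_rank(query_vec, gallery):
    """Rank gallery rows by cosine similarity to the query in the common space."""
    q = query_vec / np.linalg.norm(query_vec)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return np.argsort(-(g @ q))

# Synthetic paired data (an assumption): both views are noisy linear
# functions of the same 2-D latent "semantic" vector.
rng = np.random.default_rng(0)
z = rng.normal(size=(200, 2))
img_feats = z @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(200, 5))
txt_feats = z @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(200, 4))

mx, my, Wx, Wy, corrs = fit_cca(img_feats, txt_feats, dim=2)
img_common = (img_feats - mx) @ Wx   # images projected into the common space
txt_common = (txt_feats - my) @ Wy   # texts projected into the common space

# Text-to-image retrieval: use the first text as the query.
ranking = cross_media_rank(txt_common[0], img_common)
```

Once both media types live in the same space, retrieval reduces to an ordinary nearest-neighbor search, which is why this family of methods tends to be cheap at query time.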
DNN-based methods stand out for their ability to capture complex non-linear relationships among media types. Prior work has experimented with architectures such as bimodal deep autoencoders and Deep CCA (DCCA), demonstrating the effectiveness of DNNs, although these models are often restricted to pairs of media types. Full end-to-end architectures that incorporate more diverse media inputs remain an area for future exploration and may yield further performance gains.
Cross-media similarity measurement methods attempt to compute similarities directly, without an explicit feature-space transformation. Graph-based approaches model the data and their co-existing associations as rich graph structures, using techniques such as similarity propagation over the constructed graphs. However, these methods are computationally intensive, often depend on the availability of intra-media and inter-media relationships, and face challenges when scaling to large datasets.
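The propagation idea can be sketched on a toy graph. The example below uses a generic manifold-ranking style update (a random walk with restart over a symmetrically normalized adjacency matrix) as a stand-in; it is not a specific method from the survey. The function name `propagate_similarity`, the damping factor `alpha`, the edge weights, and the six-node graph are all assumptions chosen for illustration.

```python
import numpy as np

def propagate_similarity(W, seed, alpha=0.8, iters=100):
    """Spread relevance from seed nodes over a weighted undirected graph.

    W    : symmetric (n x n) edge-weight matrix mixing intra-media
           similarities and inter-media co-occurrence links.
    seed : length-n vector, nonzero at the query node(s).
    """
    W = np.asarray(W, dtype=float)
    seed = np.asarray(seed, dtype=float)
    d = W.sum(axis=1)
    d[d == 0] = 1.0                      # avoid division by zero for isolated nodes
    d_inv_sqrt = 1.0 / np.sqrt(d)
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]   # D^{-1/2} W D^{-1/2}
    f = seed.copy()
    for _ in range(iters):
        f = alpha * (S @ f) + (1 - alpha) * seed
    return f

# Toy graph (hypothetical): nodes 0-2 are images, nodes 3-5 are texts.
n = 6
W = np.zeros((n, n))
edges = [
    (0, 1, 0.9),   # intra-media: image 0 resembles image 1
    (3, 4, 0.8),   # intra-media: text 3 resembles text 4
    (0, 3, 1.0),   # inter-media: image 0 co-occurs with text 3
    (2, 5, 1.0),   # inter-media: image 2 co-occurs with text 5
]
for i, j, w in edges:
    W[i, j] = W[j, i] = w

# Query with image 1 and rank the text nodes by propagated score.
seed = np.zeros(n)
seed[1] = 1.0
scores = propagate_similarity(W, seed)
text_nodes = [3, 4, 5]
ranked_texts = sorted(text_nodes, key=lambda t: -scores[t])
```

Image 1 has no direct link to any text, yet text 3 receives a score through the path image 1, image 0, text 3, while text 5 sits in a disconnected component and receives none. The dense `n x n` matrix also hints at the scaling problem noted above: memory and per-iteration cost grow quadratically with the number of items unless the graph is kept sparse.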
The authors also highlight the importance of suitable datasets for benchmarking, noting that existing publicly available datasets are typically small and cover only a limited number of media types. The XMedia dataset introduced by the authors aims to address some of these concerns, offering a more comprehensive platform that includes up to five media types.
Experimental evaluations were conducted on representative datasets using metrics such as Mean Average Precision (MAP) and Discounted Cumulative Gain (DCG). Results revealed variations in both effectiveness and efficiency across methodologies, with DNN-based methods showing potential for higher accuracy, especially when paired with advanced feature representations such as CNN features for images.
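For readers unfamiliar with these metrics, here is a minimal sketch of how they can be computed from ranked result lists. It assumes each query's full gallery is ranked, so that average precision is normalized by the number of relevant items appearing in the list; the function names and the example relevance lists are hypothetical, and the paper's exact evaluation protocol may differ.

```python
import math

def average_precision(ranked_relevance):
    """AP for one query; ranked_relevance holds 0/1 flags in rank order."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank      # precision at each relevant position
    return total / hits if hits else 0.0

def mean_average_precision(all_rankings):
    """MAP: the mean of per-query average precision."""
    return sum(average_precision(r) for r in all_rankings) / len(all_rankings)

def dcg(ranked_gains):
    """DCG with the common log2(rank + 1) discount on graded gains."""
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(ranked_gains, start=1))

# Two hypothetical queries with binary relevance in rank order.
queries = [[1, 0, 1, 0], [0, 1, 0, 0]]
map_score = mean_average_precision(queries)   # (5/6 + 1/2) / 2 = 2/3

# One hypothetical query with graded relevance.
dcg_score = dcg([3, 2, 0, 1])
```

MAP rewards rankings that place all relevant items early and treats relevance as binary, whereas DCG accepts graded relevance and discounts gains logarithmically by rank.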
Looking towards future advancements, the paper identifies critical challenges: improving dataset quality, enhancing both retrieval accuracy and efficiency, expanding the application of DNNs, and more effectively exploiting context correlation information inherent within cross-media content. Addressing these will be crucial for developing robust, scalable cross-media retrieval systems capable of supporting complex real-world applications. As cross-media retrieval evolves, it is set to have significant impacts on how multimedia data is accessed and utilized, highlighting the continued importance of research and innovation in this space.