
An Overview of Cross-media Retrieval: Concepts, Methodologies, Benchmarks and Challenges (1704.02223v4)

Published 7 Apr 2017 in cs.MM

Abstract: Multimedia retrieval plays an indispensable role in big data utilization. Past efforts mainly focused on single-media retrieval. However, the requirements of users are highly flexible, such as retrieving the relevant audio clips with one query of image. So challenges stemming from the "media gap", which means that representations of different media types are inconsistent, have attracted increasing attention. Cross-media retrieval is designed for the scenarios where the queries and retrieval results are of different media types. As a relatively new research topic, its concepts, methodologies and benchmarks are still not clear in the literatures. To address these issues, we review more than 100 references, give an overview including the concepts, methodologies, major challenges and open issues, as well as build up the benchmarks including datasets and experimental results. Researchers can directly adopt the benchmarks to promptly evaluate their proposed methods. This will help them to focus on algorithm design, rather than the time-consuming compared methods and results. It is noted that we have constructed a new dataset XMedia, which is the first publicly available dataset with up to five media types (text, image, video, audio and 3D model). We believe this overview will attract more researchers to focus on cross-media retrieval and be helpful to them.

Authors (3)
  1. Yuxin Peng (65 papers)
  2. Xin Huang (222 papers)
  3. Yunzhen Zhao (4 papers)
Citations (273)

Summary

An Overview of Cross-media Retrieval: Concepts, Methodologies, Benchmarks, and Challenges

The reviewed paper surveys cross-media retrieval, which seeks to bridge the "media gap" by letting a query of one media type retrieve relevant results of other media types. This capability matters for managing the large volumes of multimedia data generated across text, images, video, audio, and 3D models. The authors identify cross-media retrieval as an emerging research area and aim to clarify its underlying concepts, methodologies, and benchmarks, while also delineating the challenges that must be addressed.

The survey organizes existing methodologies into two main categories: common space learning and cross-media similarity measurement. Common space learning techniques transform heterogeneous media into a unified representation space, where similarity can be computed directly. These include traditional statistical correlation methods such as Canonical Correlation Analysis (CCA), its nonlinear extensions, and deep neural network (DNN) approaches, which use non-linear mappings to improve cross-media retrieval accuracy.
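As an illustration of common space learning, the following is a minimal NumPy sketch of linear CCA used for cross-media retrieval. It is not the survey's implementation. The function names (`fit_cca`, `cosine_sim`), the regularization term `reg`, the synthetic "image" and "text" features, and the choice of cosine similarity for ranking are all assumptions made for this example.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def fit_cca(X, Y, k, reg=1e-3):
    """Fit linear CCA on paired samples X (n x dx) and Y (n x dy).

    Returns the feature means, the two projection matrices, and the
    top-k canonical correlations. `reg` is a small ridge term (an
    assumption of this sketch) that keeps the covariances invertible.
    """
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    n = X.shape[0]
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)
    Kx, Ky = inv_sqrt(Sxx), inv_sqrt(Syy)
    # SVD of the whitened cross-covariance gives the canonical directions.
    U, s, Vt = np.linalg.svd(Kx @ Sxy @ Ky)
    Wx = Kx @ U[:, :k]
    Wy = Ky @ Vt.T[:, :k]
    return mx, my, Wx, Wy, s[:k]

def cosine_sim(A, B):
    """Pairwise cosine similarity between rows of A and rows of B."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

# Synthetic paired data: both "media" are noisy views of a shared latent z.
rng = np.random.default_rng(0)
z = rng.normal(size=(200, 2))
img = z @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(200, 5))
txt = z @ rng.normal(size=(2, 4)) + 0.05 * rng.normal(size=(200, 4))

mx, my, Wx, Wy, corrs = fit_cca(img, txt, k=2)
img_c = (img - mx) @ Wx   # images in the common space
txt_c = (txt - my) @ Wy   # texts in the common space

# Image-to-text retrieval: rank all texts for the first image query.
ranking = np.argsort(-cosine_sim(img_c[:1], txt_c)[0])
```

Once both media are projected into the same space, retrieval reduces to an ordinary nearest-neighbour ranking. That is the main appeal of this family of methods.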

DNN-based methods stand out for their ability to model complex non-linear relationships among media types. Prior work has experimented with architectures such as bimodal deep autoencoders and Deep CCA (DCCA), demonstrating the effectiveness of DNNs, although most of these models are restricted to pairs of media types. Fully end-to-end architectures that accept more than two media types remain an area for future exploration and may yield further performance gains.

Cross-media similarity measurement methods attempt to compute similarities directly, without an explicit transformation into a common feature space. Graph-based approaches represent media items as nodes and their known associations as edges, then use techniques such as similarity propagation over the constructed graph. However, these methods are computationally intensive, often depend on the availability of intra-media and inter-media relationships, and face challenges when scaling to large datasets.
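The idea of similarity propagation can be sketched with a generic scheme on a toy graph. This is an illustrative stand-in and not a specific method from the survey. The update rule (a symmetric-normalized propagation with damping factor `alpha`), the function name `propagate_similarity`, and the six-node graph are assumptions made for this example.

```python
import numpy as np

def propagate_similarity(W, alpha=0.8, n_iter=50):
    """Propagate similarity over a graph with symmetric edge weights W.

    Iterates F <- alpha * S @ F + (1 - alpha) * I, where S is the
    symmetrically normalized adjacency matrix. F[i, j] ends up large
    when nodes i and j are connected by short, strong paths.
    """
    n = W.shape[0]
    d = W.sum(axis=1)
    d[d == 0] = 1.0                      # isolated nodes: avoid divide-by-zero
    inv_sqrt_d = 1.0 / np.sqrt(d)
    S = W * inv_sqrt_d[:, None] * inv_sqrt_d[None, :]
    F = np.eye(n)
    for _ in range(n_iter):
        F = alpha * (S @ F) + (1 - alpha) * np.eye(n)
    return F

# Toy graph: nodes 0-2 are images, nodes 3-5 are texts.
W = np.zeros((6, 6))
edges = [
    (0, 1),  # intra-media: image 0 and image 1 look alike
    (3, 4),  # intra-media: text 3 and text 4 read alike
    (0, 3),  # inter-media: image 0 co-occurs with text 3
    (2, 5),  # inter-media: image 2 co-occurs with text 5
]
for i, j in edges:
    W[i, j] = W[j, i] = 1.0

F = propagate_similarity(W)

# Image 1 has no direct link to any text, but similarity reaches the
# texts through its neighbour image 0.
text_nodes = np.array([3, 4, 5])
scores = F[1, text_nodes]
ranked_texts = text_nodes[np.argsort(-scores)]
```

The dense `n x n` matrix `F` also hints at the scaling problem noted above: memory and compute grow quadratically with the number of items unless the graph is sparsified or the propagation is approximated.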

The authors also highlight the importance of suitable datasets for benchmarking, noting that existing publicly available datasets are typically small and cover only a limited number of media types. The XMedia dataset introduced by the authors aims to address some of these concerns, offering a more comprehensive platform that includes up to five media types.

Experimental evaluations were conducted on representative datasets using metrics such as Mean Average Precision (MAP) and Discounted Cumulative Gain (DCG). Results varied across methodologies in both accuracy and efficiency, with DNN-based methods showing potential for higher accuracy, especially when paired with stronger feature representations such as CNN features for images.
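For reference, MAP can be computed as in the short sketch below. It assumes that a retrieved item counts as relevant when it shares the query's category label, which is a common convention in this setting. The function names and the toy similarity matrix are hypothetical.

```python
import numpy as np

def average_precision(ranked_relevance):
    """AP for one query, given 0/1 relevance flags in ranked order."""
    rel = np.asarray(ranked_relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    ranks = np.arange(1, len(rel) + 1)
    precision_at_k = np.cumsum(rel) / ranks
    # Average the precision values at the ranks where a relevant item appears.
    return float((precision_at_k * rel).sum() / rel.sum())

def mean_average_precision(sim, query_labels, gallery_labels):
    """MAP over all queries; sim[i, j] scores query i against gallery item j."""
    aps = []
    for i in range(sim.shape[0]):
        order = np.argsort(-sim[i])                  # best match first
        rel = gallery_labels[order] == query_labels[i]
        aps.append(average_precision(rel))
    return float(np.mean(aps))

# Single query: relevant items at ranks 1 and 3 -> (1/1 + 2/3) / 2
ap = average_precision([1, 0, 1, 0])

# Two image queries against three text items (toy scores and labels).
sim = np.array([[0.9, 0.2, 0.8],
                [0.6, 0.5, 0.1]])
query_labels = np.array([0, 1])
gallery_labels = np.array([0, 1, 0])
map_score = mean_average_precision(sim, query_labels, gallery_labels)
```

In the toy example the first query ranks both relevant texts at the top (AP of 1.0), while the second places its only relevant text at rank 2 (AP of 0.5), giving a MAP of 0.75.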

Looking towards future advancements, the paper identifies critical challenges: improving dataset quality, enhancing both retrieval accuracy and efficiency, expanding the application of DNNs, and more effectively exploiting context correlation information inherent within cross-media content. Addressing these will be crucial for developing robust, scalable cross-media retrieval systems capable of supporting complex real-world applications. As cross-media retrieval evolves, it is set to have significant impacts on how multimedia data is accessed and utilized, highlighting the continued importance of research and innovation in this space.