Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval (2309.17093v3)

Published 29 Sep 2023 in cs.CV

Abstract: Cross-modal Retrieval methods build similarity relations between vision and language modalities by jointly learning a common representation space. However, the predictions are often unreliable due to the Aleatoric uncertainty, which is induced by low-quality data, e.g., corrupt images, fast-paced videos, and non-detailed texts. In this paper, we propose a novel Prototype-based Aleatoric Uncertainty Quantification (PAU) framework to provide trustworthy predictions by quantifying the uncertainty arising from the inherent data ambiguity. Concretely, we first construct a set of various learnable prototypes for each modality to represent the entire semantics subspace. Then Dempster-Shafer Theory and Subjective Logic Theory are utilized to build an evidential theoretical framework by associating evidence with Dirichlet Distribution parameters. The PAU model induces accurate uncertainty and reliable predictions for cross-modal retrieval. Extensive experiments are performed on four major benchmark datasets of MSR-VTT, MSVD, DiDeMo, and MS-COCO, demonstrating the effectiveness of our method. The code is accessible at https://github.com/leolee99/PAU.

Summary

The paper presents PAU, which quantifies aleatoric uncertainty using learnable prototypes and Dempster-Shafer Theory to improve retrieval reliability.
It employs uncertainty and diversity loss functions along with a re-ranking strategy to handle ambiguous and noisy data effectively.
Experiments on benchmarks like MSR-VTT and MS-COCO demonstrate significant performance gains over state-of-the-art methods in cross-modal retrieval.

This paper introduces a novel framework called Prototype-based Aleatoric Uncertainty Quantification (PAU) tailored for cross-modal retrieval tasks. By addressing the aleatoric uncertainty caused by inherent data quality issues, PAU aims to improve the reliability of predictions in vision and language similarity estimation.

Introduction

The challenge of cross-modal retrieval lies in transforming distinct modality representations into a common embedding space in which similarity can be evaluated effectively. Traditional methods often overlook the quality of the input data, which can vary significantly and lead to unreliable predictions. This research focuses on quantifying aleatoric uncertainty, particularly the uncertainty arising from ambiguous multi-modal data such as fast-paced videos and non-detailed texts. Such data can match several semantics at once, so the distribution over their possible matches has high information entropy, which is what makes predictions on them uncertain (Figure 1).

Figure 1: Illustration of confused matching in fast-paced videos and non-detailed texts, assuming the possible semantics of each modal subspace are finite, with $K$ categories. (a) A single-scene Video A can match only one semantic, "talking". By contrast, a multi-scene Video B can match three semantics: "talking", "shadow", and "cave". (b) Text A can match only the left video, while Text B, with some details removed (in red), matches both videos.
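
The link between multiple plausible matches and high entropy can be made concrete with a small calculation. The sketch below is illustrative only: the four-category semantic space and the uniform spread over matched categories are assumptions, not values from the paper.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical matching distributions over K = 4 semantic categories.
video_a = [1.0, 0.0, 0.0, 0.0]      # single scene: only "talking"
video_b = [1 / 3, 1 / 3, 1 / 3, 0.0]  # three scenes: "talking", "shadow", "cave"

h_a = entropy(video_a)  # no ambiguity, so the entropy is zero
h_b = entropy(video_b)  # mass spread over three categories gives log(3)
```

Under these assumptions the single-scene video has zero entropy, while the multi-scene video has entropy $\log 3$, matching the intuition in Figure 1(a).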

Methodology

Framework Overview

PAU uses a set of learnable prototypes for each modality, each prototype representing one semantic category within that modality's subspace. Similarities between an instance and these prototypes are used to assess how ambiguous the instance is. The system employs Dempster-Shafer Theory (DST) together with Subjective Logic to model uncertainty by associating evidence with the parameters of a Dirichlet distribution (Figure 2).

Figure 2: The Framework of PAU. The visual encoder $\phi_v$ and textual encoder $\phi_t$ separately map the visual and textual instances into a joint embedding space to calculate the similarity matrix $M$. A dot product function is used to build a set of similarity vectors $\mathbf{P}$.

Uncertainty Quantification

For each modality, $K$ prototypes are constructed. The similarity between an instance and these prototypes informs belief masses in the DST framework. The aleatoric uncertainty is computed as $u = 1 - \psi$, where $\psi$ represents the certainty mass. The evidence $e_k$ of an instance semantically matching a prototype is derived from cosine similarity, feeding into the DST to gauge overall data uncertainty.
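
The relationship between evidence, belief, and uncertainty can be illustrated with the standard Subjective Logic construction: Dirichlet parameters $\alpha_k = e_k + 1$, strength $S = \sum_k \alpha_k$, belief masses $b_k = e_k / S$, and uncertainty $u = K / S$, so that $u = 1 - \sum_k b_k$. This is a minimal sketch, not the paper's implementation. Reading $\psi$ as the total belief mass $\sum_k b_k$, the scaled-ReLU evidence function, and the `scale` parameter are all assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def evidence_from_similarity(sims, scale=10.0):
    """Map similarities to non-negative evidence (assumed form: scaled ReLU)."""
    return [scale * max(s, 0.0) for s in sims]

def subjective_logic(evidence):
    """Return (belief masses, uncertainty) from non-negative evidence."""
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]       # Dirichlet parameters
    strength = sum(alpha)                     # Dirichlet strength S
    belief = [e / strength for e in evidence]
    uncertainty = num_classes / strength      # equals 1 - sum(belief)
    return belief, uncertainty

# Hypothetical example: two prototypes and one instance aligned with the first.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
instance = [3.0, 0.0]
sims = [cosine(instance, p) for p in prototypes]
belief, u = subjective_logic(evidence_from_similarity(sims))

# With no evidence at all, every belief mass is zero and uncertainty is 1.
_, u_no_evidence = subjective_logic([0.0, 0.0])
```

In this construction the uncertainty shrinks as total evidence grows, and the belief masses and uncertainty always sum to one.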

Training and Diversity Loss

Prototypes are initialized with Xavier initialization, promoting diverse semantic coverage across the subspace. Two primary losses are considered:

Uncertainty Loss: Aligns the instance's uncertainty with its mean similarity to strengthen semantic representation.
Diversity Loss: Ensures that prototypes represent disjoint semantics, satisfying the mutual-exclusivity prerequisite of DST.
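
One common way to encourage disjoint prototypes is to penalize the cosine similarity between every pair of distinct prototypes. The sketch below takes that approach; the squared-cosine penalty and the function name `diversity_loss` are assumptions for illustration and may differ from the loss actually used in PAU.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def diversity_loss(prototypes):
    """Mean squared cosine similarity over all distinct prototype pairs.

    Requires at least two prototypes. The value is 0 when all prototypes
    are mutually orthogonal and 1 when they all point the same way.
    """
    unit = [normalize(p) for p in prototypes]
    total, pairs = 0.0, 0
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            cos = sum(a * b for a, b in zip(unit[i], unit[j]))
            total += cos * cos
            pairs += 1
    return total / pairs

loss_orthogonal = diversity_loss([[1.0, 0.0], [0.0, 1.0]])  # disjoint semantics
loss_collapsed = diversity_loss([[1.0, 1.0], [2.0, 2.0]])   # redundant prototypes
```

Minimizing such a term pushes prototypes apart, which is the behavior the mutual-exclusivity requirement calls for.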

Re-ranking is applied post-training to adjust predictions by inversely weighting similarity scores with uncertainty, enhancing prediction reliability.
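
The re-ranking idea can be sketched as a penalty on candidates with high uncertainty. The additive form `sim - beta * u`, the weight `beta`, and the function name `rerank` are assumptions chosen for illustration. The exact adjustment used by PAU is not specified here.

```python
def rerank(sim, cand_uncertainty, beta=0.1):
    """Penalize each candidate's score by its uncertainty, then rank.

    sim[i][j] is the similarity between query i and candidate j.
    Returns the adjusted scores and, per query, candidate indices
    sorted from best to worst.
    """
    adjusted = [
        [s - beta * u for s, u in zip(row, cand_uncertainty)]
        for row in sim
    ]
    rankings = [
        sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        for row in adjusted
    ]
    return adjusted, rankings

# Hypothetical example: candidate 0 scores slightly higher but is far more uncertain.
sim = [[0.80, 0.78]]
cand_uncertainty = [0.9, 0.1]
adjusted, rankings = rerank(sim, cand_uncertainty, beta=0.1)
```

In this example the raw scores favor candidate 0, but after the penalty the more certain candidate 1 is ranked first.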

Experiments and Results

Extensive experiments on MSR-VTT, MSVD, DiDeMo, and MS-COCO support PAU's effectiveness. The framework improves over previous methods in both text-to-video and video-to-text retrieval, with the retrieval score gains reported in Tables 1 to 4 of the paper.

Comparison with Other Methods

PAU was benchmarked against recent state-of-the-art approaches, including VSE$\infty$, PCME, and transformer-based models, showing consistent performance boosts, especially in scenarios with high ambiguity.

Robustness in Noisy Conditions

PAU's robustness was further tested with artificially introduced correspondence noise, where it showed significant resilience and outperformed existing models like PCME in scenarios with 20% to 50% noise (Figure 3).

Figure 3: Comparison of the performance changes after removing the top-r instances with the highest uncertainty scores, as quantified by PCME and PAU, on MS-COCO.
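
The evaluation protocol in Figure 3 can be approximated as follows: sort instances by uncertainty, drop the r most uncertain, and recompute a retrieval metric on what remains. This is a simplified sketch under stated assumptions: the removed instances are treated as queries, query i's ground truth is candidate i, and Recall@1 is the metric. The helper names are hypothetical.

```python
def drop_most_uncertain(uncertainty, r):
    """Indices that remain after removing the r highest-uncertainty instances."""
    order = sorted(range(len(uncertainty)),
                   key=lambda i: uncertainty[i], reverse=True)
    return sorted(order[r:])

def recall_at_1(sim, keep):
    """Fraction of kept queries whose top-scoring candidate is the ground truth."""
    hits = 0
    for i in keep:
        best = max(range(len(sim[i])), key=lambda j: sim[i][j])
        hits += int(best == i)
    return hits / len(keep)

# Hypothetical scores: query 2 is mismatched and also the most uncertain.
sim = [
    [0.9, 0.1, 0.2],
    [0.2, 0.8, 0.1],
    [0.6, 0.3, 0.5],
]
uncertainty = [0.1, 0.2, 0.7]

r1_all = recall_at_1(sim, drop_most_uncertain(uncertainty, 0))
r1_filtered = recall_at_1(sim, drop_most_uncertain(uncertainty, 1))
```

If the uncertainty scores are informative, the metric should rise as uncertain instances are removed, which is the trend this kind of comparison is designed to reveal.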

Conclusion

The Prototype-based Aleatoric Uncertainty Quantification (PAU) framework effectively enhances prediction reliability in cross-modal retrieval by addressing inherent data ambiguities. With its novel approach to uncertainty quantification, PAU not only provides dependable predictions but also paves the way for future advancements in multimedia information retrieval systems. Its adaptability for tasks with varying data quality underscores its potential for broader applications in AI-driven multi-modal interactions.