- The paper introduces LAVCap, a novel framework for audio-visual captioning that effectively fuses audio and visual inputs using Large Language Models and optimal transport.
- LAVCap employs optimal transport-based alignment and attention mechanisms to semantically bridge the gap between audio and visual features, improving fusion without large datasets.
- Experimental results show LAVCap outperforms state-of-the-art methods on the AudioCaps benchmark, demonstrating improved captioning accuracy and efficiency.
The paper, "LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport," proposes a framework for automated audio-visual captioning that uses an LLM together with optimal transport to align and integrate the audio and visual modalities. The key motivation is that existing methods fuse audio and visual data ineffectively and therefore miss critical semantic cues.
Framework Overview
LAVCap Framework:
- The framework leverages visual information to enhance audio captioning, employing an optimal transport-based alignment loss to bridge the modality gap between audio and visual features.
- An optimal transport attention module is introduced to facilitate audio-visual fusion via an optimal transport assignment map.
- The method does not rely on large-scale training data or post-processing; instead, it focuses on fusing the audio-visual inputs effectively before they are passed to the LLM.
Methodology
Audio-Visual Encoding:
- Audio and visual inputs $(x_a, x_v)$ are processed by their respective encoders $(E_a, E_v)$ into feature representations $h_a$ and $h_v$, both with feature dimension $C$.
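To make the shapes concrete, here is a minimal, hypothetical sketch of the encoding step in PyTorch; the encoder stand-ins, dimensions, and token counts are illustrative assumptions, not the encoders used in the paper.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins for the audio and visual encoders E_a and E_v.
# In the paper these are pretrained modality encoders; here only the output
# shapes matter: both map their inputs into the same C-dimensional space.
C = 768                        # shared feature dimension (assumed)
E_a = nn.Linear(128, C)        # audio tokens  (N_a, 128)  -> (N_a, C)
E_v = nn.Linear(1024, C)       # visual tokens (N_v, 1024) -> (N_v, C)

x_a = torch.randn(64, 128)     # N_a = 64 audio tokens (e.g. spectrogram patches)
x_v = torch.randn(196, 1024)   # N_v = 196 visual tokens (e.g. ViT patches)
h_a, h_v = E_a(x_a), E_v(x_v)  # features h_a, h_v reused in the sketches below
```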
Optimal Transport-based Alignment:
- Modality alignment is cast as an optimal transport (OT) problem. The paper introduces an OT-based alignment loss that encourages the encoders to produce semantically aligned features.
- The OT loss optimizes cross-modal token alignment via an assignment map Q, computed with the Sinkhorn-Knopp algorithm, so that the global similarity between cross-modal features is maximized.
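A minimal sketch of how such an assignment map and alignment loss could be computed with the Sinkhorn-Knopp iteration, assuming a cost of one minus cosine similarity and uniform marginals; the hyperparameters (`eps`, `n_iters`) and the exact loss form are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def sinkhorn_knopp(cost, eps=0.05, n_iters=3):
    """Approximate OT assignment map Q for a token-level cost matrix.

    cost: (N_a, N_v) pairwise cost between audio and visual tokens.
    Alternating normalizations push Q toward uniform row/column marginals;
    the final scaling makes each row sum to ~1, so Q can also act as an
    attention-like weighting over the other modality's tokens.
    """
    Q = torch.exp(-cost / eps)                      # Gibbs kernel
    Q = Q / Q.sum()                                 # total mass = 1
    n_a, n_v = Q.shape
    for _ in range(n_iters):
        Q = Q / (Q.sum(dim=1, keepdim=True) * n_a)  # rows    -> 1 / N_a
        Q = Q / (Q.sum(dim=0, keepdim=True) * n_v)  # columns -> 1 / N_v
    return Q * n_a                                  # rows sum to ~1

def ot_alignment_loss(h_a, h_v, eps=0.05):
    """Alignment loss sketch: maximize global cross-modal similarity under Q."""
    sim = F.normalize(h_a, dim=-1) @ F.normalize(h_v, dim=-1).T  # cosine similarities
    Q = sinkhorn_knopp(1.0 - sim, eps=eps).detach()              # transport plan
    return -(Q * sim).sum()                                      # higher similarity under Q -> lower loss
```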
Fusion and Projection:
- The proposed OT-Att module applies the OT assignment map as attention weights for fusing the features (see the code sketch after this list), expressed as:

  $$\hat{h}_a = h_a + Q^{*} h_v \quad \text{and} \quad \hat{h}_v = h_v + Q^{*\top} h_a,$$

  where $Q^{*}$ is the optimal assignment map.
- Fused representations are projected to the LLM latent space using a linear projector.
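A sketch of what the OT-based attention fusion and projection could look like, reusing `sinkhorn_knopp` and the features from the earlier sketches; the LLM embedding dimension and the single linear projector are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def ot_attention_fuse(h_a, h_v, eps=0.05):
    """Use the OT assignment map Q (rather than learned softmax attention)
    to inject each modality's tokens into the other, per the formulas above."""
    sim = F.normalize(h_a, dim=-1) @ F.normalize(h_v, dim=-1).T
    Q = sinkhorn_knopp(1.0 - sim, eps=eps)   # (N_a, N_v) assignment map
    h_a_hat = h_a + Q @ h_v                  # \hat{h}_a = h_a + Q* h_v
    h_v_hat = h_v + Q.T @ h_a                # \hat{h}_v = h_v + Q*^T h_a
    return h_a_hat, h_v_hat

llm_dim = 4096                               # assumed LLM hidden size
projector = nn.Linear(768, llm_dim)          # linear projection into the LLM latent space
h_a_hat, h_v_hat = ot_attention_fuse(h_a, h_v)
prefix = projector(torch.cat([h_a_hat, h_v_hat], dim=0))  # multimodal tokens fed to the LLM
```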
Text Decoding and Training:
- An LLM generates the captions. Training combines the OT alignment loss with a cross-entropy loss, optimizing for both semantic feature alignment and text-generation fidelity.
- Given the limited training data, the LLM is fine-tuned with low-rank adaptation (LoRA).
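A hedged sketch of the combined objective, reusing `ot_alignment_loss` from above; the weighting factor `lambda_ot` is a hypothetical hyperparameter, and the LoRA-adapted LLM itself is omitted here.

```python
import torch.nn.functional as F

lambda_ot = 0.1   # assumed weight balancing alignment against caption generation

def training_loss(logits, target_ids, h_a, h_v):
    """Cross-entropy over generated caption tokens plus the OT alignment term."""
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), target_ids.view(-1))
    return ce + lambda_ot * ot_alignment_loss(h_a, h_v)
```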
Experimental Results
Performance Evaluation:
- LAVCap outperforms state-of-the-art methods on the AudioCaps benchmark, with improvements in METEOR, CIDEr, and SPICE.
- Notably, it achieves this without pre-training on large datasets, indicating that the semantic alignment and fusion mechanisms are used efficiently.
Ablation Studies:
- The ablations confirm the importance of the OT loss in bridging the modality gap and enriching the multimodal context; the OT attention module fuses audio and visual features more effectively than standard alternatives such as cross-attention.
- The choice of instruction prompt for the LLM decoder is also shown to influence final model performance.
Qualitative Analysis:
- Comparative evaluations show that models integrating both audio and visual inputs generate more descriptive and contextually accurate captions.
Conclusion
LAVCap stands out by effectively incorporating the visual modality into audio captioning through optimal transport, yielding semantically rich alignments and improved captioning accuracy. The work demonstrates the potential of combining optimal transport with LLMs in multimodal learning.