- The paper introduces LAVCap, a novel framework for audio-visual captioning that effectively fuses audio and visual inputs using Large Language Models and optimal transport.
- LAVCap employs optimal transport-based alignment and attention mechanisms to semantically bridge the gap between audio and visual features, improving fusion without large datasets.
- Experimental results show LAVCap outperforms state-of-the-art methods on the AudioCaps benchmark, demonstrating improved captioning accuracy and efficiency.
The paper, "LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport," proposes a framework for automated audio-visual captioning that uses an LLM together with optimal transport to align and integrate the audio and visual modalities. The key motivation is that existing methods fuse audio and visual data ineffectively and therefore miss critical semantic cues.
Framework Overview
LAVCap Framework:
- The framework leverages visual information to enhance audio captioning, employing an optimal transport-based alignment loss to bridge the modality gap between audio and visual features.
- An optimal transport attention module is introduced to facilitate audio-visual fusion via an optimal transport assignment map.
- The method does not rely on large-scale training data or post-processing; instead, it focuses on fusing the audio-visual inputs effectively before they are passed to the LLM.
Methodology
Audio-Visual Encoding:
- Audio and visual inputs $(x_a, x_v)$ are processed by their respective encoders $(E_a, E_v)$ into feature representations $h_a$ and $h_v$, both with feature dimension $C$.
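To make the shapes concrete, here is a minimal, hypothetical sketch of the encoding step in PyTorch; the encoder stand-ins, dimensions, and token counts are illustrative assumptions, not the encoders used in the paper.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins for the audio and visual encoders E_a and E_v.
# In the paper these are pretrained modality encoders; here only the output
# shapes matter: both map their inputs into the same C-dimensional space.
C = 768                        # shared feature dimension (assumed)
E_a = nn.Linear(128, C)        # audio tokens  (N_a, 128)  -> (N_a, C)
E_v = nn.Linear(1024, C)       # visual tokens (N_v, 1024) -> (N_v, C)

x_a = torch.randn(64, 128)     # N_a = 64 audio tokens (e.g. spectrogram patches)
x_v = torch.randn(196, 1024)   # N_v = 196 visual tokens (e.g. ViT patches)
h_a, h_v = E_a(x_a), E_v(x_v)  # features h_a, h_v reused in the sketches below
```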
Optimal Transport-based Alignment:
- Modality alignment is cast as an optimal transport (OT) problem. The paper introduces an OT-based alignment loss that encourages the encoders to produce semantically aligned features.
- The OT loss optimizes cross-modal token alignment via an assignment map Q, computed with the Sinkhorn-Knopp algorithm, so that the global similarity between cross-modal features is maximized.
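A minimal sketch of how such an assignment map and alignment loss could be computed with the Sinkhorn-Knopp iteration, assuming a cost of one minus cosine similarity and uniform marginals; the hyperparameters (`eps`, `n_iters`) and the exact loss form are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def sinkhorn_knopp(cost, eps=0.05, n_iters=3):
    """Approximate OT assignment map Q for a token-level cost matrix.

    cost: (N_a, N_v) pairwise cost between audio and visual tokens.
    Alternating normalizations push Q toward uniform row/column marginals;
    the final scaling makes each row sum to ~1, so Q can also act as an
    attention-like weighting over the other modality's tokens.
    """
    Q = torch.exp(-cost / eps)                      # Gibbs kernel
    Q = Q / Q.sum()                                 # total mass = 1
    n_a, n_v = Q.shape
    for _ in range(n_iters):
        Q = Q / (Q.sum(dim=1, keepdim=True) * n_a)  # rows    -> 1 / N_a
        Q = Q / (Q.sum(dim=0, keepdim=True) * n_v)  # columns -> 1 / N_v
    return Q * n_a                                  # rows sum to ~1

def ot_alignment_loss(h_a, h_v, eps=0.05):
    """Alignment loss sketch: maximize global cross-modal similarity under Q."""
    sim = F.normalize(h_a, dim=-1) @ F.normalize(h_v, dim=-1).T  # cosine similarities
    Q = sinkhorn_knopp(1.0 - sim, eps=eps).detach()              # transport plan
    return -(Q * sim).sum()                                      # higher similarity under Q -> lower loss
```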
Fusion and Projection:
- The proposed OT-Att module applies the OT assignment map as attention weights for fusing the features (see the code sketch after this list), expressed as:

  $$\hat{h}_a = h_a + Q^{*} h_v \quad \text{and} \quad \hat{h}_v = h_v + Q^{*\top} h_a,$$

  where $Q^{*}$ is the optimal assignment map.
- Fused representations are projected to the LLM latent space using a linear projector.
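A sketch of what the OT-based attention fusion and projection could look like, reusing `sinkhorn_knopp` and the features from the earlier sketches; the LLM embedding dimension and the single linear projector are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def ot_attention_fuse(h_a, h_v, eps=0.05):
    """Use the OT assignment map Q (rather than learned softmax attention)
    to inject each modality's tokens into the other, per the formulas above."""
    sim = F.normalize(h_a, dim=-1) @ F.normalize(h_v, dim=-1).T
    Q = sinkhorn_knopp(1.0 - sim, eps=eps)   # (N_a, N_v) assignment map
    h_a_hat = h_a + Q @ h_v                  # \hat{h}_a = h_a + Q* h_v
    h_v_hat = h_v + Q.T @ h_a                # \hat{h}_v = h_v + Q*^T h_a
    return h_a_hat, h_v_hat

llm_dim = 4096                               # assumed LLM hidden size
projector = nn.Linear(768, llm_dim)          # linear projection into the LLM latent space
h_a_hat, h_v_hat = ot_attention_fuse(h_a, h_v)
prefix = projector(torch.cat([h_a_hat, h_v_hat], dim=0))  # multimodal tokens fed to the LLM
```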
Text Decoding and Training:
- An LLM generates the captions. Training combines the OT alignment loss with a cross-entropy loss, optimizing for both semantic feature alignment and text-generation fidelity.
- Given the limited training data, the LLM is fine-tuned with low-rank adaptation (LoRA).
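A hedged sketch of the combined objective, reusing `ot_alignment_loss` from above; the weighting factor `lambda_ot` is a hypothetical hyperparameter, and the LoRA-adapted LLM itself is omitted here.

```python
import torch.nn.functional as F

lambda_ot = 0.1   # assumed weight balancing alignment against caption generation

def training_loss(logits, target_ids, h_a, h_v):
    """Cross-entropy over generated caption tokens plus the OT alignment term."""
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), target_ids.view(-1))
    return ce + lambda_ot * ot_alignment_loss(h_a, h_v)
```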
Experimental Results
Performance Evaluation:
- LAVCap outperforms state-of-the-art methods on the AudioCaps benchmark, with improvements in METEOR, CIDEr, and SPICE.
- Notably, it achieves this without pre-training on large datasets, indicating that the semantic alignment and fusion mechanisms are used efficiently.
Ablation Studies:
- The ablations confirm the importance of the OT loss in bridging the modality gap and enriching the multimodal context; the OT attention module fuses audio and visual features more effectively than standard alternatives such as cross-attention.
- The choice of instruction prompt for the LLM decoder is also shown to influence final model performance.
Qualitative Analysis:
- Comparative evaluations show that models integrating both audio and visual inputs generate more descriptive and contextually accurate captions.
Conclusion
LAVCap stands out by effectively incorporating the visual modality into audio captioning through optimal transport, yielding semantically rich alignments and improved captioning accuracy. The work demonstrates the potential of combining optimal transport with LLMs in multimodal learning.