A SOUND APPROACH: Using Large Language Models to generate audio descriptions for egocentric text-audio retrieval (2402.19106v1)

Published 29 Feb 2024 in eess.AS, cs.IR, and cs.SD

Abstract: Video databases from the internet are a valuable source of text-audio retrieval datasets. However, given that sound and vision streams represent different "views" of the data, treating visual descriptions as audio descriptions is far from optimal. Even if audio class labels are present, they commonly are not very detailed, making them unsuited for text-audio retrieval. To exploit relevant audio information from video-text datasets, we introduce a methodology for generating audio-centric descriptions using LLMs. In this work, we consider the egocentric video setting and propose three new text-audio retrieval benchmarks based on the EpicMIR and EgoMCQ tasks, and on the EpicSounds dataset. Our approach for obtaining audio-centric descriptions gives significantly higher zero-shot performance than using the original visual-centric descriptions. Furthermore, we show that using the same prompts, we can successfully employ LLMs to improve the retrieval on EpicSounds, compared to using the original audio class labels of the dataset. Finally, we confirm that LLMs can be used to determine the difficulty of identifying the action associated with a sound.

Enhancing Egocentric Text-Audio Retrieval with LLMs

Introduction to the Study

The pervasive growth of audio and video content online underscores the need for effective retrieval systems. This paper tackles the problem of searching such content by using LLMs to map visual-centric descriptions to audio-centric ones, which markedly improves text-audio retrieval. The approach is validated through the creation and evaluation of three benchmarks in an egocentric video setting, demonstrating the utility of LLMs in bridging the gap between the visual and audio modalities.

Literature Review and Background

The paper sits at the intersection of text-audio retrieval and the application of LLMs to multimodal tasks, drawing on advances in transformer-based models and on LLMs such as ChatGPT applied to visual and audio understanding. It stands out by leveraging visual-centric datasets to generate audio descriptions, in contrast to prior work built on audio-centric datasets. It also adds to the growing interest in egocentric (first-person) data, previously exploited mainly for tasks such as action recognition and video retrieval, by extending it to the audio domain.

Methodology and Datasets

The researchers build on several existing datasets: EpicKitchens and Ego4D for egocentric video, and EpicSounds for audio classification. The core of the methodology is to condition an LLM on paired examples of visual and audio descriptions so that it generates new, audio-centric descriptions from visual-centric ones. The few-shot examples are drawn from the Kinetics700-2020 and AudioCaps datasets, showing that the LLM can translate between modalities from a handful of in-context demonstrations.
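
To make the few-shot setup concrete, the sketch below assembles a prompt from paired visual/audio captions and sends it to a chat LLM. It is a minimal illustration only: the example pairs, the prompt wording, and the model choice are assumptions rather than the authors' exact prompt, and the OpenAI client is just one possible backend.

```python
# Sketch: few-shot prompting an LLM to rewrite a visual-centric caption as an
# audio-centric description. Example pairs and wording are illustrative only.
from openai import OpenAI

# Hypothetical in-context pairs (in the paper, such pairs are drawn from
# Kinetics700-2020 and AudioCaps style data).
FEW_SHOT_PAIRS = [
    ("A person chops carrots on a wooden board",
     "Rhythmic knife thuds against a wooden chopping board"),
    ("Someone pours water from a kettle into a mug",
     "Liquid splashing and gurgling as water fills a ceramic mug"),
]

def build_prompt(visual_caption: str) -> str:
    """Assemble a few-shot prompt mapping visual captions to audio descriptions."""
    lines = ["Rewrite each visual description as a description of what can be heard."]
    for visual, audio in FEW_SHOT_PAIRS:
        lines.append(f"Visual: {visual}\nAudio: {audio}")
    lines.append(f"Visual: {visual_caption}\nAudio:")
    return "\n\n".join(lines)

def audio_description(visual_caption: str, model: str = "gpt-3.5-turbo") -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_prompt(visual_caption)}],
        temperature=0.2,
    )
    return response.choices[0].message.content.strip()

if __name__ == "__main__":
    print(audio_description("The person opens the fridge and takes out a bottle of milk"))
```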

Research Contributions and Findings

New Benchmarks and Audio Description Generation

By applying the methodology to the EpicMIR and EgoMCQ tasks and to the EpicSounds dataset, the paper curates three new benchmarks for egocentric text-audio retrieval. On these benchmarks, the LLM-generated audio-centric descriptions yield significantly better zero-shot retrieval performance than the original visual-centric captions, and on EpicSounds they also outperform the dataset's original audio class labels.
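
The zero-shot evaluation itself follows the standard protocol of ranking candidates by text-audio similarity. The snippet below sketches a recall@k computation from a precomputed similarity matrix; the metric choice and the one-to-one query/clip pairing are generic assumptions for illustration, not necessarily the exact protocol of the new benchmarks.

```python
# Sketch: recall@k for text-to-audio retrieval from a similarity matrix.
# similarities[i, j] is the score between text query i and audio clip j;
# query i's ground-truth clip is assumed to be clip i (a common convention,
# not necessarily the exact setup of the paper's benchmarks).
import numpy as np

def recall_at_k(similarities: np.ndarray, k: int) -> float:
    n_queries = similarities.shape[0]
    # Top-k audio clips for each text query, highest score first.
    top_k = np.argsort(-similarities, axis=1)[:, :k]
    hits = (top_k == np.arange(n_queries)[:, None]).any(axis=1)
    return float(hits.mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sims = rng.normal(size=(100, 100))           # placeholder similarity scores
    sims[np.arange(100), np.arange(100)] += 2.0  # make true pairs score higher
    for k in (1, 5, 10):
        print(f"R@{k}: {recall_at_k(sims, k):.3f}")
```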

Evaluating Audio Relevancy

A further contribution is the use of LLMs to assess audio relevancy. The authors show that an LLM can categorize samples by how informative the audio alone is, i.e. how difficult it is to identify an action from its sound, which allows non-informative audio content to be filtered out of a dataset. This has practical implications for dataset curation and highlights a second role for LLMs in the retrieval pipeline.
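
A minimal sketch of such a relevancy check is shown below. The category names and the prompt wording are assumptions made for illustration; the paper's exact instructions to the LLM may differ.

```python
# Sketch: asking an LLM how identifiable an action is from its sound alone.
# Categories and prompt text are illustrative placeholders.
from openai import OpenAI

CATEGORIES = [
    "clearly identifiable from audio",
    "ambiguous from audio",
    "audio uninformative",
]

def rate_audio_informativeness(action_label: str, model: str = "gpt-3.5-turbo") -> str:
    prompt = (
        f"Action: '{action_label}'.\n"
        "How easy would it be to recognise this action from its sound alone? "
        f"Answer with exactly one of: {', '.join(CATEGORIES)}."
    )
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return response.choices[0].message.content.strip()

if __name__ == "__main__":
    for action in ["chopping vegetables", "reading a recipe", "closing a drawer"]:
        print(action, "->", rate_audio_informativeness(action))
```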

Evaluation and Implications

The evaluation, zero-shot egocentric text-audio retrieval across the newly developed benchmarks, confirms the efficacy of the proposed methodology. Notably, the pre-trained LAION-CLAP and WavCaps models are used off the shelf, without fine-tuning, which underlines that the generated audio descriptions transfer well to existing text-audio models.
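
For concreteness, the snippet below shows how zero-shot text-audio similarity scores could be obtained with a pre-trained CLAP model via the laion_clap package; the file paths and queries are placeholders, and the API usage follows the package's public README rather than the paper's own code (which may use different checkpoints and preprocessing).

```python
# Sketch: zero-shot text-audio retrieval scores with a pre-trained CLAP model.
# Paths and queries are placeholders; API usage follows the laion_clap README.
import numpy as np
import laion_clap

model = laion_clap.CLAP_Module(enable_fusion=False)
model.load_ckpt()  # loads a default pre-trained checkpoint

audio_files = ["clip_0001.wav", "clip_0002.wav"]  # placeholder paths
queries = [
    "metal pan clangs against a stove",           # LLM-generated
    "water running from a tap into a sink",       # audio descriptions
]

audio_emb = model.get_audio_embedding_from_filelist(x=audio_files)  # (n_audio, d)
text_emb = model.get_text_embedding(queries)                        # (n_text, d)

# Cosine similarity between every query and every clip, then rank clips per query.
a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
similarities = t @ a.T
ranking = np.argsort(-similarities, axis=1)
print(ranking[:, 0])  # index of the top-ranked clip for each query
```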

Conclusion and Future Directions

This paper advances text-audio retrieval in the context of egocentric video data. The demonstrated ability of LLMs to generate meaningful, audio-centric descriptions from visual descriptions offers a practical route to better multimodal retrieval systems. As the field progresses, the methodology and insights from this work could be extended beyond egocentric datasets, broadening their applicability to text-audio understanding and retrieval in general, and the findings point to LLMs as a useful tool for improving search across a wide range of multimedia content.

References (32)
  1. OpenAI, "ChatGPT," https://openai.com/blog/chatgpt, accessed July and August 2023.
  2. "Llama 2: Open Foundation and Fine-Tuned Chat Models," arXiv:2307.09288, 2023.
  3. "A Short Note on the Kinetics-700-2020 Human Action Dataset," arXiv:2010.10864, 2020.
  4. "AudioCaps: Generating Captions for Audios in the Wild," in Proc. NAACL, 2019.
  5. "Scaling Egocentric Vision: The EPIC-KITCHENS Dataset," in ECCV, 2018.
  6. "Ego4D: Around the World in 3,000 Hours of Egocentric Video," in CVPR, 2022.
  7. "Audio Retrieval with Natural Language Queries," in INTERSPEECH, 2021.
  8. "Audio Retrieval with Natural Language Queries: A Benchmark Study," IEEE Transactions on Multimedia, 2022.
  9. M. Slaney, "Semantic-Audio Retrieval," in ICASSP, 2002.
  10. "Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation," in ICASSP, 2023.
  11. OpenAI, "GPT-4 Technical Report," arXiv:2303.08774, 2023.
  12. "Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality," 2023.
  13. "LLaMA: Open and Efficient Foundation Language Models," arXiv:2302.13971, 2023.
  14. "MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models," arXiv:2304.10592, 2023.
  15. "HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face," arXiv:2303.17580, 2023.
  16. "Multi-Modal Classifiers for Open-Vocabulary Object Detection," in ICML, 2023.
  17. S. Menon and C. Vondrick, "Visual Classification via Description from Large Language Models," in ICLR, 2023.
  18. "WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research," arXiv:2303.17395, 2023.
  19. "Learning State-Aware Visual Representations from Audible Interactions," in NeurIPS, 2022.
  20. "Domain Generalization through Audio-Visual Relative Norm Alignment in First Person Action Recognition," in WACV, 2022.
  21. "Egocentric Video-Language Pretraining," in NeurIPS, 2022.
  22. "Learning Video Representations from Large Language Models," in CVPR, 2023.
  23. "HierVL: Learning Hierarchical Video-Language Embeddings," in CVPR, 2023.
  24. "EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone," in ICCV, 2023.
  25. "EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition," in ICCV, 2019.
  26. "EPIC-SOUNDS: A Large-Scale Dataset of Actions that Sound," in ICASSP, 2023.
  27. "Rescaling Egocentric Vision," IJCV, 2022.
  28. K. Järvelin and J. Kekäläinen, "Cumulated Gain-Based Evaluation of IR Techniques," ACM Trans. Inf. Syst., 2002.
  29. "HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection," in ICASSP, 2022.
  30. "Clotho: An Audio Captioning Dataset," in ICASSP, 2020.
  31. "Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings," in ICCV, 2019.
  32. Show Lab, "EgoVLP: A Repository for Ego-centric Vision Language Pre-training," https://github.com/showlab/EgoVLP, 2023, accessed 2023-01-10.
Authors (5)
  1. Andreea-Maria Oncescu (5 papers)
  2. João F. Henriques (55 papers)
  3. Andrew Zisserman (248 papers)
  4. Samuel Albanie (81 papers)
  5. A. Sophia Koepke (22 papers)
Citations (4)