Clotho: An Audio Captioning Dataset
The paper presents "Clotho," a dataset designed specifically for the task of audio captioning: generating natural language descriptions of general audio content. The task poses challenges distinct from other captioning tasks such as image captioning, owing to the ambiguity inherent in how audio is perceived and interpreted. Systems must capture human-perceived information such as sound events, acoustic scenes, and spatiotemporal relationships between sound sources without relying on visual cues.
The Clotho dataset comprises 4,981 audio samples, each 15 to 30 seconds long, sourced from the Freesound platform. Each audio clip is paired with five captions gathered through a three-step crowdsourcing protocol on Amazon Mechanical Turk, designed to ensure diversity and accuracy. This yields a total of 24,905 free-text captions, each between eight and twenty words long.
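To make the scale of the collection concrete, the following Python sketch tallies clips and captions from one split's caption file. It assumes the CSV layout of the public Clotho release (a file_name column plus caption_1 through caption_5); the file name used here is illustrative and may differ from your local copy.

```python
import csv
from collections import defaultdict

# Hypothetical path; the public Clotho release distributes captions as CSV files
# with one row per audio clip and columns file_name, caption_1 ... caption_5.
CAPTIONS_CSV = "clotho_captions_development.csv"

captions_per_clip = defaultdict(list)
with open(CAPTIONS_CSV, newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        for key in ("caption_1", "caption_2", "caption_3", "caption_4", "caption_5"):
            captions_per_clip[row["file_name"]].append(row[key])

# Sanity-check the figures quoted above: five captions per clip,
# each between eight and twenty words long.
for clip, caps in captions_per_clip.items():
    assert len(caps) == 5, f"{clip} has {len(caps)} captions"
    for cap in caps:
        n_words = len(cap.split())
        assert 8 <= n_words <= 20, f"{clip}: caption with {n_words} words"

total_captions = sum(len(c) for c in captions_per_clip.values())
print(f"{len(captions_per_clip)} clips, {total_captions} captions")
```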
Clotho addresses several limitations observed in existing datasets, such as Audio Caption and AudioCaps. For instance, Clotho avoids perceptual bias by ensuring annotators rely solely on audio cues without supplemental contextual information like video or predefined tags. Moreover, the dataset enhances the diversity of descriptions by allowing multiple captions per audio clip, which provides a more comprehensive foundation for training and evaluating audio captioning models.
A significant methodological strength of the dataset is its treatment of rare words and named entities, which keeps the distribution of word occurrences conducive to both training and evaluation. Each audio sample is assigned to exactly one split (development, evaluation, or testing), and the splits are validated so that every word appearing in the evaluation and testing captions also appears in the development set. This rigor addresses pitfalls that sparse vocabularies commonly cause in machine learning tasks.
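The practical consequence of this split design can be checked directly. The sketch below (reusing the hypothetical CSV file names from the earlier example) collects each split's caption vocabulary and reports words that occur in the evaluation captions but never in the development captions, which is exactly the situation the split validation is meant to rule out.

```python
import csv
import re

CAPTION_COLUMNS = ("caption_1", "caption_2", "caption_3", "caption_4", "caption_5")

def split_vocabulary(csv_path):
    """Collect the set of lowercased word tokens used in one split's captions."""
    vocab = set()
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            for key in CAPTION_COLUMNS:
                vocab.update(re.findall(r"[a-z']+", row[key].lower()))
    return vocab

# Hypothetical file names for the development and evaluation splits.
dev_vocab = split_vocabulary("clotho_captions_development.csv")
eval_vocab = split_vocabulary("clotho_captions_evaluation.csv")

# Words present only in evaluation captions would be impossible for a model
# trained on the development split to produce; the split procedure aims to
# keep this set empty.
unseen = eval_vocab - dev_vocab
print(f"evaluation-only words: {len(unseen)}")
```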
Baseline evaluations in the paper use an encoder-decoder model with attention, previously employed in audio captioning research. The results, reported with standard metrics such as BLEU, METEOR, CIDEr, and ROUGE-L, establish a performance benchmark. Scores such as BLEU-1 show that the model recognizes words relevant to the audio content, although sentence structure and fluency leave room for improvement. This suggests that Clotho is a workable foundation for developing more sophisticated audio captioning models.
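As an illustration of how such word-overlap metrics behave, the following sketch scores a made-up candidate caption against five made-up reference captions using NLTK's sentence-level BLEU. This is not the evaluation toolkit used in the paper, and the captions are invented for the example; it only shows why unigram BLEU-1 can be reasonable while higher-order BLEU, which is sensitive to phrase structure, drops.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Invented references mimicking five Clotho-style captions for one clip.
references = [
    "a crowd of people are talking in a large room".split(),
    "many people speak at the same time in a hall".split(),
    "people chatter loudly inside an echoing space".split(),
    "a group of people talk over each other indoors".split(),
    "several voices overlap in a reverberant room".split(),
]
# Invented system output: the right words, but a terse, weakly structured sentence.
candidate = "people are talking in a room".split()

smooth = SmoothingFunction().method1
bleu1 = sentence_bleu(references, candidate, weights=(1.0, 0, 0, 0),
                      smoothing_function=smooth)
bleu4 = sentence_bleu(references, candidate, weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=smooth)
print(f"BLEU-1: {bleu1:.3f}  BLEU-4: {bleu4:.3f}")
```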
The introduction of Clotho is likely to drive theoretical advancements within audio captioning by providing a resource that improves on prior datasets in terms of diversity and methodological rigor. Practically, developments in audio captioning may find applications in assistive technologies, automated content indexing, and multimedia search and retrieval systems.
Future work will likely focus on refining audio captioning techniques, addressing challenges such as the grammatical coherence and structure of generated sentences. The dataset invites renewed exploration of methods that combine signal processing with advanced natural language processing to bridge the gap between the audio and textual modalities.