Clotho: An Audio Captioning Dataset
The paper presents "Clotho," a dataset designed specifically for the task of audio captioning: generating natural language descriptions of general audio content. The task poses challenges distinct from other captioning tasks such as image captioning, owing to the ambiguity inherent in how audio is perceived and interpreted. Systems must capture human-perceived information such as sound events, acoustic scenes, and spatiotemporal relationships between sound sources without relying on visual cues.
The Clotho dataset comprises 4,981 audio samples, each 15 to 30 seconds long, sourced from the Freesound platform. Each audio clip is paired with five captions gathered through a three-step crowdsourcing protocol on Amazon Mechanical Turk, designed to ensure diversity and accuracy. This yields a total of 24,905 free-text captions, each between eight and twenty words long.
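To make the scale of the collection concrete, the following Python sketch tallies clips and captions from one split's caption file. It assumes the CSV layout of the public Clotho release (a file_name column plus caption_1 through caption_5); the file name used here is illustrative and may differ from your local copy.

```python
import csv
from collections import defaultdict

# Hypothetical path; the public Clotho release distributes captions as CSV files
# with one row per audio clip and columns file_name, caption_1 ... caption_5.
CAPTIONS_CSV = "clotho_captions_development.csv"

captions_per_clip = defaultdict(list)
with open(CAPTIONS_CSV, newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        for key in ("caption_1", "caption_2", "caption_3", "caption_4", "caption_5"):
            captions_per_clip[row["file_name"]].append(row[key])

# Sanity-check the figures quoted above: five captions per clip,
# each between eight and twenty words long.
for clip, caps in captions_per_clip.items():
    assert len(caps) == 5, f"{clip} has {len(caps)} captions"
    for cap in caps:
        n_words = len(cap.split())
        assert 8 <= n_words <= 20, f"{clip}: caption with {n_words} words"

total_captions = sum(len(c) for c in captions_per_clip.values())
print(f"{len(captions_per_clip)} clips, {total_captions} captions")
```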
Clotho addresses several limitations observed in existing datasets, such as Audio Caption and AudioCaps. For instance, Clotho avoids perceptual bias by ensuring annotators rely solely on audio cues without supplemental contextual information like video or predefined tags. Moreover, the dataset enhances the diversity of descriptions by allowing multiple captions per audio clip, which provides a more comprehensive foundation for training and evaluating audio captioning models.
A significant methodological strength of the dataset is its treatment of rare words and named entities, which keeps the distribution of word occurrences conducive to both training and evaluation. Each audio sample is assigned to exactly one split (development, evaluation, or testing), and the splits are validated so that every word appearing in the evaluation and testing captions also appears in the development set. This rigor addresses pitfalls that sparse vocabularies commonly cause in machine learning tasks.
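The practical consequence of this split design can be checked directly. The sketch below (reusing the hypothetical CSV file names from the earlier example) collects each split's caption vocabulary and reports words that occur in the evaluation captions but never in the development captions, which is exactly the situation the split validation is meant to rule out.

```python
import csv
import re

CAPTION_COLUMNS = ("caption_1", "caption_2", "caption_3", "caption_4", "caption_5")

def split_vocabulary(csv_path):
    """Collect the set of lowercased word tokens used in one split's captions."""
    vocab = set()
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            for key in CAPTION_COLUMNS:
                vocab.update(re.findall(r"[a-z']+", row[key].lower()))
    return vocab

# Hypothetical file names for the development and evaluation splits.
dev_vocab = split_vocabulary("clotho_captions_development.csv")
eval_vocab = split_vocabulary("clotho_captions_evaluation.csv")

# Words present only in evaluation captions would be impossible for a model
# trained on the development split to produce; the split procedure aims to
# keep this set empty.
unseen = eval_vocab - dev_vocab
print(f"evaluation-only words: {len(unseen)}")
```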
Baseline evaluations in the paper use an encoder-decoder model with attention, previously employed in audio captioning research. The results, reported with standard metrics such as BLEU, METEOR, CIDEr, and ROUGE-L, establish a performance benchmark. Scores such as BLEU-1 show that the model recognizes words relevant to the audio content, although sentence structure and fluency leave room for improvement. This suggests that Clotho is a workable foundation for developing more sophisticated audio captioning models.
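As an illustration of how such word-overlap metrics behave, the following sketch scores a made-up candidate caption against five made-up reference captions using NLTK's sentence-level BLEU. This is not the evaluation toolkit used in the paper, and the captions are invented for the example; it only shows why unigram BLEU-1 can be reasonable while higher-order BLEU, which is sensitive to phrase structure, drops.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Invented references mimicking five Clotho-style captions for one clip.
references = [
    "a crowd of people are talking in a large room".split(),
    "many people speak at the same time in a hall".split(),
    "people chatter loudly inside an echoing space".split(),
    "a group of people talk over each other indoors".split(),
    "several voices overlap in a reverberant room".split(),
]
# Invented system output: the right words, but a terse, weakly structured sentence.
candidate = "people are talking in a room".split()

smooth = SmoothingFunction().method1
bleu1 = sentence_bleu(references, candidate, weights=(1.0, 0, 0, 0),
                      smoothing_function=smooth)
bleu4 = sentence_bleu(references, candidate, weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=smooth)
print(f"BLEU-1: {bleu1:.3f}  BLEU-4: {bleu4:.3f}")
```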
The introduction of Clotho is likely to drive theoretical advancements within audio captioning by providing a resource that improves on prior datasets in terms of diversity and methodological rigor. Practically, developments in audio captioning may find applications in assistive technologies, automated content indexing, and multimedia search and retrieval systems.
Future work will likely focus on refining audio captioning techniques, addressing challenges such as the grammatical coherence and structure of generated sentences. The dataset invites renewed exploration of methods that combine signal processing with advanced natural language processing to bridge the gap between the audio and textual modalities.