
WaveTransformer: A Novel Architecture for Audio Captioning Based on Learning Temporal and Time-Frequency Information (2010.11098v1)

Published 21 Oct 2020 in cs.SD, cs.LG, cs.MM, and eess.AS

Abstract: Automated audio captioning (AAC) is a novel task, where a method takes as an input an audio sample and outputs a textual description (i.e. a caption) of its contents. Most AAC methods are adapted from the image captioning or machine translation fields. In this work we present a novel AAC method, explicitly focused on the exploitation of the temporal and time-frequency patterns in audio. We employ three learnable processes for audio encoding: two for extracting the local and temporal information, and one to merge the output of the previous two processes. To generate the caption, we employ the widely used Transformer decoder. We assess our method utilizing the freely available splits of the Clotho dataset. Our results increase the previously reported highest SPIDEr to 17.3, from 16.2.
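
To put the reported improvement in perspective, the short sketch below works out the absolute and relative gain from the two SPIDEr scores quoted in the abstract. The variable names are illustrative only and are not taken from the paper; the two scores are the only inputs.

```python
# SPIDEr scores exactly as quoted in the abstract.
previous_best = 16.2   # previously reported highest SPIDEr
wavetransformer = 17.3  # SPIDEr reported for the proposed method

# Absolute gain in SPIDEr points, and gain relative to the previous best.
absolute_gain = wavetransformer - previous_best
relative_gain = absolute_gain / previous_best * 100

print(f"absolute gain: {absolute_gain:.1f} points")  # absolute gain: 1.1 points
print(f"relative gain: {relative_gain:.1f}%")        # relative gain: 6.8%
```

That is, the reported gain is 1.1 SPIDEr points, or roughly 6.8% relative to the earlier best result.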

Authors (3)
  1. An Tran (5 papers)
  2. Konstantinos Drossos (44 papers)
  3. Tuomas Virtanen (112 papers)
Citations (19)
