Study of positional encoding approaches for Audio Spectrogram Transformers (2110.06999v1)

Published 13 Oct 2021 in cs.SD, cs.LG, and eess.AS

Abstract: Transformers have revolutionized the world of deep learning, especially in the field of natural language processing. Recently, the Audio Spectrogram Transformer (AST) was proposed for audio classification, leading to state-of-the-art results on several datasets. However, for ASTs to outperform CNNs, pretraining with ImageNet is needed. In this paper, we study one component of the AST, the positional encoding, and propose several variants to improve the performance of ASTs trained from scratch, without ImageNet pretraining. Our best model, which incorporates conditional positional encodings, significantly improves performance on AudioSet and ESC-50 compared to the original AST.
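
For context, conditional positional encodings replace fixed or learned position embeddings with positions generated on the fly by a lightweight depthwise convolution over the 2D patch grid, as introduced for vision transformers by Chu et al. (CPVT, 2021). Below is a minimal PyTorch sketch of that idea applied to spectrogram patch tokens; it is not the authors' exact implementation. The module name, kernel size, and the 12x101 (frequency x time) patch grid are illustrative assumptions, and a class token, if present, would need to be handled separately.

```python
import torch
import torch.nn as nn

class ConditionalPositionalEncoding(nn.Module):
    """Sketch of a CPVT-style positional encoding generator: a depthwise
    convolution over the patch grid whose output is added back to the
    tokens, so position information is inferred from local neighborhoods
    rather than from a fixed embedding table."""

    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.proj = nn.Conv2d(
            dim, dim, kernel_size,
            padding=kernel_size // 2,
            groups=dim,  # depthwise: one filter per channel
        )

    def forward(self, tokens: torch.Tensor, grid_hw: tuple) -> torch.Tensor:
        # tokens: (batch, num_patches, dim); grid_hw: patch grid (H, W).
        # Assumes tokens are patch tokens only (no class token).
        b, n, c = tokens.shape
        h, w = grid_hw
        assert n == h * w, "token count must match the patch grid"
        x = tokens.transpose(1, 2).reshape(b, c, h, w)  # back to a 2D map
        x = self.proj(x) + x                            # conv-derived positions, residual
        return x.flatten(2).transpose(1, 2)             # (batch, num_patches, dim)

# Hypothetical usage: a 12x101 patch grid from a spectrogram, embedding dim 768.
cpe = ConditionalPositionalEncoding(dim=768)
tokens = torch.randn(2, 12 * 101, 768)
out = cpe(tokens, (12, 101))
print(out.shape)  # torch.Size([2, 1212, 768])
```

Because the positions come from a convolution rather than a fixed-size table, this scheme adapts to variable input lengths, which is one reason it is attractive for audio spectrograms of differing durations.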

Authors (3)
  1. Leonardo Pepino (11 papers)
  2. Pablo Riera (11 papers)
  3. Luciana Ferrer (33 papers)
Citations (5)
