Learning Temporal Resolution in Spectrogram for Audio Classification (2210.01719v3)

Published 4 Oct 2022 in cs.SD, cs.AI, cs.MM, eess.AS, and eess.SP

Abstract: The audio spectrogram is a time-frequency representation that has been widely used for audio classification. One of the key attributes of the audio spectrogram is the temporal resolution, which depends on the hop size used in the Short-Time Fourier Transform (STFT). Previous works generally assume the hop size should be a constant value (e.g., 10 ms). However, a fixed temporal resolution is not always optimal for different types of sound. The temporal resolution affects not only classification accuracy but also computational cost. This paper proposes a novel method, DiffRes, that enables differentiable temporal resolution modeling for audio classification. Given a spectrogram calculated with a fixed hop size, DiffRes merges non-essential time frames while preserving important frames. DiffRes acts as a "drop-in" module between an audio spectrogram and a classifier and can be jointly optimized with the classification task. We evaluate DiffRes on five audio classification tasks, using mel-spectrograms as the acoustic features, followed by off-the-shelf classifier backbones. Compared with previous methods using a fixed temporal resolution, the DiffRes-based method can achieve equivalent or better classification accuracy with at least a 25% reduction in computational cost. We further show that DiffRes can improve classification accuracy by increasing the temporal resolution of input acoustic features, without adding to the computational cost.
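
The abstract describes DiffRes as a differentiable, drop-in module that sits between the mel-spectrogram and the classifier, merging non-essential time frames while preserving important ones. Below is a minimal PyTorch sketch of that idea; it only illustrates differentiable frame merging under assumed tensor shapes, not the paper's actual DiffRes algorithm, and the module name `DiffResSketch` and its score-to-assignment scheme are hypothetical.

```python
import torch
import torch.nn as nn

class DiffResSketch(nn.Module):
    """Toy drop-in module between a mel-spectrogram and a classifier.

    Learns a per-frame importance score, then softly merges the T input
    frames down to t_out output frames via a differentiable, score-weighted
    assignment. Illustrative only; not the paper's exact frame-merging rule.
    """

    def __init__(self, n_mels: int, t_out: int):
        super().__init__()
        self.t_out = t_out
        # Small 1-D conv net that scores the importance of each time frame.
        self.scorer = nn.Sequential(
            nn.Conv1d(n_mels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(32, 1, kernel_size=1),
        )

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, n_mels, T) mel-spectrogram computed with a fine hop size.
        scores = self.scorer(spec)  # (batch, 1, T)
        # Cumulative "temporal mass": important frames advance the output
        # index faster, so they are merged with fewer neighbours.
        mass = torch.softmax(scores, dim=-1).cumsum(dim=-1) * self.t_out
        # Soft, differentiable assignment of each input frame to each output slot.
        centers = torch.arange(self.t_out, device=spec.device).view(1, -1, 1)
        weight = torch.exp(-(mass - centers - 0.5) ** 2)  # (batch, t_out, T)
        weight = weight / weight.sum(dim=-1, keepdim=True).clamp_min(1e-8)
        # Weighted averaging merges non-essential frames together.
        return torch.einsum("bot,bmt->bmo", weight, spec)  # (batch, n_mels, t_out)

# Usage: 128 mel bins, 1000 input frames compressed to 250 output frames.
x = torch.randn(4, 128, 1000)
pooled = DiffResSketch(n_mels=128, t_out=250)(x)
print(pooled.shape)  # torch.Size([4, 128, 250])
```

Because the soft assignment weights are differentiable with respect to the frame scores, the scorer can be trained jointly with any downstream classifier, which is the joint-optimization property the abstract highlights.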
