
tinyCLAP: Distilling Contrastive Language-Audio Pretrained Models (2311.14517v3)

Published 24 Nov 2023 in cs.SD, cs.CL, cs.LG, and eess.AS

Abstract: Contrastive Language-Audio Pretraining (CLAP) has become crucially important in the field of audio and speech processing. Its applications range from sound event detection to text-to-audio generation. However, one of its main limitations is the considerable amount of data required for training and the overall computational complexity during inference. This paper investigates how to reduce the complexity of contrastive language-audio pretrained models, yielding an efficient model that we call tinyCLAP. We derive an unimodal distillation loss from first principles and explore how the dimensionality of the shared, multimodal latent space can be reduced via pruning. TinyCLAP uses only 6% of the original Microsoft CLAP parameters with a minimal reduction (less than 5%) in zero-shot classification performance across the three sound event detection datasets on which it was tested.
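The unimodal distillation idea described in the abstract can be illustrated with a small sketch: a compact student audio encoder is trained so that its embeddings align (in cosine similarity) with those of a frozen CLAP teacher, so the student can be dropped into the shared audio-text latent space. This is a minimal NumPy illustration, not the authors' exact objective; the function name and signature are hypothetical.

```python
import numpy as np

def cosine_distill_loss(student_emb, teacher_emb, eps=1e-8):
    """Illustrative unimodal distillation loss: 1 - cos(student, teacher),
    averaged over the batch. The teacher embeddings come from a frozen
    CLAP audio encoder; only the student is trained to minimize this.
    Hypothetical helper, not the paper's verbatim formulation."""
    # L2-normalize each embedding row, guarding against zero norms.
    s = student_emb / (np.linalg.norm(student_emb, axis=1, keepdims=True) + eps)
    t = teacher_emb / (np.linalg.norm(teacher_emb, axis=1, keepdims=True) + eps)
    # Cosine similarity per example, turned into a loss in [0, 2].
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))

# A perfectly aligned student incurs (near-)zero loss.
batch = np.random.default_rng(0).normal(size=(4, 128))
print(cosine_distill_loss(batch, batch))  # ~0.0
```

Because only the audio branch is distilled, the text encoder and the zero-shot classification recipe (comparing audio embeddings against class-prompt text embeddings) are reused unchanged.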

Authors (2)
  1. Francesco Paissan (18 papers)
  2. Elisabetta Farella (13 papers)
Citations (2)
