
SoundCount: Sound Counting from Raw Audio with Dyadic Decomposition Neural Network (2312.16149v1)

Published 26 Dec 2023 in cs.SD and eess.AS

Abstract: In this paper, we study an underexplored yet important and challenging problem: counting the number of distinct sounds in raw audio characterized by a high degree of polyphonicity. We do so by proposing a novel end-to-end trainable neural network, which we call DyDecNet, consisting of a dyadic decomposition front-end and a backbone network, and by quantifying the difficulty of counting as a function of sound polyphonicity. The dyadic decomposition front-end progressively decomposes the raw waveform dyadically along the frequency axis to obtain a time-frequency representation in a multi-stage, coarse-to-fine manner. Each intermediate waveform convolved by a parent filter is further processed by a pair of child filters that evenly split the parent filter's frequency response, with the higher-half child filter encoding the detail and the lower-half child filter encoding the approximation. We further introduce an energy gain normalization to normalize sound loudness variance and spectrum overlap, and apply it to each intermediate parent waveform before feeding it to the two child filters. To better quantify the difficulty of sound counting, we design three polyphony-aware metrics: polyphony ratio, max polyphony, and mean polyphony. We test DyDecNet on various datasets to show its superiority, and we further show that the dyadic decomposition network can be used as a general front-end for other acoustic tasks.
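
The coarse-to-fine splitting described in the abstract can be illustrated with a small sketch. It only tracks the frequency bands that each filter would cover, not the learned filters themselves. The function name `dyadic_bands`, the band edges, and the number of stages are assumptions chosen for illustration and are not taken from the paper.

```python
def dyadic_bands(f_low, f_high, stages):
    """Split a frequency band in half repeatedly, one level per stage.

    Hypothetical helper: it returns, for each stage, the list of
    (low_edge, high_edge) bands that the filters at that stage would cover.
    """
    levels = [[(f_low, f_high)]]  # stage 0: a single parent band
    for _ in range(stages):
        children = []
        for lo, hi in levels[-1]:
            mid = (lo + hi) / 2.0
            children.append((lo, mid))  # lower-half child: approximation
            children.append((mid, hi))  # higher-half child: detail
        levels.append(children)
    return levels


# Example with assumed values: a 0 to 8000 Hz band split over 3 stages.
levels = dyadic_bands(0.0, 8000.0, 3)
for stage, bands in enumerate(levels):
    print(stage, len(bands), bands[0])
```

Each stage doubles the number of bands and halves their width, so a front-end built this way would yield 2^k frequency channels after k stages.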

