Hyperbolic Distance-Based Speech Separation (2401.03567v1)

Published 7 Jan 2024 in eess.AS and cs.SD

Abstract: In this work, we explore the task of hierarchical distance-based speech separation defined on a hyperbolic manifold. Based on the recent advent of audio-related tasks performed in non-Euclidean spaces, we propose to make use of the Poincaré ball to effectively unveil the inherent hierarchical structure found in complex speaker mixtures. We design two sets of experiments in which the distance-based parent sound classes, namely "near" and "far", can contain up to two or three speakers (i.e., children) each. We show that our hyperbolic approach is suitable for unveiling hierarchical structure from the problem definition, resulting in improved child-level separation. We further show that a clear correlation emerges between the notion of hyperbolic certainty (i.e., the distance to the ball's origin) and acoustic semantics such as speaker density, inter-source location, and microphone-to-speaker distance.
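
The "hyperbolic certainty" mentioned in the abstract is the geodesic distance of an embedding from the origin of the Poincaré ball: points pushed toward the boundary are treated as more certain. As a minimal illustrative sketch (not the paper's code), the snippet below computes that distance in PyTorch under the standard curvature -c convention; the function name and toy values are invented for illustration.

```python
import torch

def poincare_distance_to_origin(x: torch.Tensor, c: float = 1.0) -> torch.Tensor:
    """Geodesic distance from the origin of a Poincare ball of curvature -c.

    Closed form: d(0, x) = (2 / sqrt(c)) * artanh(sqrt(c) * ||x||),
    valid for ||x|| < 1 / sqrt(c). Points near the boundary get large
    distances, which the abstract interprets as higher hyperbolic certainty.
    """
    sqrt_c = c ** 0.5
    # Clamp norms to stay strictly inside the ball and avoid atanh(1) = inf.
    norm = x.norm(p=2, dim=-1).clamp(max=(1.0 - 1e-5) / sqrt_c)
    return (2.0 / sqrt_c) * torch.atanh(sqrt_c * norm)

# Toy usage: an embedding near the origin (ambiguous) vs. near the boundary (confident).
emb = torch.tensor([[0.05, 0.00],
                    [0.90, 0.10]])
print(poincare_distance_to_origin(emb))  # ~0.10 vs. ~3.0
```

Because hyperbolic distance grows without bound as points approach the ball boundary, this single scalar can encode the correlations the paper reports with speaker density and microphone-to-speaker distance.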
