BANC: Towards Efficient Binaural Audio Neural Codec for Overlapping Speech (2309.07416v4)
Abstract: We introduce BANC, a neural binaural audio codec designed for efficient speech compression in single- and two-speaker scenarios while preserving the spatial location of each speaker. Our key contributions are as follows: 1) our proposed model can compress and decode overlapping speech; 2) a novel architecture compresses speech content and spatial cues separately, ensuring that each speaker's spatial context is preserved after decoding; 3) BANC reduces the bandwidth required to compress binaural speech by 48% compared to compressing each binaural channel independently. In our evaluation, we employ speech enhancement, room acoustics, and perceptual metrics to assess the accuracy of BANC's clean-speech and spatial-cue estimates.
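To make the second contribution concrete, the sketch below illustrates the general dual-stream idea the abstract describes: one branch encodes speech content from the mixed signal, a second branch encodes binaural spatial cues from the two channels, each branch is quantized to discrete codes, and a decoder reconstructs two output channels. This is a minimal illustration under assumed design choices; the module names, layer sizes, simple nearest-neighbour quantizer, and `BinauralCodecSketch` class are hypothetical and do not reflect BANC's actual architecture or training losses.

```python
# Hedged sketch of a dual-stream binaural codec: separate content and spatial
# branches with discrete codes. All sizes and modules are illustrative only.
import torch
import torch.nn as nn


class SimpleVQ(nn.Module):
    """Nearest-neighbour vector quantizer (straight-through gradient omitted)."""

    def __init__(self, codebook_size: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z):                                  # z: (batch, dim, frames)
        flat = z.transpose(1, 2).reshape(-1, z.size(1))    # (batch*frames, dim)
        dists = torch.cdist(flat, self.codebook.weight)    # distances to codewords
        codes = dists.argmin(dim=-1).view(z.size(0), z.size(2))   # (batch, frames)
        zq = self.codebook(codes).transpose(1, 2)          # back to (batch, dim, frames)
        return zq, codes


class BinauralCodecSketch(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        # Content branch: operates on the mono mixdown of the binaural input.
        self.content_enc = nn.Conv1d(1, dim, kernel_size=320, stride=160)
        # Spatial branch: sees both channels so it can capture interaural cues.
        self.spatial_enc = nn.Conv1d(2, dim, kernel_size=320, stride=160)
        self.content_vq = SimpleVQ(1024, dim)
        self.spatial_vq = SimpleVQ(256, dim)
        # Decoder maps the concatenated streams back to two output channels.
        self.decoder = nn.ConvTranspose1d(2 * dim, 2, kernel_size=320, stride=160)

    def forward(self, binaural):                           # binaural: (batch, 2, samples)
        mono = binaural.mean(dim=1, keepdim=True)
        zc, content_codes = self.content_vq(self.content_enc(mono))
        zs, spatial_codes = self.spatial_vq(self.spatial_enc(binaural))
        recon = self.decoder(torch.cat([zc, zs], dim=1))
        return recon, content_codes, spatial_codes


if __name__ == "__main__":
    x = torch.randn(1, 2, 16000)                           # 1 s of binaural audio at 16 kHz
    recon, content_codes, spatial_codes = BinauralCodecSketch()(x)
    print(recon.shape, content_codes.shape, spatial_codes.shape)
```

Sharing a single content stream across both channels while sending only a small spatial-cue stream is the intuition behind the claimed bandwidth reduction relative to coding each binaural channel independently.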