
SuperCodec: A Neural Speech Codec with Selective Back-Projection Network (2407.20530v1)

Published 30 Jul 2024 in cs.SD and eess.AS

Abstract: Neural speech coding is a rapidly developing topic, where state-of-the-art approaches now exhibit better compression performance than conventional methods. Despite significant progress, existing methods still struggle to preserve and reconstruct fine details, especially at low bitrates. In this study, we introduce SuperCodec, a neural speech codec that achieves state-of-the-art performance at low bitrates. It employs a novel back-projection method with selective feature fusion for augmented representation. Specifically, we propose Selective Up-sampling Back Projection (SUBP) and Selective Down-sampling Back Projection (SDBP) modules to replace the standard up-sampling and down-sampling layers in the decoder and encoder, respectively. Experimental results show that our method outperforms existing neural speech codecs across a range of bitrates. In particular, our proposed method achieves higher-quality reconstructed speech at 1 kbps than Lyra V2 at 3.2 kbps and Encodec at 6 kbps.
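
To put the bitrate comparison in concrete terms, the sketch below computes the bitrate of a generic codec that quantizes each latent frame with a stack of codebooks, and the bitrate ratios implied by the abstract's figures. The function name `rvq_bitrate_bps` and the configuration values (16 kHz audio, total stride 320, two codebooks of 1024 entries) are illustrative assumptions chosen to land on 1 kbps; they are not taken from the paper.

```python
import math


def rvq_bitrate_bps(sample_rate_hz, total_stride, num_quantizers, codebook_size):
    """Bitrate in bits per second of a codec that emits one index per
    codebook for every latent frame (hypothetical helper)."""
    frames_per_second = sample_rate_hz / total_stride
    bits_per_frame = num_quantizers * math.log2(codebook_size)
    return frames_per_second * bits_per_frame


# Assumed configuration, NOT from the paper: 16 kHz input, encoder stride 320
# (50 frames/s), two codebooks with 1024 entries each (10 bits per index).
assumed_bitrate = rvq_bitrate_bps(16_000, 320, 2, 1024)

# Bitrates quoted in the abstract, in bits per second.
supercodec_bps = 1_000
lyra_v2_bps = 3_200
encodec_bps = 6_000

print(assumed_bitrate)                # 1000.0
print(lyra_v2_bps / supercodec_bps)   # 3.2
print(encodec_bps / supercodec_bps)   # 6.0
```

Under these assumptions, the abstract's claim amounts to matching or exceeding the quality of the baselines while using 3.2 times fewer bits than Lyra V2 and 6 times fewer than Encodec.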

Authors (4)
  1. Youqiang Zheng (3 papers)
  2. Weiping Tu (17 papers)
  3. Li Xiao (85 papers)
  4. Xinmeng Xu (17 papers)
Citations (2)