An Empirical Study on the Impact of Positional Encoding in Transformer-based Monaural Speech Enhancement (2401.09686v2)
Abstract: Transformer architectures have enabled recent progress in speech enhancement. Since Transformers are position-agnostic, positional encoding is the de facto standard component used to enable Transformers to distinguish the order of elements in a sequence. However, it remains unclear how exactly positional encoding impacts speech enhancement based on Transformer architectures. In this paper, we perform a comprehensive empirical study evaluating five positional encoding methods, i.e., sinusoidal and learned absolute position embeddings (APEs), T5-RPE, KERPLE, and the Transformer without positional encoding (No-Pos), across both causal and noncausal configurations. We conduct extensive speech enhancement experiments covering spectral mapping and masking methods. Our findings establish that positional encoding offers little benefit in the causal configuration, which indicates that causal attention may implicitly incorporate position information. In the noncausal configuration, the models benefit significantly from positional encoding. In addition, we find that among the four position embeddings, relative position embeddings outperform APEs.
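To make the two families of encodings compared in the abstract concrete, the sketch below shows, in PyTorch, a sinusoidal APE added to the input frames and a relative position bias added to the attention logits. This is an illustrative sketch, not the paper's implementation: the function and class names and hyperparameters (e.g., `max_distance=128`) are assumptions, and the clamped-distance lookup is a simplification of T5-RPE's log-spaced bucketing (KERPLE instead derives the bias from a learned kernel of the distance).

```python
import math
import torch


def sinusoidal_ape(seq_len: int, d_model: int) -> torch.Tensor:
    """Sinusoidal absolute position embedding (Vaswani et al. style).

    Returns a (seq_len, d_model) table that is added to the frame
    features once, before the first Transformer layer.
    """
    assert d_model % 2 == 0, "sketch assumes an even model width"
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (T, 1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32)
        * (-math.log(10000.0) / d_model)
    )  # (d_model / 2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe


class RelativePositionBias(torch.nn.Module):
    """Simplified T5-style relative position bias.

    A learned per-head scalar, indexed by the signed distance j - i
    between key frame j and query frame i, is added to the raw
    attention logits in every layer. T5-RPE proper maps distances to
    log-spaced buckets; clamping is used here only to keep the sketch
    short.
    """

    def __init__(self, num_heads: int, max_distance: int = 128):
        super().__init__()
        self.max_distance = max_distance
        # One bias per head for each distance in [-max_distance, max_distance].
        self.table = torch.nn.Embedding(2 * max_distance + 1, num_heads)

    def forward(self, q_len: int, k_len: int) -> torch.Tensor:
        rel = torch.arange(k_len)[None, :] - torch.arange(q_len)[:, None]  # (Tq, Tk)
        rel = rel.clamp(-self.max_distance, self.max_distance) + self.max_distance
        return self.table(rel).permute(2, 0, 1)  # (heads, Tq, Tk), added to logits


# Example: 100 STFT frames, model width 256, 8 attention heads.
x = torch.randn(100, 256)
x = x + sinusoidal_ape(100, 256)          # APE: injected once at the input
bias = RelativePositionBias(num_heads=8)
logit_bias = bias(q_len=100, k_len=100)   # RPE: injected into every attention layer
```

In the causal configuration, the attention logits are additionally masked so that frame i attends only to frames j ≤ i. That mask alone breaks permutation invariance, which is consistent with the paper's finding that No-Pos remains competitive in the causal setting.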
Authors: Qiquan Zhang, Meng Ge, Hongxu Zhu, Eliathamby Ambikairajah, Qi Song, Zhaoheng Ni, Haizhou Li