Audio Time-Scale Modification with Temporal Compressing Networks (2210.17152v3)

Published 31 Oct 2022 in cs.SD and eess.AS

Abstract: We propose a novel approach to time-scale modification of audio signals. Unlike traditional methods, which rely on framing or the short-time Fourier transform to preserve frequency content during temporal stretching, our neural network model encodes the raw audio into a high-level latent representation, dubbed Neuralgram, in which each vector represents 1024 audio samples. Because the compression ratio is sufficiently high, we can apply arbitrary spatial interpolation to the Neuralgram to perform temporal stretching. A learned neural decoder then synthesizes the time-scaled audio samples from the stretched Neuralgram. Both the encoder and the decoder are trained with latent regression losses and adversarial losses to obtain high-fidelity audio samples. Despite its simplicity, our method performs comparably to existing baselines and opens new possibilities for research into modern time-scale modification. Audio samples can be found at https://tsmnet-mmasia23.github.io
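
The stretching step described in the abstract can be illustrated with a minimal sketch. It covers only the interpolation of a latent sequence along its time axis; the encoder and decoder are not modeled. The function name `stretch_latents`, the `(frames, channels)` layout, the toy input, and the choice of linear interpolation are all assumptions made for illustration, since the abstract says only that arbitrary interpolation can be applied. The one figure taken from the abstract is the 1024 samples per latent vector.

```python
import numpy as np

HOP = 1024  # audio samples represented by one latent vector, per the abstract


def stretch_latents(latents: np.ndarray, rate: float) -> np.ndarray:
    """Resample a (frames, channels) latent sequence along time.

    Hypothetical helper: linear interpolation is an assumption, not the
    paper's confirmed method. rate > 1 lengthens the sequence (slower audio),
    rate < 1 shortens it (faster audio).
    """
    if latents.ndim != 2:
        raise ValueError("expected shape (frames, channels)")
    if rate <= 0:
        raise ValueError("rate must be positive")
    n_in = latents.shape[0]
    n_out = max(1, int(round(n_in * rate)))
    src = np.arange(n_in, dtype=np.float64)
    # Output frame positions expressed in input-frame coordinates.
    dst = np.linspace(0.0, n_in - 1, num=n_out)
    # np.interp is one-dimensional, so each latent channel is handled separately.
    channels = [np.interp(dst, src, latents[:, c]) for c in range(latents.shape[1])]
    return np.stack(channels, axis=1)


# Toy stand-in for an encoder output: 4 frames, 2 channels.
z = np.array([[0.0, 10.0], [1.0, 11.0], [2.0, 12.0], [3.0, 13.0]])
z_slow = stretch_latents(z, 1.75)  # 4 frames -> 7 frames
print(z_slow.shape, z_slow.shape[0] * HOP)  # (7, 2) 7168
```

Under these assumptions, a decoder that emits 1024 samples per latent vector would turn the 7 stretched frames into 7168 output samples, compared with 4096 for the unstretched sequence.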
