Audio Time-Scale Modification with Temporal Compressing Networks (2210.17152v3)
Abstract: We propose a novel approach for time-scale modification of audio signals. Unlike traditional methods that rely on the framing technique or the short-time Fourier transform to preserve the frequency during temporal stretching, our neural network model encodes the raw audio into a high-level latent representation, dubbed Neuralgram, where each vector represents 1024 audio sample points. Because the compression ratio is sufficiently high, we can apply arbitrary spatial interpolation to the Neuralgram to perform temporal stretching. Finally, a learned neural decoder synthesizes the time-scaled audio samples from the stretched Neuralgram representation. Both the encoder and decoder are trained with latent regression losses and adversarial losses in order to obtain high-fidelity audio samples. Despite its simplicity, our method performs comparably to existing baselines and opens a new possibility for research into modern time-scale modification. Audio samples can be found at https://tsmnet-mmasia23.github.io
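The stretching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the trained encoder and decoder are omitted, a random array stands in for the Neuralgram, and linear interpolation is only one possible choice of interpolation. The function name `stretch_neuralgram`, the `rate` convention, and the channel count are hypothetical; only the figure of 1024 samples per vector comes from the abstract.

```python
import numpy as np

HOP = 1024  # audio samples represented by one Neuralgram vector (from the abstract)

def stretch_neuralgram(z: np.ndarray, rate: float) -> np.ndarray:
    """Linearly resample a (frames, channels) latent sequence along time.

    Assumed convention: rate > 1 shortens the sequence (faster playback),
    rate < 1 lengthens it (slower playback).
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    n_in = z.shape[0]
    n_out = max(1, int(round(n_in / rate)))
    # Positions of the output frames, expressed in input-frame coordinates.
    src = np.linspace(0.0, n_in - 1, num=n_out)
    idx = np.arange(n_in)
    # Interpolate each latent channel independently along the time axis.
    return np.stack(
        [np.interp(src, idx, z[:, c]) for c in range(z.shape[1])], axis=1
    )

# Stand-in for an encoder output: 44100 raw samples -> 43 frames.
# The 8 latent channels are a hypothetical size chosen for illustration.
rng = np.random.default_rng(0)
z = rng.standard_normal((44100 // HOP, 8))
z_slow = stretch_neuralgram(z, rate=0.5)
print(z.shape, z_slow.shape)
```

In the pipeline the abstract describes, the stretched array would then be passed to the learned decoder, which synthesizes the time-scaled waveform. Because interpolation happens in the latent domain rather than on frames or STFT bins, the stretch factor can take any positive real value.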