Style Transfer for Non-differentiable Audio Effects (2309.17125v1)
Abstract: Digital audio effects are widely used by audio engineers to alter the acoustic and temporal qualities of audio data. However, these effects can have a large number of parameters, which can make them difficult to learn for beginners and hamper creativity for professionals. Recently, there have been a number of efforts to employ progress in deep learning to acquire the low-level parameter configurations of audio effects by minimising an objective function between an input and a reference track, commonly referred to as style transfer. However, current approaches use inflexible black-box techniques or require that the effects under consideration be implemented in an auto-differentiation framework. In this work, we propose a deep learning approach to audio production style matching which can be used with effects implemented in some of the most widely used frameworks, requiring only that the parameters under consideration have a continuous domain. Further, our method includes style matching for various classes of effects, many of which are difficult or impossible to approximate closely with differentiable functions. We show that our audio embedding approach creates logical encodings of timbral information, which can be used for a number of downstream tasks. Finally, we perform a listening test which demonstrates that our approach is able to convincingly style match a multi-band compressor effect.
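The core idea described in the abstract, tuning continuous effect parameters toward a reference without gradients from the effect itself, can be illustrated with a gradient-free optimiser. The sketch below is only a hedged illustration of that general setting, not the paper's actual networks or effect implementations: a toy non-differentiable effect (a hard clipper with a drive parameter) is tuned with simultaneous-perturbation (SPSA-style) gradient estimates so that a simple spectral "embedding" of the processed input approaches that of a reference. The effect, the embedding, and every hyper-parameter are stand-ins chosen for the example.

```python
import numpy as np

# Hedged illustration only: the effect, the "embedding", and the hyper-parameters
# below are stand-ins invented for this sketch, not the paper's setup.

def effect(x, params):
    """Toy non-differentiable effect: a drive gain followed by hard clipping."""
    drive = params[0]
    return np.clip(drive * x, -1.0, 1.0)

def embedding(x):
    """Toy timbral embedding: log-magnitude spectrum (stand-in for a learned encoder)."""
    return np.log1p(np.abs(np.fft.rfft(x)))

def loss(params, x_in, x_ref):
    """Distance between the embeddings of the processed input and the reference."""
    return np.mean((embedding(effect(x_in, params)) - embedding(x_ref)) ** 2)

def spsa_step(params, k, x_in, x_ref, rng, a=0.2, c=0.05):
    """One SPSA update (Spall, 1998) with standard decaying gain sequences."""
    a_k = a / (k + 1) ** 0.602
    c_k = c / (k + 1) ** 0.101
    delta = rng.choice(np.array([-1.0, 1.0]), size=params.shape)
    g_hat = (loss(params + c_k * delta, x_in, x_ref)
             - loss(params - c_k * delta, x_in, x_ref)) / (2.0 * c_k * delta)
    return params - a_k * g_hat

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.linspace(0.0, 1.0, 4096, endpoint=False)
    x_in = 0.3 * np.sin(2.0 * np.pi * 220.0 * t)   # clean input track
    x_ref = effect(x_in, np.array([4.0]))          # reference rendered with an unknown drive
    params = np.array([1.0])                       # initial parameter guess
    for k in range(500):
        params = spsa_step(params, k, x_in, x_ref, rng)
    print(f"estimated drive: {params[0]:.2f}, final loss: {loss(params, x_in, x_ref):.4f}")
```

The only requirement this sketch places on the effect is the one stated in the abstract: its parameters live in a continuous domain, so finite-difference-style perturbations are meaningful even though the effect itself is treated as a black box.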