Example-Based Framework for Perceptually Guided Audio Texture Generation (2308.11859v2)
Abstract: Controllable generation with StyleGANs is usually achieved by training the model on labeled data. For audio textures, however, large semantically labeled datasets are currently lacking, so we develop a method for semantic control over an unconditionally trained StyleGAN that does not require such labels. Specifically, we propose an example-based framework that determines guidance vectors for audio texture generation based on user-defined semantic attributes. Our approach leverages the semantically disentangled latent space of an unconditionally trained StyleGAN: using a few synthetic examples that indicate the presence or absence of a semantic attribute, we infer guidance vectors in the StyleGAN's latent space that control the attribute during generation. Our results show that the framework finds user-defined, perceptually relevant guidance vectors for controllable audio texture generation. We further demonstrate an application of the framework to other tasks, such as selective semantic attribute transfer.
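
As a rough illustration of the example-based idea (not the paper's exact procedure), the sketch below estimates a guidance direction as the difference between the mean latent codes of a few "attribute present" and "attribute absent" examples, then shifts a latent code along that direction before synthesis. The 512-dimensional latent size, the mean-difference estimator, and all function names here are assumptions made for illustration only.

```python
import numpy as np

def guidance_vector(w_with: np.ndarray, w_without: np.ndarray) -> np.ndarray:
    """Estimate a guidance direction in a StyleGAN-style latent space.

    w_with:    (N, D) latent codes of examples where the attribute is present
    w_without: (M, D) latent codes of examples where the attribute is absent
    Returns a unit-norm direction pointing from 'absent' toward 'present'.
    """
    direction = w_with.mean(axis=0) - w_without.mean(axis=0)
    return direction / (np.linalg.norm(direction) + 1e-8)

def edit_latent(w: np.ndarray, direction: np.ndarray, strength: float) -> np.ndarray:
    """Move a latent code along the guidance direction by 'strength'."""
    return w + strength * direction

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 512                                  # assumed latent dimensionality
    w_pos = rng.normal(size=(8, d))          # stand-ins for latents of examples with the attribute
    w_neg = rng.normal(size=(8, d))          # stand-ins for latents of examples without it
    v = guidance_vector(w_pos, w_neg)
    w_edit = edit_latent(rng.normal(size=d), v, strength=2.0)
    # In practice, w_edit would be passed through the trained StyleGAN
    # synthesis network to generate audio with the attribute strengthened.
```

A usage note: in a real setting the placeholder latents above would be obtained by encoding or sampling synthetic examples through the trained model, and the scalar strength would be swept to vary how strongly the attribute appears in the generated texture.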