Expressive Acoustic Guitar Sound Synthesis with an Instrument-Specific Input Representation and Diffusion Outpainting (2401.13498v1)
Abstract: Synthesizing the sound of a performing guitar is highly challenging due to its polyphony and high expressive variability. Recently, deep generative models have shown promising results in synthesizing expressive polyphonic instrument sounds from music scores, often using a generic MIDI input. In this work, we propose an expressive acoustic guitar sound synthesis model with an input representation customized to the instrument, which we call guitarroll. We implement the proposed approach using diffusion-based outpainting, which can generate audio with long-term consistency. To overcome the lack of MIDI/audio-paired datasets, we used not only an existing guitar dataset but also data collected from a high-quality sample-based guitar synthesizer. Through quantitative and qualitative evaluations, we show that our proposed model achieves higher audio quality than the baseline model and generates sounds with more realistic timbre than the previous leading work.
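The abstract does not spell out the guitarroll format, so the following is a minimal sketch, assuming a piano-roll-like grid with one row per guitar string whose cell value is the sounding MIDI pitch per frame. All names and parameters here (`make_guitarroll`, `FRAMES_PER_SECOND`, the silence value) are illustrative, not the paper's actual implementation.

```python
import numpy as np

STANDARD_TUNING = [40, 45, 50, 55, 59, 64]  # open-string MIDI pitches: E2 A2 D3 G3 B3 E4
FRAMES_PER_SECOND = 100
SILENCE = 0  # assumed value for frames where a string is not sounding

def make_guitarroll(notes, duration_sec):
    """Render (string_idx, onset_sec, offset_sec, midi_pitch) note events
    into a 6 x T roll: one row per string, cell value = MIDI pitch."""
    n_frames = int(duration_sec * FRAMES_PER_SECOND)
    roll = np.full((6, n_frames), SILENCE, dtype=np.int16)
    for string_idx, onset, offset, pitch in notes:
        start = int(onset * FRAMES_PER_SECOND)
        end = min(int(offset * FRAMES_PER_SECOND), n_frames)
        roll[string_idx, start:end] = pitch
    return roll

# Example: an open E2 on the lowest string, then a G3 on string 3
roll = make_guitarroll([(0, 0.0, 1.0, 40), (3, 0.5, 1.5, 55)], duration_sec=2.0)
print(roll.shape)  # (6, 200)
```

Likewise, the diffusion outpainting described in the abstract can be pictured as chaining fixed-length segments whose overlapping regions are held fixed during denoising, which is what gives long-term consistency across segment boundaries. `model.sample`, `SEGMENT`, and `OVERLAP` below are hypothetical stand-ins for the paper's interface.

```python
import numpy as np

SEGMENT = 512   # frames per generated segment (assumed)
OVERLAP = 128   # frames of the previous segment reused as fixed context (assumed)

def outpaint(model, guitarroll_segments):
    """Generate long audio latents segment by segment; each new segment is
    conditioned on its guitarroll slice and overlaps the previous segment."""
    outputs, prefix = [], None
    for cond in guitarroll_segments:
        seg = model.sample(cond=cond, prefix=prefix)  # shape: (SEGMENT, dim)
        # keep only the newly generated part after the fixed overlap region
        outputs.append(seg if prefix is None else seg[OVERLAP:])
        prefix = seg[-OVERLAP:]  # tail becomes the fixed prefix of the next segment
    return np.concatenate(outputs, axis=0)

class DummyModel:  # stand-in so the sketch runs end to end
    def sample(self, cond, prefix):
        seg = np.random.randn(SEGMENT, 16).astype(np.float32)
        if prefix is not None:
            seg[:OVERLAP] = prefix  # the fixed region stays identical to the context
        return seg

latents = outpaint(DummyModel(), guitarroll_segments=[None] * 4)
print(latents.shape)  # (1664, 16) = SEGMENT + 3 * (SEGMENT - OVERLAP)
```

Holding the overlap fixed (rather than crossfading independently generated segments) is the design choice that lets each new segment stay coherent with what came before.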
- Hounsu Kim
- Soonbeom Choi
- Juhan Nam