SMITIN: Self-Monitored Inference-Time INtervention for Generative Music Transformers (2404.02252v2)
Abstract: We introduce Self-Monitored Inference-Time INtervention (SMITIN), an approach for controlling an autoregressive generative music transformer using classifier probes. These simple logistic regression probes are trained on the output of each attention head in the transformer using a small dataset of audio examples that either exhibit or lack a specific musical trait (e.g., the presence/absence of drums, or real/synthetic music). We then steer the attention heads in the probe direction, ensuring the generative model output captures the desired musical trait. Additionally, we monitor the probe output to avoid injecting an excessive amount of intervention into the autoregressive generation, which could lead to temporally incoherent music. We validate our results objectively and subjectively for both audio continuation and text-to-music applications, demonstrating the ability to add controls to large generative models for which retraining or even fine-tuning is impractical for most musicians. Audio samples of the proposed intervention approach are available on our demo page http://tinyurl.com/smitin .
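The core mechanism described above (train a logistic-regression probe on per-head activations, then nudge activations along the probe direction while monitoring the probe score) can be illustrated with a minimal NumPy sketch. All names, dimensions, thresholds, and the toy data here are our own assumptions for illustration, not the paper's actual implementation or hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: activations of one attention head (dim d) for audio
# examples that exhibit vs. lack the target trait (e.g., drums present/absent).
d = 16
pos = rng.normal(loc=1.0, size=(200, d))   # trait present
neg = rng.normal(loc=-1.0, size=(200, d))  # trait absent
X = np.vstack([pos, neg])
y = np.concatenate([np.ones(200), np.zeros(200)])

# Train a logistic-regression probe with plain gradient descent.
w = np.zeros(d)
b = 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

def probe(h):
    """Probe's estimated probability that activation h exhibits the trait."""
    return 1.0 / (1.0 + np.exp(-(h @ w + b)))

# Inference-time intervention: push an activation along the probe direction,
# but only while the probe score is below a target -- a stand-in for SMITIN's
# self-monitoring (alpha and target are illustrative values, not the paper's).
direction = w / np.linalg.norm(w)

def steer(h, alpha=0.5, target=0.9):
    if probe(h) < target:          # self-monitoring gate: stop when satisfied
        h = h + alpha * direction  # intervene toward the trait
    return h

h = rng.normal(loc=-1.0, size=d)   # a trait-absent activation
steered = h.copy()
for _ in range(20):                # apply across autoregressive steps
    steered = steer(steered)
print(probe(h) < probe(steered))   # steering raises the probe score
```

In the actual system this gating is what keeps the intervention from accumulating over autoregressive steps and degrading temporal coherence; the sketch's fixed `alpha`/`target` stand in for whatever schedule the paper uses.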