SMITIN: Self-Monitored Inference-Time INtervention for Generative Music Transformers (2404.02252v2)

Published 2 Apr 2024 in cs.SD and eess.AS

Abstract: We introduce Self-Monitored Inference-Time INtervention (SMITIN), an approach for controlling an autoregressive generative music transformer using classifier probes. These simple logistic regression probes are trained on the output of each attention head in the transformer using a small dataset of audio examples both exhibiting and missing a specific musical trait (e.g., the presence/absence of drums, or real/synthetic music). We then steer the attention heads in the probe direction, ensuring the generative model output captures the desired musical trait. Additionally, we monitor the probe output to avoid adding an excessive amount of intervention into the autoregressive generation, which could lead to temporally incoherent music. We validate our results objectively and subjectively for both audio continuation and text-to-music applications, demonstrating the ability to add controls to large generative models for which retraining or even fine-tuning is impractical for most musicians. Audio samples of the proposed intervention approach are available on our demo page: http://tinyurl.com/smitin
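The probe-and-steer mechanism the abstract describes can be sketched in miniature. The following is a hedged toy sketch, not the SMITIN implementation: the data, dimensions, learning rate, and thresholds are all illustrative stand-ins for real attention-head activations. It trains a logistic-regression probe on two simulated activation clusters (trait present vs. absent), then nudges an activation along the probe's weight direction, stopping as soon as the probe's own score crosses a target — the "self-monitoring" step that guards against over-intervention.

```python
import math
import random

# Toy sketch of probe training + self-monitored steering.
# All data and hyperparameters are illustrative assumptions; the real method
# operates on the outputs of transformer attention heads.

random.seed(0)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Simulated per-head activations: examples with the musical trait cluster
# around +1 in each dimension, examples without it around -1.
DIM = 2
with_trait = [[random.gauss(1.0, 0.3) for _ in range(DIM)] for _ in range(50)]
without_trait = [[random.gauss(-1.0, 0.3) for _ in range(DIM)] for _ in range(50)]
X = with_trait + without_trait
y = [1] * 50 + [0] * 50

# Train a logistic-regression probe by plain gradient descent
# (no bias term, for brevity).
w = [0.0] * DIM
lr = 0.5
for _ in range(200):
    grad = [0.0] * DIM
    for xi, yi in zip(X, y):
        err = sigmoid(dot(w, xi)) - yi
        for j in range(DIM):
            grad[j] += err * xi[j] / len(X)
    w = [wj - lr * gj for wj, gj in zip(w, grad)]

# Steering direction: the unit-normed probe weights.
norm = math.sqrt(dot(w, w))
direction = [wj / norm for wj in w]

def steer(h, alpha=0.5, target=0.9, max_steps=20):
    """Nudge activation h along the probe direction, but stop as soon as
    the probe itself reports the trait is present (self-monitoring)."""
    for _ in range(max_steps):
        if sigmoid(dot(w, h)) >= target:
            break
        h = [hj + alpha * dj for hj, dj in zip(h, direction)]
    return h

h0 = [-1.0, -1.0]          # an activation lacking the trait
h1 = steer(h0)
print("probe score before:", round(sigmoid(dot(w, h0)), 3))
print("probe score after: ", round(sigmoid(dot(w, h1)), 3))
```

In the paper this intervention is applied per attention head during autoregressive generation; the monitoring step above is what keeps the intervention from being pushed past the point where the probe is already satisfied.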
