MR-MT3: Memory Retaining Multi-Track Music Transcription to Mitigate Instrument Leakage (2403.10024v1)

Published 15 Mar 2024 in cs.SD, cs.AI, cs.LG, cs.MM, and eess.AS

Abstract: This paper presents enhancements to the MT3 model, a state-of-the-art (SOTA) token-based multi-instrument automatic music transcription (AMT) model. Despite its SOTA performance, MT3 suffers from instrument leakage, where transcriptions are fragmented across different instruments. To mitigate this, we propose MR-MT3, with enhancements including a memory retention mechanism, prior token sampling, and token shuffling. These methods are evaluated on the Slakh2100 dataset, demonstrating improved onset F1 scores and reduced instrument leakage. In addition to the conventional multi-instrument transcription F1 score, new metrics such as the instrument leakage ratio and the instrument detection F1 score are introduced for a more comprehensive assessment of transcription quality. The study also explores the issue of domain overfitting by evaluating MT3 on single-instrument monophonic datasets such as ComMU and NSynth. The findings, along with the source code, are shared to facilitate future work aimed at refining token-based multi-instrument AMT models.
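
The abstract names the instrument leakage ratio and the instrument detection F1 score without defining them. The following is a minimal sketch under stated assumptions, not the paper's implementation: it assumes the detection F1 is a set-level F1 between the instruments present in the reference and those present in the prediction, and that the leakage ratio is the number of predicted instruments divided by the number of reference instruments. The function names and the use of integer program numbers as instrument identifiers are hypothetical.

```python
from typing import Iterable, Set


def instrument_detection_f1(reference: Iterable[int], predicted: Iterable[int]) -> float:
    """Set-level F1 between reference and predicted instrument IDs (assumed definition)."""
    ref: Set[int] = set(reference)
    pred: Set[int] = set(predicted)
    if not ref and not pred:
        return 1.0  # nothing to detect and nothing predicted
    true_positives = len(ref & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


def instrument_leakage_ratio(reference: Iterable[int], predicted: Iterable[int]) -> float:
    """Predicted instrument count over reference instrument count (assumed definition).

    Under this assumption, a value above 1 means the transcription spreads
    notes over more instruments than the reference contains.
    """
    ref: Set[int] = set(reference)
    pred: Set[int] = set(predicted)
    if not ref:
        raise ValueError("reference must contain at least one instrument")
    return len(pred) / len(ref)


# Hypothetical example: 3 reference instruments, 6 predicted (3 of them spurious).
reference_ids = [0, 33, 128]
predicted_ids = [0, 33, 128, 5, 25, 48]
f1 = instrument_detection_f1(reference_ids, predicted_ids)
leakage = instrument_leakage_ratio(reference_ids, predicted_ids)
```

In this hypothetical example every reference instrument is found, so recall is 1.0, but half of the predicted instruments are spurious, which lowers the detection F1 and yields a leakage ratio of 2.0.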

