Towards Efficient and Real-Time Piano Transcription Using Neural Autoregressive Models (2404.06818v1)
Abstract: In recent years, advances in neural network design and the availability of large-scale labeled datasets have led to significant improvements in the accuracy of piano transcription models. However, most previous work has focused on high-performance offline transcription, without deliberate consideration of model size. The goal of this work is to enable real-time inference for piano transcription while keeping the model both accurate and lightweight. To this end, we propose novel convolutional recurrent neural network architectures that redesign an existing autoregressive piano transcription model. First, we extend the acoustic module by adding a frequency-conditioned FiLM layer to the CNN module, adapting the convolutional filters along the frequency axis. Second, we improve note-state sequence modeling with a pitchwise LSTM that focuses on note-state transitions within a note. In addition, we augment the autoregressive connection with an enhanced recursive context. Using these components, we propose two types of models: one for high performance and the other for high compactness. Through extensive experiments, we show that the proposed models are comparable to state-of-the-art models in terms of note accuracy on the MAESTRO dataset. We also investigate the effective model size and real-time inference latency by gradually streamlining the architecture. Finally, we conduct cross-dataset evaluation on unseen piano datasets and an in-depth analysis to elucidate the effect of the proposed components with respect to note length and pitch range.
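To make the two architectural ideas in the abstract more concrete, below is a minimal PyTorch sketch of (a) a FiLM layer whose modulation is conditioned on the frequency-bin index, and (b) a pitchwise LSTM that models note-state transitions independently per pitch with shared weights. All names, shapes, and sizes here (`FreqConditionedFiLM`, `PitchwiseLSTM`, the embedding dimension, the five note states) are illustrative assumptions for exposition, not the paper's exact implementation.

```python
# A minimal sketch of the two proposed components, under assumed
# shapes and hyperparameters (not the paper's exact configuration).
import torch
import torch.nn as nn


class FreqConditionedFiLM(nn.Module):
    """FiLM layer whose scale/shift depend on the frequency-bin index,
    letting weight-shared conv filters adapt along the frequency axis."""

    def __init__(self, n_channels: int, n_freq_bins: int, emb_dim: int = 16):
        super().__init__()
        self.freq_emb = nn.Embedding(n_freq_bins, emb_dim)        # one embedding per bin
        self.to_gamma_beta = nn.Linear(emb_dim, 2 * n_channels)   # maps embedding -> (gamma, beta)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, freq)
        freq_ids = torch.arange(x.size(-1), device=x.device)
        gamma_beta = self.to_gamma_beta(self.freq_emb(freq_ids))  # (freq, 2C)
        gamma, beta = gamma_beta.chunk(2, dim=-1)                 # (freq, C) each
        # Reshape to (1, C, 1, freq) so modulation broadcasts over batch and time.
        gamma = gamma.t().unsqueeze(0).unsqueeze(2)
        beta = beta.t().unsqueeze(0).unsqueeze(2)
        return gamma * x + beta


class PitchwiseLSTM(nn.Module):
    """Runs one weight-shared LSTM independently per pitch by folding the
    pitch axis into the batch, modeling state transitions within a note."""

    def __init__(self, in_dim: int, hidden_dim: int, n_states: int = 5):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_states)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, pitch, features), e.g. pitch = 88 piano keys
        b, t, p, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b * p, t, f)  # fold pitch into batch
        h, _ = self.lstm(x)
        logits = self.head(h)                            # (b*p, time, n_states)
        return logits.reshape(b, p, t, -1).permute(0, 2, 1, 3)
```

Under these assumptions, the autoregressive connection described in the abstract would feed note states predicted at previous frames back into the per-pitch recurrence as additional input features, forming the enhanced recursive context.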
- E. Benetos, S. Dixon, Z. Duan, and S. Ewert, “Automatic music transcription: An overview,” IEEE Signal Processing Magazine, vol. 36, no. 1, pp. 20–30, 2019.
- V. Emiya, N. Bertin, B. David, and R. Badeau, “MAPS – A piano database for multipitch estimation and automatic transcription of music,” Tech. Rep., Télécom ParisTech, 2010.
- C. Hawthorne, A. Stasyuk, A. Roberts, I. Simon, C.-Z. A. Huang, S. Dieleman, E. Elsen, J. Engel, and D. Eck, “Enabling factorized piano music modeling and generation with the MAESTRO dataset,” in International Conference on Learning Representations, 2019.
- C. Hawthorne, E. Elsen, J. Song, A. Roberts, I. Simon, C. Raffel, J. Engel, S. Oore, and D. Eck, “Onsets and frames: Dual-objective piano transcription,” in Proc. of the 19th International Society for Music Information Retrieval Conference (ISMIR), October 2018, pp. 50–57.
- J. Kim and J. Bello, “Adversarial learning for improved onsets and frames music transcription,” in Proc. of the 20th International Society for Music Information Retrieval Conference (ISMIR), 2019, pp. 670–677.
- R. Kelz, S. Böck, and G. Widmer, “Deep polyphonic ADSR piano note transcription,” in Proc. of the 44th International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, May 2019, pp. 246–250.
- Q. Kong, B. Li, X. Song, Y. Wan, and Y. Wang, “High-resolution piano transcription with pedals by regressing onset and offset times,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3707–3717, 2021.
- J. J. Carabias-Orti, F. J. Rodríguez-Serrano, P. Vera-Candeas, F. J. Cañadas-Quesada, and N. Ruiz-Reyes, “Constrained non-negative sparse coding using learnt instrument templates for realtime music transcription,” Engineering Applications of Artificial Intelligence, vol. 26, no. 7, pp. 1671–1680, 2013.
- M. Pesek, A. Leonardis, and M. Marolt, “Robust real-time music transcription with a compositional hierarchical model,” PLoS ONE, vol. 12, no. 1, p. e0169411, 2017.
- K. Vaca, A. Gajjar, and X. Yang, “Real-time automatic music transcription (AMT) with Zync FPGA,” in 2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI). IEEE, 2019, pp. 378–384.
- R. Kelz, M. Dorfer, F. Korzeniowski, S. Böck, A. Arzt, and G. Widmer, “On the potential of simple framewise approaches to piano transcription,” in Proc. of the 17th International Society for Music Information Retrieval Conference (ISMIR), August 2016, pp. 475–481.
- T. Kwon, D. Jeong, and J. Nam, “Polyphonic piano transcription using autoregressive multi-state note model,” in Proc. of the 21st International Society for Music Information Retrieval Conference (ISMIR), 2020.
- D. Jeong, “Real-time automatic piano music transcription system,” in Late-Breaking and Demo Session of the 21st International Society for Music Information Retrieval Conference (ISMIR), 2020.
- Q. Wang, R. Zhou, and Y. Yan, “Polyphonic piano transcription with a note-based music language model,” Applied Sciences, vol. 8, no. 3, p. 470, 2018.
- N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent, “Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription,” in Proc. of the 29th International Conference on Machine Learning (ICML), 2012, pp. 1881–1888.
- A. Ycart, A. McLeod, E. Benetos, and K. Yoshii, “Blending acoustic and language model predictions for automatic music transcription,” in Proc. of the 20th International Society for Music Information Retrieval Conference (ISMIR), 2019, pp. 454–461.
- E. Vincent, N. Bertin, and R. Badeau, “Adaptive harmonic spectral decomposition for multiple pitch estimation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 3, pp. 528–537, 2010.
- J. Nam, J. Ngiam, H. Lee, and M. Slaney, “A classification-based polyphonic piano transcription approach using learned feature representations,” in Proc. of the 12th International Society for Music Information Retrieval Conference (ISMIR), 2011, pp. 175–180.
- G. E. Poliner and D. P. Ellis, “A discriminative model for polyphonic piano transcription,” EURASIP Journal on Advances in Signal Processing, vol. 2007, pp. 1–9, 2007.
- Y. Yan, F. Cwitkowitz, and Z. Duan, “Skipping the frame-level: Event-based piano transcription with neural semi-CRFs,” Advances in Neural Information Processing Systems, vol. 34, 2021.
- Q. Wang, R. Zhou, and Y. Yan, “A two-stage approach to note-level transcription of a specific piano,” Applied Sciences, vol. 7, September 2017.
- Y.-T. Wu, B. Chen, and L. Su, “Multi-instrument automatic music transcription with self-attention-based instance segmentation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2796–2809, 2020.
- M. Taenzer, S. I. Mimilakis, and J. Abeßer, “Informing piano multi-pitch estimation with inferred local polyphony based on convolutional neural networks,” Electronics, vol. 10, no. 7, p. 851, 2021.
- N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent, “High-dimensional sequence transduction,” in Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 3178–3182.
- C. Hawthorne, I. Simon, R. Swavely, E. Manilow, and J. Engel, “Sequence-to-sequence piano transcription with transformers,” in Proc. of the 22nd International Society for Music Information Retrieval Conference (ISMIR), 2021.
- A. Cont, D. Schwarz, N. Schnell, and C. Raphael, “Evaluation of real-time audio-to-score alignment,” in International Symposium on Music Information Retrieval (ISMIR), 2007.
- E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville, “FiLM: Visual reasoning with a general conditioning layer,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.
- K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proceedings of the International Conference on Learning Representations (ICLR), 2015.
- C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
- V. Emiya, R. Badeau, and B. David, “Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 6, pp. 1643–1654, 2010.
- W. Goebl, “The Vienna 4x22 Piano Corpus,” 1999. [Online]. Available: http://dx.doi.org/10.21939/4X22
- M. Müller, V. Konz, W. Bogler, and V. Arifi-Müller, “Saarland music data (SMD),” in Late-Breaking and Demo Session of the 12th International Conference on Music Information Retrieval (ISMIR), Miami, USA, 2011.
- C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Nieto, D. Liang, and D. P. Ellis, “mir_eval: A transparent implementation of common MIR metrics,” in Proc. of the 15th International Society for Music Information Retrieval Conference (ISMIR), 2014, pp. 367–372.
- T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2980–2988.
- J. Zhuang, T. Tang, Y. Ding, S. C. Tatikonda, N. Dvornek, X. Papademetris, and J. Duncan, “AdaBelief optimizer: Adapting stepsizes by the belief in observed gradients,” Advances in Neural Information Processing Systems, vol. 33, pp. 18795–18806, 2020.
- X. Gong, W. Xu, J. Liu, and W. Cheng, “Analysis and correction of MAPS dataset,” in Proc. of the 22nd International Conference on Digital Audio Effects (DAFx), 2019.
- F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
- B. C. Ross, “Mutual information between discrete and continuous data sets,” PLoS ONE, vol. 9, no. 2, p. e87357, 2014.
- K. W. Cheuk, D. Herremans, and L. Su, “ReconVAT: A semi-supervised automatic music transcription framework for low-resource real-world data,” in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 3918–3926.
- K. Choi and K. Cho, “Deep unsupervised drum transcription,” in Proc. of the 20th International Society for Music Information Retrieval Conference (ISMIR), 2019, pp. 183–191.
- B. Gfeller, C. Frank, D. Roblek, M. Sharifi, M. Tagliasacchi, and M. Velimirović, “SPICE: Self-supervised pitch estimation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1118–1128, 2020.