Automatic Musical Transcription Applying Fine-Tuning by Composer in Neural Networks for the MusicNet Database

© 2022 by IJETT Journal
Volume-70 Issue-12
Year of Publication: 2022
Authors: Leonardo Veronez Simões, Antônio Roberto Monteiro Simões, Karin Satie Komati, Jefferson Oliveira Andrade
DOI: 10.14445/22315381/IJETT-V70I12P231

How to Cite?

Leonardo Veronez Simões, Antônio Roberto Monteiro Simões, Karin Satie Komati, Jefferson Oliveira Andrade, "Automatic Musical Transcription Applying Fine-Tuning by Composer in Neural Networks for the MusicNet Database," International Journal of Engineering Trends and Technology, vol. 70, no. 12, pp. 328-337, 2022. Crossref, https://doi.org/10.14445/22315381/IJETT-V70I12P231

Abstract
Automatic music transcription is the task of building algorithms that convert acoustic music signals into some form of musical notation. Previous works train a single neural network on an entire database spanning different composers. This work starts from the premise that each composer has a characteristic style. Its objective is to train in two stages: the first produces a generic trained model, and the second is a specialized fine-tuning step that retrains that model once per composer. To this end, experiments were carried out with two neural network architectures, a multilayer perceptron (MLP) and a convolutional network, on the MusicNet database. Overall, fine-tuning improved the average accuracy, except for composers with fewer musical works.
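The abstract reports no implementation details, so the following is a minimal PyTorch sketch of the two-stage scheme it describes, not the authors' code: a generic frame-wise multilabel note classifier is trained once on the full database, then copied and retrained per composer. The MLP layer sizes, learning rates, epoch counts, and the toy data loaders standing in for the MusicNet split and the per-composer subsets are all illustrative assumptions.

```python
# Illustrative sketch of two-stage training: generic model, then one
# fine-tuned copy per composer. All sizes and rates are assumptions.
import copy
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

N_BINS = 2048   # assumed spectrogram bins per input frame
N_NOTES = 128   # MIDI note range; each frame has a multilabel target

class NoteMLP(nn.Module):
    """Simple frame-wise MLP classifier (one of the two architectures tried)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_BINS, 512), nn.ReLU(),
            nn.Linear(512, N_NOTES),  # logits; sigmoid applied in the loss
        )
    def forward(self, x):
        return self.net(x)

def train(model, loader, epochs, lr):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()  # multilabel: independent sigmoid per note
    for _ in range(epochs):
        for frames, labels in loader:
            opt.zero_grad()
            loss_fn(model(frames), labels).backward()
            opt.step()
    return model

def toy_loader(n):
    """Synthetic stand-in for a MusicNet data loader (random frames/labels)."""
    x = torch.randn(n, N_BINS)
    y = (torch.rand(n, N_NOTES) < 0.05).float()  # sparse active notes
    return DataLoader(TensorDataset(x, y), batch_size=32)

full_loader = toy_loader(256)  # stands in for the full training split
composer_loaders = {"Bach": toy_loader(64), "Beethoven": toy_loader(64)}

# Stage 1: one generic model over the whole database.
generic = train(NoteMLP(), full_loader, epochs=3, lr=1e-3)

# Stage 2: a specialised copy per composer, retrained at a lower rate.
per_composer = {
    name: train(copy.deepcopy(generic), loader, epochs=2, lr=1e-4)
    for name, loader in composer_loaders.items()
}
```

Starting each composer's model from the generic weights and retraining at a lower learning rate is the standard fine-tuning recipe; it is also consistent with the abstract's observation that composers with fewer works did not benefit, since a small specialized set gives the second stage little to learn from.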

Keywords
Convolutional Neural Networks, Multilabel Classification, Multilayer Perceptron, Short Time Fourier Transform, Spectrogram, Transfer Learning.
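The keywords outline the input pipeline: raw audio is converted with the short-time Fourier transform into spectrogram frames, and each frame becomes one input vector for a frame-wise multilabel classifier like the one sketched above. Below is a hedged sketch of such a front end; the 4096-sample window, 512-sample hop, and log-magnitude compression are assumptions for illustration, not settings taken from the paper.

```python
# Illustrative STFT front end: each spectrogram frame is one feature vector.
import numpy as np
from scipy.signal import stft

FS = 44100  # MusicNet recordings are sampled at 44.1 kHz

def frames_from_audio(audio: np.ndarray) -> np.ndarray:
    """Return log-magnitude STFT frames with shape (n_frames, n_bins)."""
    _, _, Z = stft(audio, fs=FS, nperseg=4096, noverlap=4096 - 512)
    return np.log1p(np.abs(Z)).T  # time-major; log compresses dynamic range

# Example with one second of synthetic audio (an A4 sine at 440 Hz):
t = np.arange(FS) / FS
spec = frames_from_audio(np.sin(2 * np.pi * 440.0 * t))
print(spec.shape)  # (n_frames, 2049) with this illustrative window size
```

Each row of spec would then be paired with the set of notes sounding at that instant to form the frame's multilabel target.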
