- The paper presents a novel soft-attention mechanism that dynamically adjusts temporal context to improve retrieval accuracy.
- It leverages a learned multimodal embedding space, effectively bypassing preprocessing errors from traditional transcription and OMR.
- Empirical results highlight robust performance, with the long-context attention model achieving 66.71% recall at rank 1 and an MRR of 0.75.
Learning Soft-Attention Models for Tempo-invariant Audio-Sheet Music Retrieval
This essay discusses the paper "Learning Soft-Attention Models for Tempo-invariant Audio-Sheet Music Retrieval" by Stefan Balke, Matthias Dorfer, Luis Carvalho, Andreas Arzt, and Gerhard Widmer. The paper addresses a challenging issue in the domain of music information retrieval (MIR): connecting digitized audio recordings to their respective sheet music images, especially considering the varying tempos found in audio recordings.
Overview of the Proposed Method
The core challenge in the paper stems from global and local tempo deviations in musical recordings, which necessitate careful tuning of temporal context in retrieval systems. Traditional methods often involve mid-level representations for time and position alignment but suffer from preprocessing errors during music transcription and optical music recognition (OMR). To overcome these issues, the authors propose an alternative approach utilizing an embedding space learned directly from multimodal data through deep neural networks. This novel approach avoids the drawbacks of conventional preprocessing.
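The idea of a shared embedding space can be made concrete with a minimal sketch. The snippet below is illustrative only, not the authors' architecture: two hypothetical linear projections map audio and sheet-music features into a common space where retrieval reduces to nearest-neighbour search by cosine similarity. All dimensions and weights here are invented for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature and embedding sizes (not from the paper).
AUDIO_DIM, SHEET_DIM, EMB_DIM = 128, 256, 32

# Stand-ins for learned projection weights; in practice these would be
# trained end-to-end on paired audio/sheet-music data.
W_audio = rng.standard_normal((AUDIO_DIM, EMB_DIM)) * 0.1
W_sheet = rng.standard_normal((SHEET_DIM, EMB_DIM)) * 0.1

def embed(x, W):
    """Project features and L2-normalise, so cosine similarity
    becomes a plain dot product."""
    z = x @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

# A toy "database" of 10 sheet-music snippets and one audio query.
sheet_db = embed(rng.standard_normal((10, SHEET_DIM)), W_sheet)
query = embed(rng.standard_normal(AUDIO_DIM), W_audio)

# Rank snippets by similarity to the audio query.
scores = sheet_db @ query
ranking = np.argsort(-scores)
print(ranking[:3])  # indices of the three best-matching snippets
```

Because both modalities land in the same space, no symbolic transcription or OMR step is needed at retrieval time, which is precisely what lets the approach sidestep those preprocessing errors.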
The innovation introduced in this paper is a soft-attention mechanism implemented on the audio input. This mechanism allows the retrieval system to adjust the temporal context by focusing on different sections of the audio spectrogram based on tempo. Consequently, the system demonstrates increased robustness against tempo variations, benefiting from an adaptable temporal context that enhances retrieval accuracy.
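Soft attention over the audio excerpt can be sketched as follows. This is a simplified stand-in, assuming a generic frame-scoring function rather than the paper's actual attention network: a score per spectrogram frame is passed through a softmax, and the resulting weights pool the frames so that the effective temporal context adapts to the input.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical excerpt: N_FRAMES time steps x N_BINS frequency bins.
N_FRAMES, N_BINS = 84, 92
spectrogram = rng.random((N_FRAMES, N_BINS))

# A learned scorer would assign one relevance score per frame; here a
# random linear layer stands in for the attention network.
w_att = rng.standard_normal(N_BINS)
frame_scores = spectrogram @ w_att

# Numerically stable softmax turns scores into weights over time.
att = np.exp(frame_scores - frame_scores.max())
att /= att.sum()

# The pooled representation is the attention-weighted frame average:
# for a slow performance the weights can spread over more frames, for a
# fast one they can concentrate, giving tempo-adaptive context.
pooled = att @ spectrogram  # shape: (N_BINS,)
print(pooled.shape)
```

The key property is that the weighting is input-dependent: no single fixed context length has to be tuned by hand.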
Strong Numerical Results
The paper presents rigorous quantitative evaluations, demonstrating how the introduced soft-attention mechanism significantly improves retrieval performance over baseline models. Various configurations were tested with different temporal context lengths: short context (SC), medium context (MC), and long context (LC). Baseline models without attention degraded as the temporal context grew, whereas models incorporating the attention network improved consistently across metrics such as mean reciprocal rank (MRR) and recall at various ranks (R@k).
Notably, the long-context model with attention achieved a recall of 66.71% at rank 1 and an MRR of 0.75, showcasing the effectiveness of the attention mechanism in leveraging larger temporal contexts.
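For readers unfamiliar with these metrics, the snippet below shows their standard formulation (this is a generic implementation, not the paper's evaluation code): given the 1-based rank of the correct sheet-music match for each query, MRR averages the reciprocal ranks, and R@k is the fraction of queries whose correct match appears within the top k.

```python
import numpy as np

def retrieval_metrics(ranks, ks=(1, 5, 25)):
    """Compute MRR and R@k from the 1-based rank of the correct
    match for each query."""
    ranks = np.asarray(ranks, dtype=float)
    mrr = float(np.mean(1.0 / ranks))            # mean reciprocal rank
    recall = {k: float(np.mean(ranks <= k)) for k in ks}
    return mrr, recall

# Toy example: correct snippet ranked 1st, 1st, 3rd, 10th for four queries.
mrr, recall = retrieval_metrics([1, 1, 3, 10])
print(round(mrr, 3), recall[1])  # 0.608 0.5
```

An R@1 of 66.71% thus means the correct sheet snippet was the single top-ranked result for roughly two thirds of the audio queries.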
Implications and Future Research Directions
This research holds implications for the broader field of MIR and cross-modal retrieval tasks. The adoption of soft-attention mechanisms paves the way for more resilient systems capable of handling temporal variability, a feature prevalent in real-world music performances. In practical terms, such systems can streamline the mapping of audio to corresponding sheet music, enhancing music education tools, digital music libraries, and automated music analysis systems.
Theoretically, this method can extend to other domains where cross-modal data alignment is essential. By enabling models to dynamically adjust context based on input characteristics, the approach offers a generalizable strategy for achieving robust performance in diverse applications.
Future research may expand on this by applying the proposed attention mechanism to real performances, exploring adaptations for polyphonic music with varying timbres, and extending the model's capabilities beyond the pianistic domain. Integrating such mechanisms with more complex music compositions and performance nuances would further validate and refine the applicability of this approach.
Overall, the paper offers a comprehensive study of enhancing audio-sheet music retrieval systems, yielding promising results through the integration of a soft-attention mechanism. This work advances the capabilities of MIR systems, providing an adaptable solution to the inherent challenges posed by tempo variations in audio recordings.