
BEAST: Online Joint Beat and Downbeat Tracking Based on Streaming Transformer (2312.17156v3)

Published 28 Dec 2023 in cs.SD and eess.AS

Abstract: Many deep learning models have achieved dominant performance on the offline beat tracking task. However, online beat tracking, in which only the past and present input features are available, still remains challenging. In this paper, we propose BEAt tracking Streaming Transformer (BEAST), an online joint beat and downbeat tracking system based on the streaming Transformer. To deal with online scenarios, BEAST applies contextual block processing in the Transformer encoder. Moreover, we adopt relative positional encoding in the attention layer of the streaming Transformer encoder to capture relative timing position which is critically important information in music. Carrying out beat and downbeat experiments on benchmark datasets for a low latency scenario with maximum latency under 50 ms, BEAST achieves an F1-measure of 80.04% in beat and 46.78% in downbeat, which is a substantial improvement of about 5 percentage points over the state-of-the-art online beat tracking model.


Summary

  • The paper presents BEAST, which leverages streaming Transformers and contextual block processing for real-time beat and downbeat tracking.
  • It employs relative positional encoding in the streaming encoder, reaching an F1-measure of 80.04% for beat tracking at 46 ms latency and outperforming CRNN-based online models.
  • The approach advances real-time music information retrieval, offering practical benefits for digital audio workstations and virtual accompaniment systems.

An Evaluation of BEAST: Online Beat and Downbeat Tracking with Streaming Transformers

The paper "BEAST: Online Joint Beat and Downbeat Tracking Based on Streaming Transformer," authored by Chih-Cheng Chang and Li Su from the Institute of Information Science, Academia Sinica, addresses critical challenges in online music beat tracking. Through the development of the BEAST framework, the paper contributes a novel approach leveraging a streaming Transformer model, specifically tailored for online joint beat and downbeat tracking with an emphasis on low latency requirements.

Key Contributions and Methodology

BEAST adopts a Transformer-based model that utilizes contextual block processing to facilitate online capabilities. Traditionally, Transformer architectures require the complete input sequence to calculate attention scores, presenting significant challenges for real-time applications. BEAST circumvents this limitation by segmenting the input sequence into non-overlapping blocks, each coupled with additional context frames for left and right sub-blocks. This strategy maintains the critical temporal context necessary for accurate beat and downbeat predictions while supporting incremental processing, enhancing suitability for online operations.
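The block-segmentation step described above can be sketched as follows. This is a minimal illustration of contextual block processing, not the paper's implementation; the block and context sizes are hypothetical parameters chosen for clarity.

```python
import numpy as np

def make_context_blocks(features, block_size=16, left_ctx=8, right_ctx=8):
    """Split a (time, dim) feature sequence into non-overlapping blocks,
    each padded with left/right context frames.

    A sketch of contextual block processing: the encoder attends within
    each padded block, so new frames can be processed incrementally
    instead of waiting for the full sequence. Parameter values are
    illustrative, not taken from the paper.
    """
    T, _ = features.shape
    blocks = []
    for start in range(0, T, block_size):
        lo = max(0, start - left_ctx)           # left context frames
        hi = min(T, start + block_size + right_ctx)  # right context frames
        blocks.append(features[lo:hi])
    return blocks
```

In an online setting, the right-context frames are what bound the latency: the model can only emit predictions for a block once those few future frames have arrived, which is why the paper reports latencies on the order of tens of milliseconds.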

The model further incorporates relative positional encoding in place of the more conventional absolute positional encoding. This choice lets the model capture the pairwise timing relationships between musical events, which are vital for understanding rhythmic structure. Experimental evaluations showed that relative positional encoding outperforms absolute positional encoding, reaching an F1-measure of 83.65%.
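The idea can be illustrated with a single-head attention sketch in the spirit of Shaw et al.'s relative position representations: the attention score between positions i and j gains a term that depends only on the offset j - i. The shapes and indexing here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def rel_attention(q, k, v, rel_emb):
    """Single-head self-attention with relative position terms (sketch).

    q, k, v: (T, d) query/key/value matrices.
    rel_emb: (2*T - 1, d) embeddings for offsets -(T-1) .. T-1; the
    embedding for offset (j - i) lives at index j - i + T - 1.
    """
    T, d = q.shape
    scores = q @ k.T  # content-content term
    for i in range(T):
        for j in range(T):
            # content-position term: q_i dot r_{j-i}
            scores[i, j] += q[i] @ rel_emb[j - i + T - 1]
    scores /= np.sqrt(d)
    # row-wise softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Because the position term depends only on the offset, the same embeddings apply regardless of where a block sits in the stream, which is what makes relative encoding a natural fit for block-wise online processing.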

Numerical Results and Performance

Experimental results show that BEAST achieves an F1-measure of 80.04% for beat tracking at a latency of 46 ms, setting a new benchmark for online beat tracking. The system also delivers a substantial gain in downbeat tracking, with an F1-measure of 46.78%. These results represent an improvement of about 5 percentage points over leading alternatives, such as CRNN-based state-of-the-art models. In particular, BEAST addresses a long-standing challenge in online beat tracking by balancing accuracy against latency, achieving low real-time factors (RTFs) without sacrificing tracking performance.
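For context, the F1-measure used in beat tracking counts an estimated beat as correct when it falls within a small tolerance window (conventionally ±70 ms) of an unmatched reference beat. The greedy matcher below is a simplified sketch of that standard evaluation, not the paper's scoring code.

```python
def beat_f_measure(est, ref, tol=0.07):
    """F1-measure for beat tracking (simplified sketch).

    est, ref: sorted lists of beat times in seconds.
    tol: tolerance window in seconds (0.07 s is the common convention).
    Each reference beat may be matched at most once.
    """
    matched = set()
    tp = 0
    for e in est:
        for i, r in enumerate(ref):
            if i not in matched and abs(e - r) <= tol:
                matched.add(i)
                tp += 1
                break
    precision = tp / len(est) if est else 0.0
    recall = tp / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, estimating three of four reference beats exactly gives precision 1.0 and recall 0.75, hence F1 ≈ 0.857, which shows why missed downbeats depress the downbeat F1 far more than the beat F1.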

Implications and Speculation on Future Directions

The development of BEAST delineates significant advancements for real-time music information retrieval, particularly in applications where immediate responsiveness is crucial, such as digital audio workstations and virtual accompaniment systems. BEAST's dual focus on latency and accuracy positions it as a potential foundation for extensions into broader MIR domains, such as online transcription and automated music generation.

Looking forward, the successful adaptation of streaming Transformer architectures to MIR tasks signals a promising pathway for future research endeavors in real-time audio processing. An exploration into integrating BEAST with generative models could pave the way for sophisticated real-time accompaniment and compositional systems. Moreover, these findings endorse further investigations into scaling BEAST's architecture for handling larger, more complex musical datasets, thus potentially yielding insights into more generalizable rhythmic models.

In conclusion, BEAST exemplifies a significant contribution to the field of music beat tracking, addressing long-standing challenges associated with real-time processing. The model’s innovative use of streaming Transformers promises to inform future advancements and applications within music information retrieval, setting a precedent for subsequent research in this domain.