ADTOF Package: Automated Drum Transcription

Updated 30 September 2025
  • ADTOF Package is an end-to-end system that standardizes crowdsourced drum annotations with robust correction pipelines for accurate transcription.
  • It employs a convolutional recurrent neural network that converts audio into spectrograms and extracts temporal patterns for precise onset detection.
  • It enhances transcription by expanding drum class categories and integrating MIDI velocity estimation to capture detailed musical dynamics.

The ADTOF package is an end-to-end system for automatic drum transcription (ADT) that combines a large crowdsourced dataset of real-world, manually annotated drum performances with a robust data processing pipeline, machine learning infrastructure, and post-processing routines. It is centered around the creation and exploitation of a standardized dataset derived from rhythm game charts, facilitating state-of-the-art drum onset transcription across multiple datasets and scenarios, and enabling new directions for music information retrieval (MIR) and music production applications.

1. Dataset Construction and Annotation Standardization

The ADTOF dataset is constructed by aggregating custom rhythm game charts annotated by a large community of users. These charts contain precise timing and drum label information based on gameplay and animation events. Given the heterogeneous sources, ADTOF provides tools for converting varied proprietary chart formats into a standardized transcription format suitable for machine learning workflows. This ingestion/conversion pipeline supports scalability and robustness in data preparation.

A major challenge addressed by ADTOF is the inherent noise in crowdsourced annotations. Original chart onsets deviate by up to ~50 ms from the actual musical onsets, rendering them unsuitable for modern ADT approaches that require sub-10 ms alignment. The package applies an automatic data-cleansing sequence: (1) a beat-tracking algorithm such as Böck's madmom-based method yields reliable temporal anchor points, and (2) a linear-interpolation correction snaps the noisy onsets to the refined grid. For any beat interval $[t_1, t_2]$ with beat corrections $\Delta t_1$ and $\Delta t_2$, an arbitrary timestamp $t$ is corrected via

t_\text{correct} = t + \Delta t(t), \qquad \Delta t(t) = \Delta t_1 + \frac{t - t_1}{t_2 - t_1}\,(\Delta t_2 - \Delta t_1)

This procedure dramatically reduces annotation error, directly improving training label quality.
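The correction reduces to piecewise-linear interpolation between beat anchors. A minimal sketch, assuming the chart beats and audio-tracked beats are aligned one-to-one (function and variable names are illustrative, not ADTOF's API):

```python
import numpy as np

def correct_onsets(onsets, chart_beats, tracked_beats):
    """Snap annotated onset times onto the audio-derived beat grid."""
    onsets = np.asarray(onsets, dtype=float)
    chart_beats = np.asarray(chart_beats, dtype=float)
    # Per-beat correction Delta t_i: offset of each tracked beat from
    # the beat position implied by the chart's tempo map.
    deltas = np.asarray(tracked_beats, dtype=float) - chart_beats
    # np.interp realizes the piecewise-linear Delta t(t) between anchors,
    # so t_correct = t + Delta t(t) for every annotated onset.
    return onsets + np.interp(onsets, chart_beats, deltas)
```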

Label standardization is imposed via automatic remapping of instrument class labels to five canonical categories: bass drum, snare, toms, hi-hats, and combined crash/ride cymbals. This harmonization mitigates inconsistencies due to local nomenclature differences in crowdsourced charts and simplifies downstream classification.
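In code, this harmonization amounts to a lookup table. The source vocabulary below is purely hypothetical, since each chart format uses its own labels:

```python
# Hypothetical chart-label vocabulary mapped to the five canonical classes.
CANONICAL_CLASSES = {
    "kick": "bass drum",  "bass_drum": "bass drum",
    "snare": "snare",     "rimshot": "snare",
    "hi_tom": "toms",     "mid_tom": "toms",    "floor_tom": "toms",
    "hihat_closed": "hi-hats", "hihat_open": "hi-hats",
    "crash": "crash/ride", "ride": "crash/ride", "china": "crash/ride",
}

def harmonize(label):
    """Map a chart-specific label to its canonical class (None if unknown)."""
    return CANONICAL_CLASSES.get(label)
```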

2. Model Architecture and Spectrogram Representation

ADTOF employs a convolutional recurrent neural network (CRNN) architecture for onset detection and drum class prediction. Input audio is transformed into a time–frequency representation using a short-time Fourier transform (STFT) with a window size of 2048 and hop length of 441 samples (yielding ~100 Hz frame rate). The magnitude spectrum is filtered onto a log-frequency scale (12 bins/octave, 20–20,000 Hz) and mapped to a log-scale energy feature:

S(m, n) = \log\left( \sum_k |X(k, n)|^2 \, \phi_m(k) \right)

where $X(k, n)$ denotes the STFT and $\phi_m(k)$ is the triangular filter of the $m$-th log-frequency bin.
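A sketch of this front end, assuming 44.1 kHz audio and building the triangular log-frequency filterbank by hand (librosa is used only for the STFT; every parameter beyond those stated above is an assumption):

```python
import numpy as np
import librosa

def log_freq_spectrogram(y, sr=44100, n_fft=2048, hop=441,
                         fmin=20.0, fmax=20000.0, bins_per_octave=12):
    """Log-magnitude, log-frequency spectrogram S(m, n)."""
    X = librosa.stft(y, n_fft=n_fft, hop_length=hop)   # STFT X(k, n)
    power = np.abs(X) ** 2                             # |X(k, n)|^2
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    # Triangular filters phi_m(k) centered on log-spaced frequencies
    # (12 per octave between fmin and fmax).
    n_centers = int(np.log2(fmax / fmin) * bins_per_octave) + 2
    centers = fmin * 2.0 ** (np.arange(n_centers) / bins_per_octave)
    filters = np.zeros((n_centers - 2, len(fft_freqs)))
    for m in range(1, n_centers - 1):
        lo, mid, hi = centers[m - 1], centers[m], centers[m + 1]
        rise = (fft_freqs - lo) / (mid - lo)
        fall = (hi - fft_freqs) / (hi - mid)
        filters[m - 1] = np.clip(np.minimum(rise, fall), 0.0, None)
    # S(m, n) = log(sum_k |X(k, n)|^2 phi_m(k)); epsilon avoids log(0).
    return np.log(filters @ power + 1e-10)
```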

Convolutional layers extract local acoustic features, with recurrent layers (LSTM/GRU) modeling cross-temporal patterns. The final output is a per-class activation time series:

y(t) = \sigma(W_r h(t) + b)

Here, $h(t)$ is the recurrent output, $W_r$ and $b$ are learned parameters, and $\sigma$ is the sigmoid function. Peak picking over $y(t)$ delivers the symbolic note transcription.
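An illustrative CRNN in PyTorch with the same overall shape (convolutions over time–frequency, a recurrent layer, and a per-class sigmoid head); the layer sizes and the choice of a bidirectional GRU are assumptions, not the published architecture:

```python
import torch
import torch.nn as nn

class DrumCRNN(nn.Module):
    """Minimal CRNN sketch: conv feature extractor + GRU + sigmoid head."""

    def __init__(self, n_bins, n_classes=5, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 2)),   # pool frequency only, keep time rate
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 2)),
        )
        self.rnn = nn.GRU(32 * (n_bins // 4), hidden,
                          batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)   # W_r h(t) + b

    def forward(self, spec):
        # spec: (batch, time, freq) log-spectrogram
        x = self.conv(spec.unsqueeze(1))               # (B, C, T, F')
        b, c, t, f = x.shape
        h, _ = self.rnn(x.permute(0, 2, 1, 3).reshape(b, t, c * f))
        return torch.sigmoid(self.head(h))             # y(t), one per class
```

At inference time, peak picking over each class's activation curve (e.g., scipy.signal.find_peaks with a threshold and a minimum inter-peak distance) converts the frame-wise activations into discrete onsets.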

3. Drum Stem Source Separation and Output Expansion

The original ADTOF system is limited to five drum classes, primarily due to the dataset’s labeling structure. Recent work extends its transcription granularity through integration with drum stem source separation models. Specifically, algorithms such as the Jarredou model decompose the drum track into isolated “stems,” permitting discrimination between multiple cymbal types (e.g., crash and ride).

A post-processing phase exploits the separated stem loudness envelopes:

  • Class Expansion (from 5 to 7):
    • Crash/ride: For every detected cymbal event, the relative stem loudness is evaluated. A "refractory period" is enforced after each major loudness peak, so subsequent cymbal hits detected during the decay are re-mapped as ride rather than crash.
    • Hi-hats: Within the onset's local window, if the minimum loudness is at least 75% of the maximum ($L_\text{min} \geq 0.75\, L_\text{max}$), the hit is labeled "open"; otherwise "closed."
  • MIDI Velocity Estimation:
    • For each detected event, the maximum loudness in a 50 ms window centered on the onset is extracted and mapped, via a normalized dynamic range, to a MIDI velocity (0–127). Loudness curves are computed with an equal-loudness filter and a 10 ms hop (see the sketch after this list).
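A minimal sketch of two of these heuristics, assuming a non-negative, linear-scale loudness envelope sampled every 10 ms; the dynamic-range endpoints used for velocity normalization are assumptions:

```python
import numpy as np

def hihat_articulation(window_loudness):
    """Open vs. closed hi-hat: open if the decay stays above 75% of peak."""
    w = np.asarray(window_loudness, dtype=float)
    return "open" if w.min() >= 0.75 * w.max() else "closed"

def midi_velocity(loudness, times, onset, win=0.050, lo=0.05, hi=1.0):
    """Peak loudness in a 50 ms window around the onset -> MIDI velocity.

    lo/hi are assumed endpoints of the normalized dynamic range.
    """
    loudness = np.asarray(loudness, dtype=float)
    times = np.asarray(times, dtype=float)
    mask = np.abs(times - onset) <= win / 2            # 50 ms centered window
    peak = loudness[mask].max()
    norm = np.clip((peak - lo) / (hi - lo), 0.0, 1.0)  # normalize range
    return int(round(127 * norm))
```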

This output expansion enables generation of 7-class and velocity-parameterized MIDI transcriptions from an originally 5-class system, improving both structural and expressive realism.

4. Performance Evaluation and Benchmarks

Empirical evaluation is conducted on established datasets (MDB, RBMA, ENST), using both the standard 5-class and expanded 8-class (7-class from ADTOF plus a rare cowbell class) transcription objectives.

Summary of reported benchmarks:

Dataset | 5-class F-measure | 8-class F-measure (Δ vs. 5-class baseline)
MDB     | 0.89              | +12%
ENST    | 0.85              | +10%
RBMA    | —                 | −2%

Performance gains from class expansion are substantial for the acoustic datasets (MDB, ENST). The minor drop for RBMA is attributed to the prevalence of electronic drum sounds, for which stem separation introduces greater confounds.
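For reference, per-class onset scoring can be reproduced with mir_eval's standard onset metric; the onset values below are invented for illustration:

```python
import numpy as np
import mir_eval

ref = np.array([0.50, 1.00, 1.52, 2.01])  # reference onsets, one class (s)
est = np.array([0.51, 1.00, 1.60, 2.00])  # estimated onsets (s)

# Conventional 50 ms tolerance window; per-class scores are then
# averaged across the five (or eight) drum classes.
f, p, r = mir_eval.onset.f_measure(ref, est, window=0.05)
print(f"F={f:.2f}  P={p:.2f}  R={r:.2f}")
```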

5. Automatic Annotation Correction: Impact and Rationale

Precise temporal alignment and label consistency are crucial for convolutional–recurrent models, where noisy or inconsistent labels degrade discriminative capacity. The ADTOF correction pipeline reduces annotation noise through both beat-aligned linear correction and systematic class remapping, substantially narrowing the gap between crowdsourced data and expert-annotated corpora. This is especially critical given the sensitivity of onset detectors to temporal jitter. Downstream, these corrections yield improved F-measure and generalization when training ADT models, even transferring to external evaluation datasets.

6. Applications, Capabilities, and Extensions

By design, ADTOF is applicable as both a training and benchmark dataset for ADT research. Its scale (114 hours) and realism (derived from commercial music performances rather than synthetic or isolated stems) enable state-of-the-art generalization: models trained solely on ADTOF achieve performance matching or exceeding prior approaches utilizing multi-step real/synthetic corpora integration. This authenticity supports tasks in MIR such as drum pattern analysis, performance evaluation, and dataset curation.

The integration of class expansion (post-processing) and velocity extraction enhances its value to music production and musicological applications, yielding transcriptions that translate into detailed and dynamically nuanced MIDI representations. These attributes also facilitate hybrid research at the intersection of source separation, classification, and symbolic music processing.

7. Unique Features and Research Context

The ADTOF package exhibits a set of distinctive attributes:

  • Scale and Coverage: Approximately two orders of magnitude larger than previous non-synthetic datasets.
  • Authenticity: Directly sourced from real-world, full-mix recordings.
  • End-to-End Curation: Automated ingestion, annotation correction, and label harmonization.
  • Integration: Directly supports CRNN workflows with ready-to-use spectrogram and label formats.
  • Extensibility: Supports class expansion and MIDI velocity output via principled use of external stem models and deterministic heuristics.
  • Benchmarking: Demonstrates improved or competitive results on a range of MIR evaluation datasets, positioning it as a central corpus for ADT benchmarking and development.

These properties address long-standing gaps in the field, enabling robust supervised learning pipelines for drum onset transcription and facilitating detailed symbolic representation in downstream MIR and creative workflows (Zehren et al., 2021, Riley et al., 29 Sep 2025).
