ADTOF Package: Automated Drum Transcription

Updated 30 September 2025
  • ADTOF Package is an end-to-end system that standardizes crowdsourced drum annotations with robust correction pipelines for accurate transcription.
  • It employs a convolutional recurrent neural network that converts audio into spectrograms and extracts temporal patterns for precise onset detection.
  • It enhances transcription by expanding drum class categories and integrating MIDI velocity estimation to capture detailed musical dynamics.

The ADTOF package is an end-to-end system for automatic drum transcription (ADT) that combines a large crowdsourced dataset of real-world, manually annotated drum performances with a robust data processing pipeline, machine learning infrastructure, and post-processing routines. It is centered around the creation and exploitation of a standardized dataset derived from rhythm game charts, facilitating state-of-the-art drum onset transcription across multiple datasets and scenarios, and enabling new directions for music information retrieval (MIR) and music production applications.

1. Dataset Construction and Annotation Standardization

The ADTOF dataset is constructed by aggregating custom rhythm game charts annotated by a large community of users. These charts contain precise timing and drum label information based on gameplay and animation events. Given the heterogeneous sources, ADTOF provides tools for converting varied proprietary chart formats into a standardized transcription format suitable for machine learning workflows. This ingestion/conversion pipeline supports scalability and robustness in data preparation.

A major challenge addressed by ADTOF is the inherent noise in crowdsourced annotations. Original chart onsets deviate by up to ~50 ms from the actual musical onsets, rendering them unsuitable for modern ADT approaches that require sub-10 ms alignment. The package applies an automatic data-cleansing sequence: (1) a beat-tracking algorithm such as Böck's madmom-based method yields reliable temporal anchor points, and (2) a linear-interpolation correction snaps the noisy onsets to the refined grid. For any beat interval $[t_1, t_2]$ with beat corrections $\Delta t_1$ and $\Delta t_2$, an arbitrary timestamp $t$ is corrected via

t_\text{correct} = t + \Delta t(t), \qquad \Delta t(t) = \Delta t_1 + \frac{t - t_1}{t_2 - t_1}\,(\Delta t_2 - \Delta t_1)

This procedure dramatically reduces annotation error, directly improving training label quality.
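The correction reduces to piecewise-linear interpolation between beat anchors. A minimal sketch, assuming the chart beats and audio-tracked beats are aligned one-to-one (function and variable names are illustrative, not ADTOF's API):

```python
import numpy as np

def correct_onsets(onsets, chart_beats, tracked_beats):
    """Snap annotated onset times onto the audio-derived beat grid."""
    onsets = np.asarray(onsets, dtype=float)
    chart_beats = np.asarray(chart_beats, dtype=float)
    # Per-beat correction Delta t_i: offset of each tracked beat from
    # the beat position implied by the chart's tempo map.
    deltas = np.asarray(tracked_beats, dtype=float) - chart_beats
    # np.interp realizes the piecewise-linear Delta t(t) between anchors,
    # so t_correct = t + Delta t(t) for every annotated onset.
    return onsets + np.interp(onsets, chart_beats, deltas)
```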

Label standardization is imposed via automatic remapping of instrument class labels to five canonical categories: bass drum, snare, toms, hi-hats, and combined crash/ride cymbals. This harmonization mitigates inconsistencies due to local nomenclature differences in crowdsourced charts and simplifies downstream classification.
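In code, this harmonization amounts to a lookup table. The source vocabulary below is purely hypothetical, since each chart format uses its own labels:

```python
# Hypothetical chart-label vocabulary mapped to the five canonical classes.
CANONICAL_CLASSES = {
    "kick": "bass drum",  "bass_drum": "bass drum",
    "snare": "snare",     "rimshot": "snare",
    "hi_tom": "toms",     "mid_tom": "toms",    "floor_tom": "toms",
    "hihat_closed": "hi-hats", "hihat_open": "hi-hats",
    "crash": "crash/ride", "ride": "crash/ride", "china": "crash/ride",
}

def harmonize(label):
    """Map a chart-specific label to its canonical class (None if unknown)."""
    return CANONICAL_CLASSES.get(label)
```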

2. Model Architecture and Spectrogram Representation

ADTOF employs a convolutional recurrent neural network (CRNN) architecture for onset detection and drum class prediction. Input audio is transformed into a time–frequency representation using a short-time Fourier transform (STFT) with a window size of 2048 and hop length of 441 samples (yielding ~100 Hz frame rate). The magnitude spectrum is filtered onto a log-frequency scale (12 bins/octave, 20–20,000 Hz) and mapped to a log-scale energy feature:

S(m, n) = \log\left( \sum_k |X(k, n)|^2 \, \phi_m(k) \right)

where $X(k, n)$ denotes the STFT and $\phi_m(k)$ is the triangular filter of the $m$-th log-frequency bin.
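A sketch of this front end, assuming 44.1 kHz audio and building the triangular log-frequency filterbank by hand (librosa is used only for the STFT; every parameter beyond those stated above is an assumption):

```python
import numpy as np
import librosa

def log_freq_spectrogram(y, sr=44100, n_fft=2048, hop=441,
                         fmin=20.0, fmax=20000.0, bins_per_octave=12):
    """Log-magnitude, log-frequency spectrogram S(m, n)."""
    X = librosa.stft(y, n_fft=n_fft, hop_length=hop)   # STFT X(k, n)
    power = np.abs(X) ** 2                             # |X(k, n)|^2
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    # Triangular filters phi_m(k) centered on log-spaced frequencies
    # (12 per octave between fmin and fmax).
    n_centers = int(np.log2(fmax / fmin) * bins_per_octave) + 2
    centers = fmin * 2.0 ** (np.arange(n_centers) / bins_per_octave)
    filters = np.zeros((n_centers - 2, len(fft_freqs)))
    for m in range(1, n_centers - 1):
        lo, mid, hi = centers[m - 1], centers[m], centers[m + 1]
        rise = (fft_freqs - lo) / (mid - lo)
        fall = (hi - fft_freqs) / (hi - mid)
        filters[m - 1] = np.clip(np.minimum(rise, fall), 0.0, None)
    # S(m, n) = log(sum_k |X(k, n)|^2 phi_m(k)); epsilon avoids log(0).
    return np.log(filters @ power + 1e-10)
```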

Convolutional layers extract local acoustic features, with recurrent layers (LSTM/GRU) modeling cross-temporal patterns. The final output is a per-class activation time series:

y(t) = \sigma(W_r h(t) + b)

Here, $h(t)$ is the recurrent output, $W_r$ and $b$ are learned parameters, and $\sigma$ is the sigmoid function. Peak picking over $y(t)$ delivers the symbolic note transcription.
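An illustrative CRNN in PyTorch with the same overall shape (convolutions over time–frequency, a recurrent layer, and a per-class sigmoid head); the layer sizes and the choice of a bidirectional GRU are assumptions, not the published architecture:

```python
import torch
import torch.nn as nn

class DrumCRNN(nn.Module):
    """Minimal CRNN sketch: conv feature extractor + GRU + sigmoid head."""

    def __init__(self, n_bins, n_classes=5, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 2)),   # pool frequency only, keep time rate
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 2)),
        )
        self.rnn = nn.GRU(32 * (n_bins // 4), hidden,
                          batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)   # W_r h(t) + b

    def forward(self, spec):
        # spec: (batch, time, freq) log-spectrogram
        x = self.conv(spec.unsqueeze(1))               # (B, C, T, F')
        b, c, t, f = x.shape
        h, _ = self.rnn(x.permute(0, 2, 1, 3).reshape(b, t, c * f))
        return torch.sigmoid(self.head(h))             # y(t), one per class
```

At inference time, peak picking over each class's activation curve (e.g., scipy.signal.find_peaks with a threshold and a minimum inter-peak distance) converts the frame-wise activations into discrete onsets.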

3. Drum Stem Source Separation and Output Expansion

The original ADTOF system is limited to five drum classes, primarily due to the dataset’s labeling structure. Recent work extends its transcription granularity through integration with drum stem source separation models. Specifically, algorithms such as the Jarredou model decompose the drum track into isolated “stems,” permitting discrimination between multiple cymbal types (e.g., crash and ride).

A post-processing phase exploits the separated stem loudness envelopes:

  • Class Expansion (from 5 to 7):
    • Crash/ride: For every detected cymbal event, the relative stem loudness is evaluated. A "refractory period" is enforced after each major loudness peak, so subsequent cymbal hits detected during the decay are re-mapped as ride rather than crash.
    • Hi-hats: Within the onset's local window, if the minimum loudness is at least 75% of the maximum ($L_\text{min} \geq 0.75\, L_\text{max}$), the hit is labeled "open"; otherwise "closed."
  • MIDI Velocity Estimation:
    • For each detected event, the maximum loudness in a 50 ms window centered on the onset is extracted and mapped, via a normalized dynamic range, to a MIDI velocity (0–127). Loudness curves are computed with an equal-loudness filter and a 10 ms hop (see the sketch after this list).
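A minimal sketch of two of these heuristics, assuming a non-negative, linear-scale loudness envelope sampled every 10 ms; the dynamic-range endpoints used for velocity normalization are assumptions:

```python
import numpy as np

def hihat_articulation(window_loudness):
    """Open vs. closed hi-hat: open if the decay stays above 75% of peak."""
    w = np.asarray(window_loudness, dtype=float)
    return "open" if w.min() >= 0.75 * w.max() else "closed"

def midi_velocity(loudness, times, onset, win=0.050, lo=0.05, hi=1.0):
    """Peak loudness in a 50 ms window around the onset -> MIDI velocity.

    lo/hi are assumed endpoints of the normalized dynamic range.
    """
    loudness = np.asarray(loudness, dtype=float)
    times = np.asarray(times, dtype=float)
    mask = np.abs(times - onset) <= win / 2            # 50 ms centered window
    peak = loudness[mask].max()
    norm = np.clip((peak - lo) / (hi - lo), 0.0, 1.0)  # normalize range
    return int(round(127 * norm))
```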

This output expansion enables generation of 7-class and velocity-parameterized MIDI transcriptions from an originally 5-class system, improving both structural and expressive realism.

4. Performance Evaluation and Benchmarks

Empirical evaluation is conducted on established datasets (MDB, RBMA, ENST), using both the standard 5-class and expanded 8-class (7-class from ADTOF plus a rare cowbell class) transcription objectives.

Summary of reported benchmarks:

Dataset | 5-class F-measure | 8-class F-measure (Δ vs. 5-class baseline)
MDB     | 0.89              | +12%
ENST    | 0.85              | +10%
RBMA    | —                 | −2%

Performance gains from class expansion are substantial for the acoustic datasets (MDB, ENST). The minor drop for RBMA is attributed to the prevalence of electronic drum sounds, for which stem separation introduces greater confounds.
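For reference, per-class onset scoring can be reproduced with mir_eval's standard onset metric; the onset values below are invented for illustration:

```python
import numpy as np
import mir_eval

ref = np.array([0.50, 1.00, 1.52, 2.01])  # reference onsets, one class (s)
est = np.array([0.51, 1.00, 1.60, 2.00])  # estimated onsets (s)

# Conventional 50 ms tolerance window; per-class scores are then
# averaged across the five (or eight) drum classes.
f, p, r = mir_eval.onset.f_measure(ref, est, window=0.05)
print(f"F={f:.2f}  P={p:.2f}  R={r:.2f}")
```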

5. Automatic Annotation Correction: Impact and Rationale

Precise temporal alignment and label consistency are crucial for convolutional–recurrent models, where noisy or inconsistent labels degrade discriminative capacity. The ADTOF correction pipeline reduces annotation noise through both beat-aligned linear correction and systematic class remapping, substantially narrowing the gap between crowdsourced data and expert-annotated corpora. This is especially critical given the sensitivity of onset detectors to temporal jitter. Downstream, these corrections yield improved F-measure and generalization when training ADT models, even transferring to external evaluation datasets.

6. Applications, Capabilities, and Extensions

By design, ADTOF is applicable as both a training and benchmark dataset for ADT research. Its scale (114 hours) and realism (derived from commercial music performances rather than synthetic or isolated stems) enable state-of-the-art generalization: models trained solely on ADTOF achieve performance matching or exceeding prior approaches utilizing multi-step real/synthetic corpora integration. This authenticity supports tasks in MIR such as drum pattern analysis, performance evaluation, and dataset curation.

The integration of class expansion (post-processing) and velocity extraction enhances its value to music production and musicological applications, yielding transcriptions that translate into detailed and dynamically nuanced MIDI representations. These attributes also facilitate hybrid research at the intersection of source separation, classification, and symbolic music processing.

7. Unique Features and Research Context

The ADTOF package exhibits a set of distinctive attributes:

  • Scale and Coverage: Approximately two orders of magnitude larger than previous non-synthetic datasets.
  • Authenticity: Directly sourced from real-world, full-mix recordings.
  • End-to-End Curation: Automated ingestion, annotation correction, and label harmonization.
  • Integration: Directly supports CRNN workflows with ready-to-use spectrogram and label formats.
  • Extensibility: Supports class expansion and MIDI velocity output via principled use of external stem models and deterministic heuristics.
  • Benchmarking: Demonstrates improved or competitive results on a range of MIR evaluation datasets, positioning it as a central corpus for ADT benchmarking and development.

These properties address long-standing gaps in the field, enabling robust supervised learning pipelines for drum onset transcription and facilitating detailed symbolic representation in downstream MIR and creative workflows (Zehren et al., 2021, Riley et al., 29 Sep 2025).
