DexDrummer: Drum Audio Research Nexus

Updated 4 July 2026

DexDrummer is a research nexus focused on recovering symbolic drum events from audio, curating large-scale transcription datasets, and synthesizing drum sounds from compact, controllable representations.
Work in this area uses partially fixed NMF with momentum-based optimization to improve transcription accuracy while keeping the detected drum events human-readable.
It also spans dataset engineering (ADTOF), controllable drum synthesis (StyleWaveGAN, DDX7), and drum-machine inversion for analysis-by-synthesis separation.

DexDrummer, as the term is used across the cited literature, does not denote a single universally fixed artifact. Instead, the materials associate it with several closely related drum-audio research directions: interpretable automatic drum transcription based on partially fixed nonnegative matrix factorization (NMF), the large-scale crowdsourced transcription dataset ADTOF, and drum-machine-style analysis-by-synthesis source separation; adjacent work on controllable drum and instrument synthesis provides additional technical context (Foster et al., 16 Jul 2025, Zehren et al., 2021, Torres et al., 6 May 2025, Lavault et al., 2022, Caspe et al., 2022). This suggests that DexDrummer is best understood as a drum-centric research nexus organized around three recurrent objectives: recovering symbolic drum events from audio, learning or curating representations that make those events trainable at scale, and rendering or reconstructing drum sounds from compact, controllable parameterizations.

1. Terminological scope and research landscape

Within the cited papers, the name is attached to different but structurally related objects. One paper explicitly describes “DexDrummer-style automatic drum transcription” in terms of a partially fixed NMF applied to a magnitude spectrogram (Foster et al., 16 Jul 2025). Another identifies DexDrummer with the dataset ADTOF (“Automatic Drums Transcription On Fire”), a large non-synthetic corpus for drum transcription in the presence of melodic instruments (Zehren et al., 2021). A third presents the Inverse Drum Machine as a “DexDrummer-style” or drum-machine-style separator that performs source separation by first transcribing drum events and then synthesizing one-shot drum sounds (Torres et al., 6 May 2025). Two further papers are directly relevant as technical comparators for controllable synthesis: StyleWaveGAN for drum and cymbal waveform generation (Lavault et al., 2022) and DDX7 for differentiable FM resynthesis of musical instrument sounds (Caspe et al., 2022).

Paper	DexDrummer relation	Central object
(Foster et al., 16 Jul 2025)	“DexDrummer-style automatic drum transcription”	Partially fixed NMF with momentum
(Zehren et al., 2021)	DexDrummer identified with ADTOF	Large DTM dataset
(Torres et al., 6 May 2025)	“DexDrummer-style” drum-machine inversion	Joint transcription and synthesis
(Lavault et al., 2022)	Relevant comparator	Controllable drum waveform GAN
(Caspe et al., 2022)	Relevant comparator	Differentiable FM resynthesis

A common misconception is to treat DexDrummer as only a drum synthesizer or only a transcription model. The cited usage is broader. It covers symbolic inference from mixtures, dataset construction for supervised DTM, and analysis-by-synthesis reconstruction. The shared thread is not a single architecture but a recurring commitment to drum-specific structure: onset events, instrument classes, interpretable control variables, and mappings between symbolic and audio domains.

2. DexDrummer-style automatic drum transcription via partially fixed NMF

In the transcription setting, the input is a recorded musical piece represented by its magnitude spectrogram $\mathbf{V}\in \mathbb{R}_+^{m\times n}$ , where $m$ is the number of frequency bins and $n$ is the number of time frames. The objective is to recover when each drum instrument is active. The factorization model is $\mathbf{V}\approx \mathbf{W}\mathbf{H}$ , with a nonnegative dictionary $\mathbf{W}$ of spectral templates and a nonnegative activation matrix $\mathbf{H}$ . For drum transcription, the formulation is partially fixed: the drum dictionary $\mathbf{W}_D$ is pre-trained and held fixed, while the drum activations $\mathbf{H}_D$ and the harmonic factors $\mathbf{W}_H,\mathbf{H}_H$ are learned by minimizing

$\min_{\mathbf{H}_D,\mathbf{W}_H,\mathbf{H}_H \ge 0} F(\mathbf{H}_D,\mathbf{W}_H,\mathbf{H}_H) = \frac{1}{2}\left\|\mathbf{V}-\left(\mathbf{W}_D\mathbf{H}_D+\mathbf{W}_H\mathbf{H}_H\right)\right\|_F^2.$

The paper emphasizes that this matters because $\mathbf{W}_D$ can be aligned with human-readable drum parts such as snare, bass drum, and hi-hat, making the transcription interpretable and directly convertible to notation (Foster et al., 16 Jul 2025).

Two alternating-minimization optimizers are compared. The first is the multiplicative update rule (MUR): the paper proves that, as long as the denominators remain positive, each block update leaves $F$ nonincreasing. However, MUR does not guarantee that each convex subproblem is solved to any prescribed accuracy. The alternative is projected gradient descent with momentum, extended from Guan et al.’s NeNMF/OGM method. For a subproblem of the form

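$\min_{\mathbf{H}\ge 0} \frac{1}{2}\left\|\mathbf{V}-\mathbf{W}\mathbf{H}\right\|_F^2,$

the method uses gradient descent with a step size based on the Lipschitz constant $\|\mathbf{W}^\top\mathbf{W}\|_2$ of the gradient, projects to the nonnegative orthant via ReLU, and adds Nesterov acceleration. Its main theoretical advantage is the subproblem guarantee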

$m$ 4

which solves the subproblem to within an additive $m$ 5 error.
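
The momentum-based inner solver described above can be illustrated with a minimal pure-Python sketch for a single subproblem with the dictionary held fixed. It is not the paper's implementation: the function names, the list-of-rows matrix layout, and the use of the Frobenius norm of $\mathbf{W}^\top\mathbf{W}$ as a conservative upper bound on the Lipschitz constant are all assumptions made for illustration. Accelerated projected gradient schemes of this kind are known to shrink the objective gap at a rate on the order of $1/k^2$ in the number of inner steps $k$ for convex problems.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def objective(v, w, h):
    """Return 0.5 * ||V - W H||_F^2."""
    wh = matmul(w, h)
    return 0.5 * sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw))


def nesterov_nnls(v, w, h0, n_steps=50):
    """Approximately solve min_{H >= 0} 0.5 * ||V - W H||_F^2 with W fixed."""
    wt = transpose(w)
    wtw = matmul(wt, w)
    wtv = matmul(wt, v)
    # Assumption: the Frobenius norm of W^T W upper-bounds its spectral norm,
    # so it is a valid (if conservative) Lipschitz constant for the gradient.
    lip = sum(x * x for row in wtw for x in row) ** 0.5
    h = [row[:] for row in h0]   # current iterate
    y = [row[:] for row in h0]   # extrapolated (momentum) point
    t = 1.0
    for _ in range(n_steps):
        wtwy = matmul(wtw, y)
        # Gradient step at y (gradient is W^T W Y - W^T V), then ReLU projection.
        h_next = [
            [max(0.0, yij - (gij - cij) / lip) for yij, gij, cij in zip(ry, rg, rc)]
            for ry, rg, rc in zip(y, wtwy, wtv)
        ]
        # Nesterov momentum coefficient.
        t_next = (1.0 + (1.0 + 4.0 * t * t) ** 0.5) / 2.0
        beta = (t - 1.0) / t_next
        y = [[a + beta * (a - b) for a, b in zip(rn, ro)] for rn, ro in zip(h_next, h)]
        h, t = h_next, t_next
    return h
```

In an alternating scheme, a routine like this would be called once per block (drum activations, harmonic templates, harmonic activations) in each outer iteration, with the data term adjusted to the residual left by the other blocks.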

The time-complexity comparison is explicit. Updating one matrix costs the same order of arithmetic operations in either method, so a single MUR iteration and a single OGM iteration have comparable per-step cost. NeNMF is more expensive overall because each outer alternating step contains several inner OGM steps, making its cost larger than that of MUR by roughly the number of inner steps. In the experiments, this was balanced by running MUR for 100 iterations and NeNMF for 10 outer iterations with 10 inner OGM steps each, giving similar runtimes.
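
For comparison with the momentum-based solver, one pass of multiplicative updates for the partially fixed model can be sketched as follows. This is a minimal pure-Python illustration that uses the standard Frobenius-norm multiplicative updates; the function names and the list-of-rows matrix layout are assumptions, and the code relies on all denominators staying positive, as the monotonicity argument requires.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def objective(v, wd, hd, wh, hh):
    """Return 0.5 * ||V - (W_D H_D + W_H H_H)||_F^2."""
    model = add(matmul(wd, hd), matmul(wh, hh))
    return 0.5 * sum((x - y) ** 2 for rv, rm in zip(v, model) for x, y in zip(rv, rm))


def mult_update(x, numer, denom):
    """Elementwise x * numer / denom; denominators are assumed positive."""
    return [
        [xij * nij / dij for xij, nij, dij in zip(rx, rn, rd)]
        for rx, rn, rd in zip(x, numer, denom)
    ]


def mur_iteration(v, wd, hd, wh, hh):
    """One pass over the three free blocks; the drum dictionary W_D is never changed."""
    model = add(matmul(wd, hd), matmul(wh, hh))
    hd = mult_update(hd, matmul(transpose(wd), v), matmul(transpose(wd), model))
    model = add(matmul(wd, hd), matmul(wh, hh))
    wh = mult_update(wh, matmul(v, transpose(hh)), matmul(model, transpose(hh)))
    model = add(matmul(wd, hd), matmul(wh, hh))
    hh = mult_update(hh, matmul(transpose(wh), v), matmul(transpose(wh), model))
    return hd, wh, hh
```

Each call costs three block updates, which matches the matched-runtime budget described above: 100 such passes for MUR against 10 outer passes of 10 inner steps for the momentum method.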

The empirical study uses ENST-Drums, specifically the minus-one subset with 28 tracks of about 70 seconds each sampled at 16 kHz, and an original recording, “Every Now and Then” by the author’s band Threadbare, sampled at 44.1 kHz. Both datasets are converted to magnitude spectrograms using STFT with window length 2048 and hop size 512; the drum dictionary $\mathbf{W}_D$ is built by extracting and time-averaging the spectrogram of individual drum hits for the relevant instruments. Ground truth is used first to match the largest entries of $\mathbf{H}_D$ to the known number of hits in each frame, and second in a median-threshold evaluation in which a detected onset is correct if it falls within 50 ms of a ground-truth onset.
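
The 50 ms tolerance criterion can be made concrete with a small scoring sketch. The code below is an illustration under assumptions, not the paper's evaluation script: the function names are hypothetical, matching is a simple greedy one-to-one pass over sorted onset times, and the frame-to-time conversion uses the hop size of 512 at 16 kHz quoted above (32 ms per frame).

```python
def match_onsets(detected, reference, tol=0.05):
    """Count one-to-one matches between sorted onset times (seconds)."""
    detected, reference = sorted(detected), sorted(reference)
    i = j = hits = 0
    while i < len(detected) and j < len(reference):
        if abs(detected[i] - reference[j]) <= tol:
            hits += 1
            i += 1
            j += 1
        elif detected[i] < reference[j]:
            i += 1  # unmatched detection (false positive)
        else:
            j += 1  # unmatched reference onset (miss)
    return hits


def f_score(detected, reference, tol=0.05):
    """Harmonic mean of precision and recall under the tolerance window."""
    hits = match_onsets(detected, reference, tol)
    if hits == 0:
        return 0.0
    precision = hits / len(detected)
    recall = hits / len(reference)
    return 2.0 * precision * recall / (precision + recall)


def frames_to_seconds(frames, hop=512, sample_rate=16000):
    """Convert spectrogram frame indices to onset times."""
    return [f * hop / sample_rate for f in frames]
```

With a 32 ms frame spacing, the 50 ms window spans fewer than two frames on either side of a reference onset, so small frame-level errors in the activations translate directly into misses.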

The reported results consistently favor the momentum-based method. For the ground-truth-number-of-hits criterion, the average $F_1$-scores are 0.599 for MUR and 0.620 for NeNMF on ENST-Drums, and 0.854 for MUR and 0.975 for NeNMF on “Every Now and Then.” Under the median-threshold evaluation, the best reported ENST-Drums score is 0.629 for bass drum with NeNMF, compared with 0.290 for MUR; on “Every Now and Then,” NeNMF scores 0.369 on snare and 0.454 on bass drum, versus 0.275 and 0.444 for MUR. The paper’s interpretation is that momentum helps because it solves each convex NMF subproblem much more accurately per iteration. A plausible implication is that DexDrummer-style transcription, in this formulation, depends as much on optimizer behavior as on the factorization model itself.

3. ADTOF as the large-scale dataset interpretation of DexDrummer

In the dataset-centered usage, DexDrummer corresponds to ADTOF, a large, real-world, non-synthetic dataset for automatic drum transcription in the presence of melodic instruments (DTM). The dataset was created to address a central limitation of supervised DTM: the public real-music datasets were small, with ENST at 1.02 hours, MDB at 0.35 hours, and RBMA at 1.72 hours. The paper argues that “quantity and realism” are both needed for DTM, and that existing datasets did not provide both together (Zehren et al., 2021).

ADTOF is constructed from custom charts shared online for rhythm games such as Rock Band and PhaseShift. These charts usually include the audio track, drum-note annotations in a MIDI-like symbolic form, beat information, metadata, and sometimes animation labels for in-game characters. The authors downloaded the top-rated 1700 tracks from Rhythm Gaming World and converted them into a standard ADT format. The resulting dataset is described as more than 114 hours of annotated music, with real-world audio rather than synthetic audio and five drum classes after normalization and merging of labels. The paper emphasizes that this makes ADTOF about two orders of magnitude larger than other non-synthetic DTM datasets.

The curation pipeline is a substantive part of the contribution because the original annotations are crowdsourced and error-prone. Two recurring issues are highlighted. The first is inaccurate timing. Typical DTM models require annotations to be roughly within 10 ms of the actual onset, but discrepancies of around 50 ms were often observed. To correct timing, the authors adapted the method of Driedger et al. for beat correction, using Böck’s beat tracker via madmom: estimated beat positions are used to snap the human beat annotations, and the correction is linearly interpolated for notes between beats. A sanity check requires that corrected beats align sufficiently well with the algorithmic beats and that corrections not exceed 80 ms; 140 tracks are removed at this stage.
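
The snapping and interpolation step can be sketched as follows. This is a simplified illustration, not the authors' pipeline or the Driedger et al. method: it assumes the estimated beats are already available as a list of times (for example from a beat tracker), snaps each annotated beat to its nearest estimate, rejects the track when any shift exceeds 80 ms, and linearly interpolates the shift for notes between beats. All function names are hypothetical.

```python
from bisect import bisect_right


def beat_offsets(annotated_beats, estimated_beats, max_shift=0.08):
    """Shift (seconds) that moves each annotated beat onto its nearest estimated beat.

    Returns None when any required shift exceeds max_shift, i.e. the track
    fails the sanity check and would be discarded.
    """
    offsets = []
    for beat in annotated_beats:
        nearest = min(estimated_beats, key=lambda e: abs(e - beat))
        shift = nearest - beat
        if abs(shift) > max_shift:
            return None
        offsets.append(shift)
    return offsets


def correct_notes(notes, annotated_beats, offsets):
    """Shift each note by the offset interpolated between its surrounding beats."""
    corrected = []
    for t in notes:
        k = bisect_right(annotated_beats, t)
        if k == 0:
            shift = offsets[0]       # before the first beat
        elif k == len(annotated_beats):
            shift = offsets[-1]      # at or after the last beat
        else:
            t0, t1 = annotated_beats[k - 1], annotated_beats[k]
            frac = (t - t0) / (t1 - t0)
            shift = offsets[k - 1] + frac * (offsets[k] - offsets[k - 1])
        corrected.append(t + shift)
    return corrected
```

Interpolating the shift, rather than snapping every note independently, preserves the relative timing of notes inside a beat while still removing slow drift in the annotation.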

The second issue is inconsistent labeling. The authors therefore reduce and merge the label set to five ADTOF classes: BD for bass drum, SD for snare drum, TT for toms, HH for hi-hat, and CY + RD for crash cymbal + ride cymbal. They also distinguish between GAMEPLAY annotations, which represent what the player actually performs, and ANIMATION annotations, which are richer labels used to animate the in-game drummer. When ANIMATION annotations are available, they are used to resolve ambiguities and are treated as ground truth where they contradict GAMEPLAY annotations. A final manual and automated quality pass on the 10% of tracks with the lowest prediction score from a preliminary ADT model removes a further 88 tracks.
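
The label reduction can be pictured as a lookup table plus a precedence rule. The sketch below is purely illustrative: the raw label names are hypothetical, since the chart vocabulary is not listed here, and the rule "use ANIMATION events whenever they exist" is a simplification of the ambiguity-resolution procedure described above.

```python
# Hypothetical raw labels mapped onto the five ADTOF classes.
RAW_TO_ADTOF = {
    "kick": "BD",
    "snare": "SD",
    "tom_high": "TT",
    "tom_mid": "TT",
    "tom_low": "TT",
    "hihat_closed": "HH",
    "hihat_open": "HH",
    "crash": "CY+RD",
    "ride": "CY+RD",
}


def merge_labels(gameplay, animation=None):
    """Map (time, raw_label) events to ADTOF classes.

    ANIMATION events, when present, take precedence over GAMEPLAY events.
    Events that collapse to the same (time, class) pair are deduplicated,
    and labels outside the table are dropped.
    """
    source = animation if animation else gameplay
    merged = {(t, RAW_TO_ADTOF[label]) for t, label in source if label in RAW_TO_ADTOF}
    return sorted(merged)
```

For example, two simultaneous tom hits in the raw chart become a single TT event after merging, which is the kind of vocabulary reduction that keeps labels consistent across charts from different authors.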

The validation model is the CRNN from Vogl et al. It takes a log-frequency log-magnitude STFT spectrogram with window size 2048 samples, hop size 441 samples, target frame rate 100 Hz, 12 triangular filters per octave, and frequency range 20 to 20,000 Hz. The output is a set of activations over the five drum classes, each value in $[0, 1]$, and a peak-picking algorithm converts activations into symbolic note events. Evaluation uses the standard F-measure from mir_eval with a 50 ms tolerance window and an overall SUM score. For ADTOF, the dataset is partitioned into 10 splits with no artist overlap; one split is used as test, one as validation, and the remaining eight as training.
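
The conversion from activations to note events can be illustrated with a basic peak picker. This is a minimal sketch, not the peak-picking algorithm used with the CRNN: the threshold, the minimum gap, and the function name are assumptions. The frame rate follows from the stated hop size if the audio is sampled at 44.1 kHz (44100 / 441 = 100 frames per second), which is an inference rather than a figure given above.

```python
def pick_peaks(activation, threshold=0.5, frame_rate=100.0, min_gap=0.02):
    """Return onset times (seconds) at local maxima that reach the threshold.

    `activation` is one class's sequence of values in [0, 1], one per frame.
    Peaks closer than `min_gap` seconds to the previous onset are skipped.
    """
    onsets = []
    for i, value in enumerate(activation):
        left = activation[i - 1] if i > 0 else float("-inf")
        right = activation[i + 1] if i + 1 < len(activation) else float("-inf")
        is_peak = value >= threshold and value > left and value >= right
        if not is_peak:
            continue
        t = i / frame_rate
        if not onsets or t - onsets[-1] >= min_gap:
            onsets.append(t)
    return onsets
```

At 100 frames per second each frame covers 10 ms, so the 50 ms evaluation window corresponds to five frames on either side of a reference onset.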

The reported outcomes are important for interpreting DexDrummer as data infrastructure rather than only as an algorithm. When tested on ADTOF, the CRNN performs at a level comparable to its performance on the other real datasets, suggesting that the annotations are mostly consistent with the learned model’s expectations. When trained solely on ADTOF, the model performs as well as or better than the state-of-the-art training recipe that uses synthetic plus real datasets, outperforms the baseline on most evaluations, and generalizes to unseen datasets. At the same time, the paper notes limitations: the dataset is strongly biased toward Rock, heavily Western, and contains relatively few “World” tracks; hi-hat remains relatively weak, likely because some gameplay-specific encodings were not fully corrected.

4. Controllable synthesis references: StyleWaveGAN and DDX7

For DexDrummer understood as a controllable drum generator, the most direct comparator in the cited materials is StyleWaveGAN. It is a conditional, style-based drum sound generator derived from StyleGAN and StyleGAN2 but adapted to 1D audio waveforms. The model generates drum and cymbal sounds from an augmented ENST-Drums subset and explicitly supports kick, snare, toms, closed hi-hat, and open hi-hat. Conditioning is provided by 5 one-hot drum-type labels and by continuous AudioCommons timbral descriptors, notably brightness, depth, and warmth; these controls can be used individually or jointly (Lavault et al., 2022).

Architecturally, StyleWaveGAN replaces 2D convolutions with 1D causal convolutions, uses an averaging filter before each convolution block for upsampling, reduces the mapping network to 4 layers instead of 8, adopts WGAN-LP, adds input and output skips in the generator, and uses a residual discriminator. Noise injection is made style-dependent through

$n$ 3

where $n$ 4 is the layer input, $n$ 5 is the transformed style vector, $n$ 6 is shaped noise shared across channels, and $n$ 7 is a bias term. To reduce noisy tails, the network output is multiplied by a learned class-specific envelope derived from the filtered mean of the analytic signal envelope from the Hilbert transform over normalized training examples:

$n$ 8

The training data start from 350 close-miked ENST-Drums samples and are augmented with SuperVP through transient/attack gain changes, noise component remixing, independent transposition of the source signal, and spectral envelope transposition, yielding about 120k samples totaling roughly 100 hours. Because the class distribution is highly imbalanced—Kick 3%, Snare 18%, Toms 45%, Closed hi-hat 10%, Open hi-hat 22%—the paper proposes equal-proportion sampling, which draws uniformly from class-specific sub-datasets. Training uses 2 million samples, batch size 10, and about 200k iterations. The paper also proposes AutoFade,

$n$ 9

as an alternative to progressive growing, though the final interpretation is that there is no benefit from progressive growing or AutoFade in the generator, while progressive growing in the discriminator improves results.
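
The equal-proportion sampling mentioned above can be sketched in a few lines. This is an illustration under assumptions, not the authors' data loader: the function name and the dictionary-of-lists layout are hypothetical. Each draw first picks a class uniformly and then picks an item uniformly within that class, so rare classes such as kick appear as often as common ones such as toms.

```python
import random


def equal_proportion_batch(datasets, batch_size, rng):
    """Draw (label, item) pairs with every class equally likely.

    `datasets` maps a class label to the list of its training items.
    """
    classes = sorted(datasets)
    batch = []
    for _ in range(batch_size):
        label = rng.choice(classes)                      # uniform over classes
        batch.append((label, rng.choice(datasets[label])))  # uniform within class
    return batch
```

The trade-off is that items from small classes are revisited far more often than items from large ones, which is why this kind of balancing is usually paired with augmentation.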

The practical headline claims are 44.1 kHz output, 1.5 s maximum duration, faster-than-real-time synthesis on a consumer GPU, and 52 drum sounds/s on a GTX 1080. Quality is evaluated with Frechet Audio Distance (FAD). At 16 kHz, NeuroDrum yields 25.35 and StyleWaveGAN@16 kHz yields 11.48. At 44.1 kHz, a WaveGAN baseline yields 13.08, StyleWaveGAN yields 7.75, StyleWaveGAN + AutoFade yields 6.84, and the best reported configuration—StyleWaveGAN + labels + AF + balanced data + envelope—yields 3.62. Descriptor-control evaluation reports nearly perfect ordering accuracy for brightness control at 1.00 / 0.94 / 0.98 in the three tests, and typical $\mathbf{V}\approx \mathbf{W}\mathbf{H}$ 0 values around 0.75 for brightness single, 0.70 for depth single, and 0.76 for warmth single.

A more general but still relevant synthesis reference is DDX7, a differentiable, audio-driven FM resynthesis system. DDX7 constrains the synthesis space by fixing the oscillator routing, fixing the frequency ratios from DX7-style preset patches, restricting the modulation index range to $\mathbf{V}\approx \mathbf{W}\mathbf{H}$ 1, using pitch conditioning, removing DX7 feedback, and learning only compact envelope-like controls in the form of six framewise oscillator output levels (Caspe et al., 2022). The basic FM equation is

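$y(t) = A \sin\!\left(2\pi f_c t + I \sin(2\pi f_m t)\right),$

with the sideband expansion

$y(t) = A \sum_{n=-\infty}^{\infty} J_n(I)\,\sin\!\left(2\pi (f_c + n f_m)\, t\right),$

where $f_c$ and $f_m$ are the carrier and modulator frequencies, $I$ is the modulation index, and $J_n$ is the Bessel function of the first kind of order $n$.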
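
The relationship between single-modulator FM and its sidebands can be checked numerically. The sketch below uses the textbook form of FM, a carrier at frequency `fc` phase-modulated by a sinusoid at frequency `fm` with modulation index `index`, and compares it with a truncated sum of sidebands at `fc + n * fm` weighted by Bessel functions of the first kind. The function names and the truncation orders are assumptions chosen for illustration; this is not DDX7 code.

```python
import math


def bessel_j(n, x, terms=30):
    """Bessel function of the first kind J_n(x), integer n, via its power series."""
    if n < 0:
        return (-1) ** (-n) * bessel_j(-n, x, terms)
    return sum(
        (-1) ** m / (math.factorial(m) * math.factorial(m + n)) * (x / 2) ** (2 * m + n)
        for m in range(terms)
    )


def fm_sample(t, fc, fm, index, amp=1.0):
    """Direct FM: a carrier whose phase is modulated by a sinusoid."""
    return amp * math.sin(2 * math.pi * fc * t + index * math.sin(2 * math.pi * fm * t))


def fm_sidebands(t, fc, fm, index, amp=1.0, order=20):
    """Same signal written as a sum of sinusoids at fc + n * fm (truncated)."""
    return amp * sum(
        bessel_j(n, index) * math.sin(2 * math.pi * (fc + n * fm) * t)
        for n in range(-order, order + 1)
    )
```

Because the sideband frequencies depend only on `fc` and `fm`, fixing the frequency ratios fixes where the partials can appear and leaves only their weights, controlled through the modulation index and output levels, to be learned. That is the sense in which the sideband locations are pre-aligned by design.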

The authors’ main optimization argument is that standard multi-scale spectral reconstruction losses provide weak pitch gradients, so FM resynthesis becomes more tractable when sideband locations are pre-aligned by design.

DDX7 is trained on URMP violin, flute, and trumpet stems, downsampled to 16 kHz and clipped into 4-second excerpts. Pitch is extracted with CREPE, loudness is A-weighted, the hop size is 64 samples, and the frame rate is 250 Hz. The causal TCN has 5 residual blocks, 128 hidden channels per block, kernel size 3, skip connections, weight normalization, dropout 0.5, ReLU activations, and about 400k parameters. Evaluation again uses FAD. Reported results show that HpN is better on violin and trumpet, DDX7 is better than HpN on flute, and the best DDX7 setting varies by instrument. The paper concludes that some constrained FM patches are more learnable than others and that patch choice matters a great deal. For DexDrummer, the significance is indirect but clear: compact, interpretable synthesis can work, but the synthesis structure itself is a first-order design variable.

5. Drum-machine inversion: the Inverse Drum Machine as a DexDrummer-style separator

The Inverse Drum Machine (IDM) formalizes a drum-machine-style source-separation paradigm in which separation is achieved by transcribing drum events, synthesizing one-shot samples, and sequencing those samples back into individual stems and the mixture. The input is a drum mixture $\mathbf{V}\approx \mathbf{W}\mathbf{H}$ 4, cropped to random 8-second segments during training; the paper explicitly states that the experiments use drum-only recordings, not general music mixtures (Torres et al., 6 May 2025).

Feature extraction converts the waveform to a log-mel spectrogram with sample rate 16 kHz, 250 mel bands, window size 1024, and hop size 256 samples. A ConvNeXt encoder produces frame-level features $\mathbf{V}\approx \mathbf{W}\mathbf{H}$ 5 with $\mathbf{V}\approx \mathbf{W}\mathbf{H}$ 6. From these features, the model predicts onset activations $\mathbf{V}\approx \mathbf{W}\mathbf{H}$ 7, velocity activations $\mathbf{V}\approx \mathbf{W}\mathbf{H}$ 8, a mixture embedding $\mathbf{V}\approx \mathbf{W}\mathbf{H}$ 9, and track gains $\mathbf{W}$ 0. The activation signal is formed as

$\mathbf{W}$ 1

A key training detail is that ground-truth onsets are used in the activation path during training, so reconstruction does not backpropagate through the onset branch; peak picking is applied to predicted onset activations at inference.

The one-shot synthesis component is a TCN-based synthesizer that takes 1 second zero-padded white noise and transforms it into a drum sound. Conditioning is built by concatenating the mixture embedding and a one-hot encoding of the instrument class, then modulating the TCN by FiLM:

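$\mathrm{FiLM}(\mathbf{h}) = \boldsymbol{\gamma}\odot\mathbf{h} + \boldsymbol{\beta},$

where $\mathbf{h}$ is a hidden activation of the TCN and the scale $\boldsymbol{\gamma}$ and shift $\boldsymbol{\beta}$ are predicted from the conditioning vector.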

To model percussive decay, the output is multiplied by an exponential envelope,

$\mathbf{W}$ 3

The synthesizer uses 10 dilated causal convolution layers, residual connections, GELU activations, latent dimension 48, kernel size 15, receptive field about 767 ms, instance normalization, and a final output normalized to $[-1, 1]$.

Sequencing combines the synthesized one-shot, the activation sequence, and a global gain per class:

$\mathbf{W}$ 5
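
The sequencing step can be pictured as placing a scaled copy of the one-shot at every active sample. The sketch below is an assumption-laden illustration rather than the paper's implementation: it treats the activation as a sparse sample-rate sequence whose nonzero values carry velocity, realizes sequencing as a direct time-domain convolution truncated to the activation length, and uses hypothetical function names.

```python
def sequence_stem(activation, one_shot, gain):
    """Render one instrument stem from its activations, one-shot, and gain.

    Every nonzero activation sample triggers a copy of `one_shot`, scaled by
    the activation value and the class gain. Overlapping copies are summed.
    """
    out = [0.0] * len(activation)
    for n, a in enumerate(activation):
        if a == 0.0:
            continue
        for k, s in enumerate(one_shot):
            if n + k >= len(out):
                break  # truncate tails that run past the segment
            out[n + k] += gain * a * s
    return out


def mix(stems):
    """Sum equally long stems sample by sample to obtain the mixture."""
    return [sum(values) for values in zip(*stems)]
```

Because each stem is rendered separately before mixing, the same forward pass yields both the reconstructed mixture used for the training loss and the per-instrument stems used for separation.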

Training is end-to-end with three losses: a multi-resolution STFT reconstruction loss over $\mathbf{W}$ 6, binary cross-entropy transcription loss on onsets, and a cross-entropy mixture-embedding loss against a one-hot drum-kit label. This yields a weakly supervised system that requires transcription annotations and drum-kit labels, but not isolated stems. The paper also evaluates an $\mathbf{W}$ 7-Wiener masking variant with $\mathbf{W}$ 8.

Evaluation is conducted on StemGMD, built from the GMD MIDI dataset with synthesized isolated drum stems. The corpus contains about 136 hours, 9 drum classes, and 10 drum kits, of which 6 are used for comparability with LarsNet. Results are reported for a 9-class setting and a 5-class evaluation setting that groups kick, snare, hi-hats, toms, and cymbals. Baselines include NMFD 1A, NMFD 1B, NMFD 3, LarsNet, and LarsNet Mono. Metrics include Precision and Recall for transcription; SI-SDR for masked outputs; LSD; and Predicted Energy in Silence (PES).

The reported findings show that IDM achieves high precision and recall, both in the 90% range for most instruments, and significantly outperforms NMFD baselines in transcription. For separation, IDM substantially outperforms NMFD across metrics. Compared with LarsNet, IDM is slightly worse on masked metrics overall, though it sometimes slightly beats LarsNet Mono on LSD. Direct synthesis is particularly notable: IDM achieves better silence prediction than the supervised baseline and better LSD than LarsNet for most instruments except toms and cymbals; for many classes, direct synthesis yields lower LSD than masking. Parameter efficiency is a headline result: IDM has 465 parameters, whereas LarsNet has 49.1 million parameters.

The paper also records several limitations. Performance improves further when ground-truth onsets are supplied at inference, indicating that onset accuracy remains a bottleneck. The one-shot duration is fixed to 1 second, which may be too short for long-decay cymbals and hi-hats. The mixture embedding is tied to drum kits seen during training. Evaluation is restricted to drum-only mixtures. These constraints delimit the current scope of DexDrummer-style analysis-by-synthesis separation.

6. Recurring design principles, points of tension, and likely directions

Across the cited literature, a recurring design principle is explicit structural bias toward drum semantics. In the NMF transcription setting, the fixed drum dictionary $\mathbf{W}_D$ is aligned with human-readable parts such as snare, bass drum, and hi-hat (Foster et al., 16 Jul 2025). In ADTOF, raw chart vocabularies are merged into five drum classes to stabilize large-scale supervision (Zehren et al., 2021). In StyleWaveGAN, generation is conditioned on 5 one-hot drum-type labels and descriptor controls such as brightness, depth, and warmth (Lavault et al., 2022). In IDM, one-hot instrument classes, onset activations, velocity activations, and drum-kit embeddings define the separation pipeline (Torres et al., 6 May 2025). In DDX7, controllability is preserved by fixing FM routing and frequency ratios rather than learning an unconstrained latent synthesizer (Caspe et al., 2022). This suggests a common preference for interpretable intermediate variables over fully opaque end-to-end parameterizations.

A second recurring tension is the relationship between optimization guarantees and empirical quality. The partially fixed NMF study explicitly contrasts MUR, which guarantees only monotone decrease of the objective, with projected gradient descent with Nesterov momentum, which provides an explicit accuracy guarantee for each subproblem and better transcription quality at matched runtime (Foster et al., 16 Jul 2025). DDX7 makes a related but synthesis-oriented argument: standard spectral losses alone are often insufficient unless the synthesis parameterization is constrained so that optimization becomes well behaved (Caspe et al., 2022). A plausible implication is that DexDrummer research repeatedly encounters the same engineering fact in different forms: model structure and optimizer structure cannot be separated cleanly.

A third tension concerns supervision. ADTOF responds to the scarcity of realistic labeled mixtures by mining and correcting crowdsourced rhythm-game charts (Zehren et al., 2021). IDM reduces the supervision burden relative to stem-based separation by requiring onset annotations and kit labels rather than isolated stems (Torres et al., 6 May 2025). Yet both lines of work remain sensitive to annotation quality. ADTOF required beat correction, label merging, and track removal; IDM improves when ground-truth onsets are supplied at inference. This undermines any simple opposition between “data-driven” and “structured” approaches: both depend critically on the fidelity of event-level supervision.

Several limitations are also consistent across the literature. Dataset bias remains explicit in ADTOF, which is strongly biased toward Rock and heavily Western (Zehren et al., 2021). Control fidelity in StyleWaveGAN is strong for brightness but weaker for warmth outside the training range (Lavault et al., 2022). DDX7 is not uniformly better than a larger HpN baseline, and its performance depends heavily on patch choice (Caspe et al., 2022). IDM is evaluated only on drum-only mixtures and can be constrained by its 1-second one-shot assumption (Torres et al., 6 May 2025). These are not contradictions so much as boundary conditions: drum-specific structure helps, but each formulation inherits the limits of its supervision regime, control space, and evaluation domain.

Taken together, the papers define DexDrummer less as a single named system than as a convergent research program. Its characteristic elements are interpretable representations of drum events, audio models that retain some user-meaningful control, and learning pipelines that move between mixtures, symbolic drum descriptions, and reconstructed sound. The cited work shows that these elements can support transcription, dataset construction, generation, and source separation, but it also shows that each objective imposes a different bottleneck—optimizer accuracy, annotation reliability, synthesis controllability, or onset precision.