wav2shape: Hearing the Shape of a Drum Machine

Published 20 Jul 2020 in cs.SD, cs.LG, and eess.AS | arXiv:2007.10299v1

Abstract: Disentangling and recovering physical attributes, such as shape and material, from a few waveform examples is a challenging inverse problem in audio signal processing, with numerous applications in musical acoustics as well as structural engineering. We propose to address this problem via a combination of time--frequency analysis and supervised machine learning. We start by synthesizing a dataset of sounds using the functional transformation method. Then, we represent each percussive sound in terms of its time-invariant scattering transform coefficients and formulate the parametric estimation of the resonator as multidimensional regression with a deep convolutional neural network. We interpolate scattering coefficients over the surface of the drum as a surrogate for potentially missing data, and study the response of the neural network to interpolated samples. Lastly, we resynthesize drum sounds from scattering coefficients, therefore paving the way towards a deep generative model of drum sounds whose latent variables are physically interpretable.

Summary

  • The paper introduces a novel inverse framework that extracts drum physical parameters using PDE-based modeling, scattering transforms, and CNN regression.
  • It leverages time–frequency scattering to capture stable, higher-order modulations critical for delineating membrane vibration characteristics.
  • Experimental results confirm robust parameter estimation and audio reconstruction, enabling continuous, physically informed sound synthesis.

wav2shape: Reconstruction and Identification of Drum Physical Parameters from Audio

Context and Motivation

The challenge of inferring physical attributes—such as membrane shape, modal parameters, and damping—from audio recordings is central in musical acoustics and digital audio synthesis. Existing drum transcription approaches are largely taxonomic, segmenting percussive sounds into discrete classes and overlooking the rich continuous variability caused by instrument geometry and playing technique. This limits both MIR capabilities and generative synthesis accuracy. The "wav2shape: Hearing the Shape of a Drum Machine" (2007.10299) paper develops a principled inverse framework that marries PDE-based physical modeling, feature engineering via scattering transforms, and supervised representation learning to estimate physically interpretable parameters of drum sounds from audio samples.

Figure 1: Diverse drums spanning cultures and centuries, illustrating the geometric and material variation that underpins timbral diversity.

Physical Modeling of Drum Sound Synthesis

The work formalizes membrane vibration as a fourth-order PDE with resonant (wave speed, stiffness) and dissipative (air drag, boundary coupling) terms. The solution is obtained via the Functional Transformation Method (FTM), yielding a modal representation whose coefficients directly encode physical shape and material properties: fundamental frequency (ω), sustain (τ), frequency-dependent damping (p), inharmonicity (D), and aspect ratio (α). These parameters enable fine-grained synthesis with transparent physical control.
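
To make the modal picture concrete, the following minimal sketch (not the paper's exact FTM solution; `modal_drum` and its dispersion and damping laws are illustrative assumptions) synthesizes a rectangular-membrane tone as a sum of exponentially damped sinusoids whose frequencies and decay rates are steered by the five parameters:

```python
import numpy as np

def modal_drum(omega=120.0, tau=0.5, p=0.02, D=1e-4, alpha=0.8,
               n_modes=8, sr=22050, dur=1.0):
    """Sum of exponentially damped sinusoids over rectangular-membrane
    modes (m, n). The dispersion and damping laws below are simplified
    stand-ins that mimic the qualitative roles of the five parameters."""
    t = np.arange(int(sr * dur)) / sr
    x = np.zeros_like(t)
    for m in range(1, n_modes + 1):
        for n in range(1, n_modes + 1):
            k2 = m ** 2 + (n / alpha) ** 2               # squared modal wavenumber
            f_mn = omega * np.sqrt(k2 * (1.0 + D * k2))  # D bends the partial series
            decay = (1.0 + p * k2) / tau                 # p speeds up high-mode decay
            x += np.exp(-decay * t) * np.sin(2 * np.pi * f_mn * t) / k2
    return x / np.max(np.abs(x))

y = modal_drum()  # one second of a synthetic drum hit at 22.05 kHz
```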

The authors implement a real-time VST plugin utilizing this parametric model. User-controllable GUI elements (Figure 2) map directly to the latent physical parameters, facilitating intuitive exploration of drum timbres.

Figure 2: wav2shape real-time VST plugin GUI, exposing physically interpretable controls (e.g., sustain, roundness, inharmonicity).

Feature Engineering: Time–Frequency Scattering Transform

Standard descriptors (MFCC, CQT) fail to robustly encode nonstationary percussive signals. Instead, the paper leverages the scattering transform—a cascade of wavelet modulus and averaging operations—which yields features that are stable to deformation, invariant to affine transforms (e.g., gain, bias), and capable of demodulating fast phase variation inherent to drum attacks. Second-order coefficients, in particular, capture higher-order spectrotemporal modulations critical for distinguishing membrane physics.
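
As a rough illustration, such features can be computed with Kymatio's `Scattering1D` (the library the authors later use for inversion); the input length, Q value, and exact form of the log-scaling below are assumptions:

```python
import torch
from kymatio.torch import Scattering1D

T = 2 ** 14                        # fixed input length (~0.74 s at 22.05 kHz)
scattering = Scattering1D(J=8, shape=T, Q=1, max_order=2)

x = torch.randn(1, T)              # stand-in for a drum waveform
Sx = scattering(x)                 # (1, n_paths, T / 2**8) scattering coefficients
features = torch.log1p(Sx / 1e-3)  # adaptive log-scaling with eps = 1e-3
```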

Regression Architecture: Deep Convolutional Network

Scattering coefficients are log-transformed and provided as input to a 1D CNN (wav2shape) comprising four convolutional blocks followed by fully connected layers. The network is trained with Adam under an MSE loss to estimate the 5-D vector of physical parameters. Hyperparameters for the scattering scale (J) and order (N) are extensively tuned.
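
A hedged sketch of such a network in PyTorch is shown below; the paper specifies four convolutional blocks, fully connected layers, Adam, and MSE, but the channel widths, kernel sizes, pooling, and learning rate here are illustrative guesses:

```python
import torch
import torch.nn as nn

class Wav2Shape(nn.Module):
    def __init__(self, n_paths, n_params=5):
        super().__init__()
        blocks, c_in = [], n_paths
        for c_out in (64, 64, 128, 128):          # four conv blocks
            blocks += [nn.Conv1d(c_in, c_out, kernel_size=3, padding=1),
                       nn.BatchNorm1d(c_out), nn.ReLU(), nn.AvgPool1d(2)]
            c_in = c_out
        self.conv = nn.Sequential(*blocks, nn.AdaptiveAvgPool1d(1))
        self.head = nn.Sequential(nn.Flatten(),
                                  nn.Linear(128, 64), nn.ReLU(),
                                  nn.Linear(64, n_params))

    def forward(self, s):                         # s: (batch, paths, frames)
        return self.head(self.conv(s))

model = Wav2Shape(n_paths=120)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
s = torch.randn(16, 120, 64)                      # batch of log-scattering features
loss = nn.MSELoss()(model(s), torch.rand(16, 5))  # targets normalized to [0, 1]
loss.backward(); opt.step()
```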

Regression Performance and Robustness

Experiments demonstrate that second-order (N = 2) scattering with J = 8 and log-scaling constant ε = 10⁻³ achieves superior regression accuracy. Validation loss (~0.013) is well over an order of magnitude lower than that of uniform random guessing in the normalized parameter space (0.87), confirming effective generalization beyond the training set. The distribution of regression errors by parameter reveals best performance for ω and τ, with diminished precision for D and p, attributable to modal truncation and physical influence profiles.

Figure 3: Training curves and regression error breakdown over physical parameters under different scattering hyperparameters.

Stroke Location Interpolation and Feature Linearization

Evaluating the system's ability to generalize to off-center drum strokes, scattering coefficients are linearly interpolated over the membrane surface. Heatmaps of Laplacian norms (Figure 4) demonstrate that the scattering domain is approximately locally linear with respect to excitation location—an improvement over the Fourier modulus domain—facilitating transfer beyond the central stroke regime.

Figure 4: Laplacian heatmaps comparing scattering and Fourier features; scattering exhibits reduced curvature and higher fidelity for interpolated sound localization.
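
A minimal sketch of this surrogate construction, assuming SciPy's piecewise-linear interpolator and toy array shapes (the paper's exact interpolation scheme may differ):

```python
import numpy as np
from scipy.interpolate import LinearNDInterpolator

rng = np.random.default_rng(0)
xy_grid = rng.random((25, 2))     # measured stroke positions on the drumhead
S_grid = rng.random((25, 120))    # one scattering vector per position

interp = LinearNDInterpolator(xy_grid, S_grid)  # piecewise-linear in 2-D
S_query = interp([[0.3, 0.7]])    # surrogate coefficients at an unseen stroke
# Queries outside the convex hull of xy_grid return NaN.
```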

Regression accuracy on interpolated scattering features, though lower than direct synthesis, remains substantially better than chance, suggesting the network's learned mapping is robust to moderate spatial variance.

Figure 5: Prediction error distributions for synthesized and interpolated scattering features across dataset splits.

Inversion: Audio Reconstruction from Scattering Representations

The paper advances beyond parameter estimation by reconstructing time-domain audio from scattering features using gradient-based optimization (reverse-mode autodiff in PyTorch/Kymatio). Reconstruction quality is controlled by scattering scale and order: deeper (order 2) representations enable sharper and less artifact-prone audio, particularly for larger time-averaging scales. This paves the way for transforming wav2shape from a discriminative regression model to a generative audio model controllable via interpretable physics.

Figure 6: Spectrograms of original and reconstructed drum sounds from first and second-order scattering features at varying scales, exhibiting trade-offs between sharpness and invariance.
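
The inversion loop amounts to optimizing a waveform so that its scattering coefficients match the target's. A minimal sketch, assuming Adam, a noise initialization, and a fixed iteration budget (the paper specifies only reverse-mode autodiff through PyTorch/Kymatio):

```python
import torch
from kymatio.torch import Scattering1D

T = 2 ** 14
scattering = Scattering1D(J=8, shape=T, max_order=2)

target = torch.randn(1, T)              # stand-in for a real drum recording
with torch.no_grad():
    S_target = scattering(target)

x = torch.randn(1, T, requires_grad=True)  # initialize waveform from noise
opt = torch.optim.Adam([x], lr=0.1)
for step in range(500):
    opt.zero_grad()
    loss = torch.mean((scattering(x) - S_target) ** 2)
    loss.backward()   # gradients flow back through the scattering cascade
    opt.step()
```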

Implications and Future Directions

Practically, wav2shape enables physically informed audio synthesis, sound classification, and model-based MIR tasks with a high degree of expressive control. Theoretically, this work substantiates the utility of scattering representations for bridging nonlinear physical acoustics and deep learning. The demonstrated robustness to off-center strokes suggests potential extensibility to arbitrary playing techniques, though interpolation error remains a limiting factor.

Critical challenges for future research include:

  • Enhancing regression fidelity for inharmonicity and geometric shape parameters across modes.
  • Addressing generalization limitations for arbitrary stroke locations and drum geometries (e.g., circular).
  • Reducing reliance on large simulated datasets via sim2real transfer, potentially leveraging unsupervised or reinforcement learning paradigms.
  • Integrating scattering-based generative audio models (GANs, inverse networks) for physically interpretable audio synthesis in production environments.

Conclusion

wav2shape (2007.10299) constitutes a robust framework for the supervised recovery of physically meaningful drum shape parameters from audio, leveraging scattering transforms and deep neural architectures. Its ability to generalize across a physically simulated parameter space and to reconstruct time-domain audio from interpretable features underscores the utility of hybrid physical–data-driven modeling in musical acoustics. The approach opens avenues for continuous, physically grounded control of percussive sound synthesis and motivates future integration of physical modeling with advanced generative and learning methodologies in both research and applied audio technology contexts.
