Sampling-Frequency-Independent Audio Source Separation Using Convolution Layer Based on Impulse Invariant Method
Abstract: Audio source separation is often used as preprocessing of various applications, and one of its ultimate goals is to construct a single versatile model capable of dealing with the varieties of audio signals. Since sampling frequency, one of the audio signal varieties, is usually application specific, the preceding audio source separation model should be able to deal with audio signals of all sampling frequencies specified in the target applications. However, conventional models based on deep neural networks (DNNs) are trained only at the sampling frequency specified by the training data, and there are no guarantees that they work with unseen sampling frequencies. In this paper, we propose a convolution layer capable of handling arbitrary sampling frequencies by a single DNN. Through music source separation experiments, we show that the introduction of the proposed layer enables a conventional audio source separation model to consistently work with even unseen sampling frequencies.
Knowledge gaps, limitations, and open questions
Below is a concise list of unresolved issues and concrete research directions that emerge from the paper’s methods, assumptions, and evaluations:
- End-to-end SFI design is incomplete: only the encoder/decoder are made SFI. The masking modules (temporal conv stacks with fixed sample-based dilations/kernels) remain sampling-rate dependent. How to make all convolutional blocks, dilations, and normalization layers sampling-frequency-independent while preserving stability and performance?
- Group normalization and other sample-based layers: the interaction between SFI layers and group normalization (and other normalization/regularization layers) at varying sampling rates is not addressed. What normalization strategies remain invariant in continuous time and work reliably across Fs?
- Training at a single sampling frequency: models are trained only at 16 kHz and tested on unseen Fs. How does performance change under multi-Fs training (mixed batches or curriculum) and does it further improve generalization or reduce aliasing artifacts?
- High-frequency content loss at higher Fs: when applying a 16 kHz–trained model at 32–48 kHz, filters effectively suppress frequencies above ~8 kHz, limiting full-band separation (e.g., cymbals, sibilance). How to retain or recover high-frequency details at higher Fs (e.g., via multi-band training, high-frequency specialist branches, or residual enhancement)?
- Aliasing mitigation is heuristic: zeroing channels whose center frequencies exceed Nyquist ignores bandwidth tails and transition regions, so aliasing can persist even when fc < Nyquist. Can principled anti-aliasing be incorporated (e.g., adaptive lowpass pre-filtering, tapered windowing of impulse responses, oversample-then-decimate within the layer, or learnable anti-alias filters)?
- Choice of analog-to-digital conversion: only the impulse invariant method is explored. How do alternative conversions (e.g., bilinear transform, matched-z, step-invariant, exact discretization of gammatone responses) or direct bandlimited interpolation affect performance and aliasing?
- FIR truncation/windowing effects: impulse responses are sampled and abruptly truncated to length L, potentially causing spectral leakage. Would applying differentiable windowing/tapering or optimizing L per Fs reduce artifacts and improve separation?
- Fixed MP-GTF parameterization: only center frequency f_m and phase phi_m are trained; bandwidth b_m, order p_m, and amplitude a_m are fixed/normalized. Does jointly learning bandwidths, orders, gains, or moving beyond gammatone (e.g., sums of damped sinusoids, Sinc/Butterworth/elliptic prototypes) increase flexibility and accuracy across Fs?
- Frequency allocation strategy: f_m are defined in absolute Hz. At higher Fs, more filters fall below Nyquist; at lower Fs, capacity is reduced. Would a relative (fraction-of-Nyquist) parametrization or learnable frequency warping yield more uniform capacity across sampling rates?
- Computational scaling with Fs: L and W scale to keep time windows constant, increasing kernel sizes and compute at high Fs. What are the runtime/memory costs for real-time deployment at 48–96 kHz, and can multi-rate or subband architectures reduce complexity while remaining SFI?
- Stability/ordering constraints: the model initializes f_m on an ERB grid but does not enforce monotonic ordering or minimum spacing during training. Do ordering/spacing constraints or regularizers improve coverage, reduce redundancy, and stabilize training?
- Consistency across batches and dynamic Fs changes: the layer regenerates weights when Fs changes, but behavior under frequent or per-utterance Fs changes, or mixed-Fs mini-batches, is not studied. What are best practices for caching, numerical stability, and training dynamics in such settings?
- Masking-module receptive field mismatch: dilations and kernel sizes in the masking network are sample-based, so the temporal receptive field (in ms) changes with Fs. What is the impact on temporal modeling and can continuous-time dilations or SFI temporal convolutions restore invariance?
- Evaluation limited to re-sampled MUSDB18-HQ: tests rely on resampled versions of the same dataset. How does the method perform on native recordings at diverse Fs (44.1, 48, 96 kHz) and across other domains (speech, environmental audio) to validate generality?
- Baseline breadth: comparisons exclude current SOTA music separation models (e.g., Demucs variants) and multi-Fs training baselines (e.g., stacked/multi-branch or bandwidth-expansion approaches) evaluated on unseen Fs. A head-to-head assessment is needed to contextualize gains and trade-offs.
- Stereo/spatial information not leveraged: left/right channels are processed independently. How does the SFI approach extend to multichannel separators that exploit spatial cues across varying Fs?
- Perceptual quality and artifact analysis: evaluations focus on SDR; subjective or perceptual metrics (e.g., MUSHRA listening tests, PESQ/ESTOI where relevant) and high-frequency artifact analyses are missing, especially at higher Fs where bandwidth truncation occurs.
- Robustness to resampling pipelines: different SRC filters/codecs introduce distinct bandlimits and aliasing. How sensitive is the SFI layer to real-world resampling artifacts and device-specific front-ends?
- Generalization to other layer types: applicability of SFI concepts to non-convolutional architectures (e.g., attention/transformers, state-space models) and hybrid time–frequency designs remains unexplored.
- Learnable anti-alias gating: the aliasing “zero-out” rule is non-differentiable and tied to Fs. Can a differentiable gating/regularization scheme learn to attenuate problematic bands while preserving useful near-Nyquist information?
- Downstream task integration: the paper motivates SFI as a universal preprocessor but does not evaluate end-to-end gains when coupled with downstream systems (ASR, transcription, beat tracking) operating at their native Fs. Do SFI separators improve downstream robustness without retraining?
- Error bars and statistical testing: standard errors are shown over random seeds, but statistical significance across sampling rates and instruments (and across tracks) is not formally tested. A more rigorous statistical analysis would strengthen claims of “consistent performance.”
- Upper/lower Fs limits: the method is evaluated from 8 to 48 kHz. How does it behave at extreme rates (e.g., 96 kHz for high-resolution audio, <8 kHz for telephony), and what adaptations are needed for stability and performance?
- Real-time latency guarantees: keeping window lengths constant in time suggests bounded latency, but actual end-to-end latency and variability across Fs are not reported. Can the approach meet real-time constraints uniformly across sampling rates?
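Several of the points above (FIR truncation, the constant-time window, and the Nyquist zero-out rule) refer to how the SFI layer regenerates discrete kernels from analog gammatone prototypes via the impulse invariant method. A minimal numpy sketch of that weight-generation step follows; the function name, the 4 ms window length, and the ERB-style bandwidth formula are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

def sfi_gammatone_kernels(fc, phase, fs, win_sec=0.004, bw=None, order=4):
    """Sample analog gammatone impulse responses at rate fs (impulse invariance).

    fc, phase: per-filter center frequencies (Hz) and phases (rad).
    Returns an (M, L) array of FIR kernels, with L scaling so the
    time window stays constant across sampling frequencies.
    """
    fc = np.asarray(fc, dtype=float)
    phase = np.asarray(phase, dtype=float)
    if bw is None:
        # ERB-style bandwidth in Hz (assumption; the paper fixes b_m separately)
        bw = 24.7 + fc / 9.265
    L = int(round(win_sec * fs))          # constant-time window -> L grows with fs
    t = np.arange(L) / fs                 # sample instants in seconds
    h = (t ** (order - 1)) * np.exp(-2 * np.pi * bw[:, None] * t) \
        * np.cos(2 * np.pi * fc[:, None] * t + phase[:, None])
    # L1-normalize each kernel so gains are comparable across fs
    h /= np.maximum(np.abs(h).sum(axis=1, keepdims=True), 1e-12)
    # Heuristic anti-aliasing: zero filters centered at or above Nyquist
    h[fc >= fs / 2] = 0.0
    return h
```

Note that the abrupt truncation to `L` samples (no tapering) is exactly the spectral-leakage concern raised above, and the hard `fc >= fs/2` mask ignores bandwidth tails near Nyquist.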
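On the "learnable anti-alias gating" point: one obvious differentiable replacement for the non-differentiable zero-out rule is a smooth sigmoid gate on center frequency relative to Nyquist. This is a hypothetical sketch (the `sharpness` and `margin` parameters are made-up knobs), not something evaluated in the paper:

```python
import numpy as np

def soft_nyquist_gate(fc, fs, sharpness=20.0, margin=0.9):
    """Smooth per-filter gain: ~1 well below Nyquist, -> 0 above it.

    Differentiable w.r.t. fc, so gradients can still flow through
    near-Nyquist filters instead of being hard-masked to zero.
    """
    x = np.asarray(fc, dtype=float) / (fs / 2)   # fc as fraction of Nyquist
    return 1.0 / (1.0 + np.exp(sharpness * (x - margin)))
```

The gate would multiply each generated kernel; `margin < 1` starts attenuating before Nyquist, addressing the bandwidth-tail issue the hard rule ignores.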
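The "oversample-then-decimate within the layer" idea mentioned under aliasing mitigation amounts to lowpass-filtering at the new Nyquist before dropping samples. A self-contained windowed-sinc sketch (tap count and Hamming window are arbitrary choices for illustration):

```python
import numpy as np

def antialias_decimate(x, factor, taps=63):
    """Lowpass at the post-decimation Nyquist, then keep every factor-th sample."""
    n = np.arange(taps) - (taps - 1) / 2
    cutoff = 0.5 / factor                      # normalized, cycles/sample
    h = 2 * cutoff * np.sinc(2 * cutoff * n) * np.hamming(taps)
    h /= h.sum()                               # unity DC gain
    y = np.convolve(x, h, mode="same")         # zero-phase-aligned FIR lowpass
    return y[::factor]
```

Inside an SFI layer this would let kernels be generated at a fixed internal rate and decimated per target Fs, at the cost of the extra filtering compute flagged in the scaling bullet.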