Joint Time-Frequency Scattering (JTFS)
- Joint Time-Frequency Scattering is a multilayer wavelet transform that produces locally invariant representations of audio signals by cascading constant-Q and 2D joint convolutions.
- It integrates a constant-Q Morlet filterbank with 2D joint time–frequency wavelets to capture complex modulations such as vibrato, tremolo, and transient textures.
- JTFS provides stable, discriminative features for applications like audio texture synthesis, perceptual sound matching, and classification by balancing time and frequency resolution.
Joint Time-Frequency Scattering (JTFS) is a multilayer, convolutional-wavelet transform yielding a structured, locally invariant representation of the spectrotemporal modulations of audio signals. Originally framed as an extension of time scattering, JTFS integrates a constant-Q Morlet filterbank in time with a two-dimensional wavelet transform along both time and log-frequency axes, followed by modulus nonlinearities and local averaging. This architecture produces a feature space with high discriminative power for complex nonstationary phenomena, such as frequency and amplitude modulation, tremolo, vibrato, and transient-rich textures, while providing mathematical guarantees of time-shift invariance and Lipschitz stability to deformations (Andén et al., 2015, Andén et al., 2018, Lostanlen et al., 12 Feb 2026, Muradeli et al., 2022).
1. Mathematical Formulation and Cascade Architecture
JTFS operates by cascading wavelet convolution, modulus, and local averaging in multiple layers, extending one-dimensional time scattering to a two-dimensional time–frequency setting:
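- First-layer: Constant-Q Temporal Wavelets
  - Input $x(t) \in L^2(\mathbb{R})$ (finite energy).
  - Morlet wavelets $\psi_\lambda(t) = \lambda \bigl(e^{i\lambda t} - \kappa\bigr) e^{-\lambda^2 t^2 / (2Q^2)}$, where $Q$ is the quality factor and $\kappa$ ensures zero mean. Center frequencies $\lambda$ are spaced geometrically and cover the frequency axis at constant-Q resolution.
  - Compute the first-layer modulus "scalogram":

$$U_1 x(t, \lambda) = \bigl| x * \psi_\lambda \bigr|(t)$$

- Second-layer: 2D Joint Time–Frequency Wavelets
  - Temporal modulation wavelets $\psi_\alpha(t)$ (center rate $\alpha$, Q=1), and spectral modulation wavelets $\psi_\beta(\log_2 \lambda)$ (center scale $\beta$, Q=1), both discretized geometrically, one wavelet per octave.
  - The joint filter is the tensor product:

$$\Psi_{\alpha,\beta}(t, \log_2 \lambda) = \psi_\alpha(t)\, \psi_\beta(\log_2 \lambda)$$

  - The joint 2D convolution (taken over both $t$ and $\log_2 \lambda$), followed by modulus, isolates spectrotemporal modulations:

$$U_2 x(t, \lambda, \alpha, \beta) = \bigl| U_1 x * \Psi_{\alpha,\beta} \bigr|(t, \log_2 \lambda)$$

- Local Averaging and Invariance
  - Temporal and frequency domain low-pass filters ($\phi_T(t)$, $\phi_F(\log_2 \lambda)$), yielding:

$$S_1 x(t, \lambda) = \bigl(U_1 x * \phi_T\bigr)(t, \lambda), \qquad S_2 x(t, \lambda, \alpha, \beta) = \bigl(U_2 x * (\phi_T \otimes \phi_F)\bigr)(t, \log_2 \lambda)$$

  - Local shift-invariance up to scales $T$ (time) and $F$ (frequency, in octaves).
- Scattering Feature Vector
  - JTFS features are typically the concatenation of $S_1 x$ and $S_2 x$, often after stabilized logarithmic compression, with $\varepsilon > 0$ a small constant:

$$\widetilde{S} x = \log\!\left(1 + \frac{S x}{\varepsilon}\right)$$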
The feature dimensionality depends on the configuration of the filterbanks, typically 1–2 for audio textures with realistic parameter sweeps (Mitcheltree et al., 11 Feb 2026, Han et al., 2023, Muradeli et al., 2022, Andén et al., 2015).
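The cascade described in this section can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the reference implementation of any library: the function names (`morlet_hat`, `lowpass_hat`, `scalogram`, `jtfs`), the frequency grids, the bandwidth constants, the use of signed spectral scales for the two modulation orientations, and the circular (FFT) boundary handling are all choices made here for brevity, and no subsampling is applied.

```python
import numpy as np

def morlet_hat(n, xi, sigma):
    """Frequency response of a Morlet wavelet on n FFT bins (cycles/sample).

    A Gaussian bump at xi minus a scaled Gaussian at 0, so that the response
    vanishes at frequency 0 (zero-mean wavelet).
    """
    f = np.fft.fftfreq(n)
    band = np.exp(-0.5 * ((f - xi) / sigma) ** 2)
    low = np.exp(-0.5 * (f / sigma) ** 2)
    return band - band[0] * low  # low[0] == 1, hence exactly 0 at f == 0

def lowpass_hat(n, sigma):
    """Frequency response of a Gaussian low-pass filter with unit gain at 0."""
    return np.exp(-0.5 * (np.fft.fftfreq(n) / sigma) ** 2)

def scalogram(x, J=6, Q=8):
    """First layer: U1[k, t] = |x * psi_lambda_k|(t) on a constant-Q grid."""
    x_hat = np.fft.fft(x)
    lambdas = 0.4 * 2.0 ** (-np.arange(J * Q) / Q)  # Q filters per octave
    U1 = np.empty((J * Q, len(x)))
    for k, lam in enumerate(lambdas):
        sigma = lam * (2.0 ** (1.0 / Q) - 1.0)  # bandwidth proportional to lam
        U1[k] = np.abs(np.fft.ifft(x_hat * morlet_hat(len(x), lam, sigma)))
    return U1

def jtfs(x, J=6, Q=8, n_rates=4, n_scales=2, T=1024, F=1.0):
    """Return (S1, S2): time-averaged scalogram and averaged joint coefficients.

    T is in samples and F in octaves; both axes use circular convolution.
    """
    U1 = scalogram(x, J, Q)
    n_f, n_t = U1.shape
    phi_t = lowpass_hat(n_t, 1.0 / T)
    phi_f = lowpass_hat(n_f, 1.0 / (F * Q))  # F octaves span F * Q bins
    phi_2d = np.outer(phi_f, phi_t)
    S1 = np.real(np.fft.ifft(np.fft.fft(U1, axis=1) * phi_t, axis=1))
    U1_hat = np.fft.fft2(U1)
    S2 = []
    for j in range(n_rates):
        alpha = 0.25 * 2.0 ** (-j)  # temporal rate, cycles/sample
        psi_t = morlet_hat(n_t, alpha, 0.4 * alpha)
        for k in range(n_scales):
            for sign in (1.0, -1.0):  # both modulation orientations
                beta = sign * 0.25 * 2.0 ** (-k)  # cycles per log-frequency bin
                psi_f = morlet_hat(n_f, beta, 0.4 * abs(beta))
                # Tensor-product joint filter, applied in the 2D Fourier domain.
                U2 = np.abs(np.fft.ifft2(U1_hat * np.outer(psi_f, psi_t)))
                S2.append(np.real(np.fft.ifft2(np.fft.fft2(U2) * phi_2d)))
    return S1, np.stack(S2)

t = np.arange(4096)
x = np.cos(2 * np.pi * (0.05 * t + 2.0 * np.sin(2 * np.pi * t / 512)))  # vibrato
S1, S2 = jtfs(x)
print(S1.shape, S2.shape)  # (48, 4096) (16, 48, 4096)
```

Each slice `S2[p]` corresponds to one pair $(\alpha, \beta)$; a practical implementation would additionally subsample every path according to its bandwidth and pad the signal to limit boundary effects.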
2. Tiling of the Time–Frequency Plane and Parameterization
JTFS achieves a detailed tiling of the time–frequency plane through joint selection of filter center frequencies and bandwidths:
The first-layer wavelets implement a constant-Q (log-frequency) coverage for high frequencies, with bandwidth $\lambda / Q$ at center frequency $\lambda$, transitioning to constant bandwidth below an "elbow scale".
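The second layer's 2D wavelets form a separable basis on the scalogram, localizing joint modulations in rectangles of size $\alpha^{-1}$ (time) and $|\beta|^{-1}$ (log-frequency), where $\alpha$ and $\beta$ are the center rate and scale of the second-layer wavelets, permitting analysis of localized and oriented Gabor-like atoms.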
Hyperparameters include:
- $J$: number of octaves, $Q$: temporal quality factor (typ. 8–12).
- $Q_\alpha$, $Q_\beta$: second-layer (temporal and spectral modulation) quality factors, often 1.
- $T$, $F$: temporal and frequency averaging windows.
- Subsampling factors may be introduced for computational tractability, dependent on the redundancy of the wavelet cover (Lostanlen et al., 12 Feb 2026, Andén et al., 2015, Lostanlen, 2018).
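To make the tiling concrete, the sketch below lists first-layer center frequencies, bandwidths, and approximate durations for a given number of octaves and quality factor. The rule `bandwidth = max(lambda / Q, 1 / T)` is an assumption used here to illustrate the elbow (constant-Q above it, constant bandwidth below it); the exact rule differs between implementations, and the name `constant_q_tiling` and its defaults are hypothetical.

```python
import numpy as np

def constant_q_tiling(J=8, Q=12, T=256, f_max=0.5):
    """Center frequencies, bandwidths and durations of first-layer filters.

    Assumed rule: bandwidth = max(lambda / Q, 1 / T). Frequencies are in
    cycles/sample and T is in samples.
    """
    lambdas = f_max * 2.0 ** (-np.arange(J * Q) / Q)  # Q filters per octave
    bandwidths = np.maximum(lambdas / Q, 1.0 / T)     # floor below the elbow
    durations = 1.0 / bandwidths                      # never longer than T
    return lambdas, bandwidths, durations

lambdas, bandwidths, durations = constant_q_tiling()
is_constant_q = bandwidths > 1.0 / 256
elbow = lambdas[is_constant_q].min()
print(f"{is_constant_q.sum()} of {lambdas.size} filters are constant-Q; "
      f"elbow near {elbow:.4f} cycles/sample")
```

Raising `Q` sharpens frequency resolution but lengthens the filters, so more of the low-frequency range falls below the elbow for a fixed `T`.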
3. Computational Realization and Differentiable Implementations
JTFS can be efficiently implemented using FFT-based convolutions for both 1D and 2D wavelet transforms. The critical operators are differentiable, supporting integration with deep learning frameworks. Notable open-source implementations include Kymatio and its higher-level derivatives:
- GPU kernels apply FFT-based convolution, modulus, and 2D average pooling (Muradeli et al., 2022).
- Pseudocode exposes sequential layering: constant-Q filtering, 1D modulus+pooling, 2D joint convolution+modulus+pooling, and aggregation over scattering paths (Lostanlen et al., 12 Feb 2026, Muradeli et al., 2022).
- Advanced stochastic optimization schemes such as SCRAPL further address computational burden by randomly sampling over JTFS paths and employing adaptive stochastic gradient techniques (P-Adam, P-SAGA, θ-importance sampling). This provides ~20× acceleration at modest cost in perceptual accuracy when JTFS is used as a differentiable loss (Mitcheltree et al., 11 Feb 2026).
| Library | Backend | Differentiable | GPU Support | Notes |
|---|---|---|---|---|
| Kymatio | NumPy/PyTorch/TF | Yes | Yes | Widest support |
| scattering.m | MATLAB/Python | Partial | No (MATLAB) | Early reference |
| SCRAPL | Python | Yes | Yes | JTFS loss focus |
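The idea of sampling scattering paths, mentioned above in connection with SCRAPL, can be illustrated with a plain uniform sampler. This sketch is not the SCRAPL algorithm (it omits P-Adam, P-SAGA, and θ-importance sampling); it only shows that a loss evaluated on one randomly drawn path is an unbiased estimate of the loss averaged over all paths. The arrays are random stand-ins for JTFS coefficients, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_paths, n_coeffs = 16, 128
S_target = rng.random((n_paths, n_coeffs))    # stand-in: JTFS of a target sound
S_estimate = rng.random((n_paths, n_coeffs))  # stand-in: JTFS of a synth output

def path_loss(p):
    """Squared distance restricted to a single scattering path p."""
    return np.sum((S_target[p] - S_estimate[p]) ** 2)

def full_loss():
    """Loss averaged over all paths: every path must be computed."""
    return np.mean([path_loss(p) for p in range(n_paths)])

def sampled_loss(rng):
    """One-path estimate: draw a path index uniformly at random."""
    return path_loss(rng.integers(n_paths))

estimates = [sampled_loss(rng) for _ in range(2000)]
print(full_loss(), np.mean(estimates))
```

In an actual training loop only the sampled path would be computed at each step, which is where the savings come from; the price is extra gradient variance, which is what the adaptive schemes cited above aim to control.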
4. Theoretical Guarantees and Stability Properties
JTFS inherits and extends Mallat's scattering transform guarantees:
- Translation Invariance: Averaging with the low-pass filters $\phi_T$ (time) and $\phi_F$ (log-frequency) produces invariance to time shifts up to $T$ and frequency shifts up to $F$ (Andén et al., 2015, Lostanlen et al., 2020).
- Lipschitz Stability: The modulus-convolution cascade is provably stable to time-warpings and small deformations, implying robust embeddings under pitch- or rate-modulated transformations (Andén et al., 2018, Andén et al., 2015, Czaja et al., 2016).
- Energy Preservation and Exponential Decay: The sequence of modulus convolutions with uniform covering frames retains most energy in low-depth, frequency-decreasing paths, and total energy in deeper layers decays exponentially, supporting practical truncation at second order (Czaja et al., 2016).
- Approximate Invertibility: For sufficiently deep and dense JTFS, the mapping is invertible up to global translations, and phase retrieval through gradient-based optimization enables convincing audio texture resynthesis (Lostanlen et al., 2019).
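The local translation invariance stated above can be checked numerically. The sketch below is a minimal first-order experiment under assumptions chosen here (white-noise input, one Gaussian band-pass filter at 0.1 cycles/sample, a circular shift of 64 samples, and an averaging scale of 2048 samples): the modulus coefficients change substantially under the shift, while their low-pass averaged version barely moves because the shift is much smaller than the averaging scale.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, shift = 2 ** 14, 2 ** 11, 64
x = rng.standard_normal(n)

f = np.fft.fftfreq(n)
psi_hat = np.exp(-0.5 * ((f - 0.1) / 0.01) ** 2)  # analytic band-pass filter
phi_hat = np.exp(-0.5 * (f * T) ** 2)             # low-pass of width about 1/T

def modulus_coeffs(sig):
    """U = |sig * psi|, computed with circular FFT convolution."""
    return np.abs(np.fft.ifft(np.fft.fft(sig) * psi_hat))

def averaged_coeffs(sig):
    """S = U * phi_T, the locally time-invariant coefficients."""
    return np.real(np.fft.ifft(np.fft.fft(modulus_coeffs(sig)) * phi_hat))

def relative_change(op):
    """Relative L2 distance between op(x) and op(shifted x)."""
    a, b = op(x), op(np.roll(x, shift))
    return np.linalg.norm(a - b) / np.linalg.norm(a)

rel_U = relative_change(modulus_coeffs)
rel_S = relative_change(averaged_coeffs)
print(f"relative change without averaging: {rel_U:.3f}, with averaging: {rel_S:.3f}")
```

Increasing `shift` toward `T` makes `rel_S` grow, which is the sense in which the invariance is local rather than global.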
5. Comparison with Related Spectrotemporal Representations
JTFS generalizes and improves upon prior biologically and physically motivated spectrotemporal feature extractors:
- Spectrotemporal Receptive Fields (STRF) and Gabor Filterbanks (GBFB): Employ 2D Gabor-like filtering on spectrograms but use linear, non-cascaded filtering and typically do not employ nonlinearity or multiresolution parameterization.
- Modulation Power Spectrum (MPS): Computes the 2D Fourier transform of the squared-magnitude STFT, capturing global modulation statistics but sacrificing localization and invertibility (Lostanlen et al., 12 Feb 2026, Lostanlen et al., 2019).
- JTFS is distinguished by: (i) multilayer wavelet cascades with complex modulus, (ii) explicit adaptation of bandwidths across scales (constant-Q/elbow construction), (iii) controlled local invariance through separable averaging, and (iv) Lipschitz continuity for stability (Lostanlen et al., 12 Feb 2026, Andén et al., 2015).
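For comparison, the modulation power spectrum described above reduces to a few NumPy calls. The sketch assumes a Hann-windowed STFT with hypothetical window and hop sizes, and follows the definition given in the text (2D Fourier transform of the squared-magnitude STFT). The last lines illustrate the loss of localization: a circular shift of the spectrogram along time leaves the MPS unchanged.

```python
import numpy as np

def stft_power(x, win=256, hop=64):
    """Squared-magnitude STFT, shape (win // 2 + 1, n_frames)."""
    window = np.hanning(win)
    n_frames = 1 + (len(x) - win) // hop
    frames = np.stack([x[i * hop:i * hop + win] * window
                       for i in range(n_frames)], axis=1)
    return np.abs(np.fft.rfft(frames, axis=0)) ** 2

def modulation_power_spectrum(spec):
    """Squared modulus of the 2D Fourier transform of a spectrogram."""
    return np.abs(np.fft.fft2(spec)) ** 2

t = np.arange(4096)
x = (1.0 + 0.5 * np.cos(2 * np.pi * t / 256)) * np.cos(2 * np.pi * 0.1 * t)  # tremolo
spec = stft_power(x)
mps = modulation_power_spectrum(spec)

# Moving events in time only changes the phase of the 2D transform,
# which the squared modulus discards.
mps_shifted = modulation_power_spectrum(np.roll(spec, 7, axis=1))
print(np.max(np.abs(mps - mps_shifted)) / mps.max())
```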
6. Applications in Machine Learning, Audio Analysis, and Synthesis
JTFS is employed in diverse applications including:
- Texture Synthesis and Musical Metamerism: JTFS coefficients serve as differentiable statistics for audio resynthesis and metamer generation via inverse optimization; JTFS-based synthesis demonstrates improved preservation of spectrotemporal structure relative to MFCC, pure time-scattering, or MPS, key for “musical metamerism” (Lostanlen et al., 12 Feb 2026, Lostanlen et al., 2019).
- Perceptual Sound Matching and Inverse Problems: Integration with neural synthesis pipelines (e.g., DDSP) enables perceptually meaningful, gradient-friendly loss functions for sound matching tasks; JTFS-based loss functions such as the PNP framework yield state-of-the-art results and 100× acceleration via precomputation and quadratic-model approximations (Han et al., 2023).
- Classification and Retrieval: JTFS features combined with simple linear metric learning suffice to recover human similarity judgments between instrumental techniques, outperforming MFCC and shallow scattering representations on isolated-note clustering and instrument/technique identification (Lostanlen et al., 2020). In TIMIT phone classification and acoustic scene classification, JTFS outperforms both hand-engineered and learned CNN/MLP features, especially in data-scarce conditions (Andén et al., 2015, Muradeli et al., 2022).
- Unsupervised and supervised feature learning: JTFS has been shown to better linearize independent generative factors in manifold learning tasks compared to MFCC, Scat1D, STRF, and OpenL3 (Muradeli et al., 2022).
7. Parameter Regimes, Design Trade-Offs, and Practical Guidelines
The discriminative capacity and invariances of JTFS are tunable via its parameters:
- Typical first-order temporal Q: 8–12 per octave, matching human auditory discrimination; second-order modulations: Q=1, a few filters per octave (Han et al., 2023, Andén et al., 2015).
- Averaging window $T$ is chosen to match the time scale over which local invariance is desired (hundreds of ms for musical notes).
- Spectral averaging $F$ provides invariance to pitch or timbral transpositions; often 1–2 octaves for note-level analyses (Lostanlen et al., 12 Feb 2026).
- Multirate downsampling is applied in each layer proportional to the effective bandwidth, reducing feature dimensionality by up to 30× vs. fully sampled 2D STRF grids (Lostanlen et al., 2020).
- Logarithmic or median-based nonlinearity stabilizes dynamic range prior to use in learning or distance metrics (Han et al., 2023, Lostanlen et al., 2020).
- Stochastic or importance path sampling (SCRAPL) enables practical use as a loss for deep stochastic gradient descent (Mitcheltree et al., 11 Feb 2026).
| Parameter | Typical values | Effect |
|---|---|---|
| Q (temporal) | 8–12 / octave | Frequency vs. time resolution |
| T (avg window) | 100–500 ms | Time shift invariance |
| F (spectral avg) | 1–2 octaves | Frequency invariance |
| Q_mod (2D layer) | 1 / octave | Joint modulation resolution |
| Downsampling | Critical (by α, β) | Dimensionality reduction |
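The dynamic-range stabilization listed above can be sketched as a median-normalized logarithm. The exact form varies between papers; the version below, $\log(1 + S / (\varepsilon \mu))$ with $\mu$ the per-path median over a set of examples, is one common choice and is assumed here for illustration. The function name, the value of `eps`, and the synthetic feature matrix are all hypothetical.

```python
import numpy as np

def log_compress(S, eps=1e-3):
    """Median-normalized log compression of nonnegative features.

    S has shape (n_examples, n_paths). Each path is divided by its median
    over examples before log1p, so paths with very different energy scales
    end up in a comparable range.
    """
    mu = np.median(S, axis=0, keepdims=True)
    mu = np.maximum(mu, np.finfo(S.dtype).tiny)  # guard against zero medians
    return np.log1p(S / (eps * mu))

rng = np.random.default_rng(0)
# Three paths whose raw magnitudes differ by twelve orders of magnitude.
S = rng.random((101, 3)) * np.array([1e-6, 1.0, 1e6])
Z = log_compress(S)
print(np.median(S, axis=0))
print(np.median(Z, axis=0))
```

After compression every path has the same median, $\log(1 + 1/\varepsilon)$, so distances between feature vectors are no longer dominated by the highest-energy paths.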
Empirical results support JTFS as currently the most effective handcrafted auditory representation for nonstationary timbre, transient structure, and cross-frequency modulation analysis (Lostanlen et al., 2020, Han et al., 2023, Andén et al., 2015, Muradeli et al., 2022).