STSM-FiLM: Neural TSM with FiLM Conditioning
- The paper introduces a neural architecture leveraging FiLM conditioning to achieve robust time-scale modification of speech while preserving pitch.
- It employs an encoder–FiLM–decoder paradigm with four encoder–decoder variants, evaluated with metrics such as PESQ (perceptual quality) and STOI (intelligibility).
- FiLM conditioning enables continuous control of playback speed across a wide range of α, outperforming traditional WSOLA methods under non-stationary conditions.
STSM-FiLM is a fully neural architecture for time-scale modification (TSM) of speech, designed to alter the playback rate of audio without affecting pitch. It leverages Feature-Wise Linear Modulation (FiLM) as a continuous conditioning mechanism on the speed factor, aiming to outperform classical approaches such as Waveform Similarity-based Overlap-Add (WSOLA), especially under non-stationary or extreme time-scaling. STSM-FiLM employs an encoder–FiLM–decoder paradigm, supervised on WSOLA-generated pseudo-ground truth, and supports four distinct encoder–decoder variants, enabling robust generalization and high perceptual consistency across a wide range of time-scaling factors (Wisnu et al., 3 Oct 2025).
1. Neural Architecture and Workflow
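The STSM-FiLM architecture follows an explicit encoder–conditioning–decoder workflow:

$$x \xrightarrow{\text{Encoder}} h \xrightarrow{\text{FiLM}(\alpha)} \tilde{h} \xrightarrow{\text{Interpolation}} h' \xrightarrow{\text{Decoder}} \hat{y}$$

where $x$ is the input waveform, $\alpha$ is the speed factor, $h$, $\tilde{h}$, and $h'$ are intermediate features, and $\hat{y}$ is the TSM output.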
The process comprises:
- Encoding: Input waveforms are projected to high-level features. The choice of encoder and decoder varies by system variant.
- FiLM Conditioning: Features are modulated via FiLM layers parametrized by $\alpha$.
- Temporal Interpolation: The conditioned feature sequence is interpolated in time to reach the target length $T'$ implied by $\alpha$.
- Decoding: The modified sequence is reconstructed into a time-scaled waveform.
Ground-truth targets are constructed by applying WSOLA to the input: $y = \mathrm{WSOLA}(x, \alpha)$.
2. Encoder–Decoder Variants
Four encoder–decoder configurations are supported, each reflecting different feature priors and reconstruction strategies:
| System | Encoder Features | Decoder |
|---|---|---|
| STFT-HiFiGAN | STFT log-mel features ($1024$-dim) | HiFi-GAN vocoder |
| WavLM-HiFiGAN | Layer 6 WavLM-Large ($1024$-dim) | HiFi-GAN |
| Whisper-HiFiGAN | Last layer Whisper-Medium ($1024$-dim) | HiFi-GAN |
| EnCodec | EnCodec quantized codes | EnCodec decoder |
- In STFT-, WavLM-, and Whisper-based systems, FiLM layers are applied after each convolutional block in the HiFi-GAN generator.
- In EnCodec, FiLM is injected prior to quantization, and interpolation operates in latent space before decoding.
These designs enable trade-offs in fidelity and robustness, with spectral encoders (STFT) providing highest SNR and contextual encoders (WavLM) maximizing perceptual naturalness and ASR performance.
3. FiLM Conditioning Mechanism
FiLM conditioning enables continuous control over the time-scaling parameter:
- The speed factor $\alpha$ is mapped to affine modulation parameters $\gamma(\alpha)$ and $\beta(\alpha)$ via a small MLP.
- For each channel $c$, feature-wise linear modulation is performed: $\tilde{h}_c = (1 + \gamma_c(\alpha))\, h_c + \beta_c(\alpha)$.
- The additive identity (“$1 +$” on $\gamma$) ensures stable training and preserves the initial representation: when $\gamma = 0$ and $\beta = 0$, the layer reduces to the identity.
This conditioning is applied in every FiLM layer of the generator. In EnCodec, FiLM is applied directly to the encoder latents (pre-quantization). Because $\alpha$ enters as a continuous input, a single model generalizes across a continuous range of speed factors without requiring separate models.
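The modulation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the helper names (`make_film_mlp`, `film_params`, `film`), the single hidden `tanh` layer, and the zero-initialized output layer are all assumptions made for the example.

```python
import math
import random

def make_film_mlp(hidden, channels, seed=0):
    """Tiny MLP mapping alpha -> (gamma, beta). Layer sizes are illustrative."""
    rng = random.Random(seed)
    w1 = [rng.uniform(-1.0, 1.0) for _ in range(hidden)]
    b1 = [0.0] * hidden
    # Assumption: the output layer starts at zero, so gamma = beta = 0 initially.
    w2 = [[0.0] * hidden for _ in range(2 * channels)]
    b2 = [0.0] * (2 * channels)
    return {"w1": w1, "b1": b1, "w2": w2, "b2": b2, "channels": channels}

def film_params(mlp, alpha):
    """Return per-channel (gamma, beta) for a scalar speed factor alpha."""
    hidden = [math.tanh(w * alpha + b) for w, b in zip(mlp["w1"], mlp["b1"])]
    out = [sum(w * h for w, h in zip(row, hidden)) + b
           for row, b in zip(mlp["w2"], mlp["b2"])]
    c = mlp["channels"]
    return out[:c], out[c:]

def film(features, gamma, beta):
    """Apply (1 + gamma_c) * h_c + beta_c to every frame.

    features: list of frames, each a list of C channel values.
    """
    return [[(1.0 + g) * v + b for v, g, b in zip(frame, gamma, beta)]
            for frame in features]
```

With the zero-initialized output layer assumed here, `film` returns its input unchanged for any `alpha`, which is the identity behaviour that the “$1 +$” term provides at the start of training.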
4. Training Objectives and Supervision
Supervision relies entirely on WSOLA-generated pseudo-ground-truth targets. The overall loss comprises the following terms:
- $\mathcal{L}_{\mathrm{adv}}$ is a HiFi-GAN style adversarial loss.
- $\mathcal{L}_{\mathrm{FM}}$ is a feature-matching loss over internal discriminator features.
No explicit alignment or regularization terms are required beyond these objectives.
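As an illustration of the two loss terms, the sketch below uses the least-squares adversarial form and the L1 feature-matching form from the original HiFi-GAN formulation. Whether STSM-FiLM uses exactly these forms or any particular weighting is an assumption; the function names are hypothetical, and discriminator outputs are represented as plain lists of floats.

```python
def generator_adv_loss(d_fake):
    """Least-squares generator loss: mean of (D(G(x, alpha)) - 1)^2."""
    return sum((s - 1.0) ** 2 for s in d_fake) / len(d_fake)

def discriminator_adv_loss(d_real, d_fake):
    """Least-squares discriminator loss.

    d_real: discriminator scores on WSOLA targets (pushed towards 1).
    d_fake: discriminator scores on generated audio (pushed towards 0).
    """
    real = sum((s - 1.0) ** 2 for s in d_real) / len(d_real)
    fake = sum(s ** 2 for s in d_fake) / len(d_fake)
    return real + fake

def feature_matching_loss(feats_real, feats_fake):
    """Sum over discriminator layers of the mean L1 distance between features.

    feats_real / feats_fake: one flat list of activations per layer.
    """
    total = 0.0
    for layer_real, layer_fake in zip(feats_real, feats_fake):
        total += sum(abs(a - b) for a, b in zip(layer_real, layer_fake)) / len(layer_real)
    return total
```

In this setup the "real" samples are the WSOLA outputs, so the discriminator learns to separate WSOLA-stretched speech from the generator's output at the same $\alpha$.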
5. Time-Scale Modification and Signal Processing
Time-scale modification is parameterized by the speed factor $\alpha$:
- $\alpha < 1$ induces time-expansion (slower playback).
- $\alpha > 1$ induces time-compression (faster playback).
Resampling in the neural feature domain is a linear interpolation of the feature sequence from its original length $T$ to the target length $T'$ along the time axis. The resampled sequence is then decoded back to a waveform by the respective decoder. The pipeline thereby mimics classical WSOLA alignment and synthesis using deep, learned representations.
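The feature-domain resampling can be sketched as follows. Two assumptions are made here and are not taken from the source: the target length is computed as roughly $T / \alpha$ (consistent with $\alpha$ being a playback-speed factor), and the interpolation aligns the first and last frames of input and output. Function names are hypothetical.

```python
def target_length(num_frames, alpha):
    """Assumed rule: a speed factor alpha yields about T / alpha output frames."""
    return max(1, round(num_frames / alpha))

def interpolate_time(features, new_len):
    """Linearly interpolate a [T][C] feature sequence to new_len frames.

    Endpoints are aligned: output frame 0 maps to input frame 0 and the
    last output frame maps to the last input frame.
    """
    old_len = len(features)
    if new_len == 1 or old_len == 1:
        return [list(features[0]) for _ in range(new_len)]
    out = []
    for t in range(new_len):
        pos = t * (old_len - 1) / (new_len - 1)  # fractional input index
        lo = int(pos)
        hi = min(lo + 1, old_len - 1)
        w = pos - lo                              # weight of the right neighbour
        out.append([(1.0 - w) * a + w * b
                    for a, b in zip(features[lo], features[hi])])
    return out
```

For example, three frames `[[0.0], [2.0], [4.0]]` at `alpha = 0.6` give a target length of 5 and an interpolated sequence `[[0.0], [1.0], [2.0], [3.0], [4.0]]`. Other endpoint conventions are equally plausible and would shift the sample positions slightly.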
6. Empirical Evaluation and Generalization
STSM-FiLM is trained on VCTK (English) and TMHINT-QI (Mandarin) at 16 kHz, with $\alpha$ sampled from the training range of speed factors. Key evaluation metrics include PESQ, STOI, DNSMOS, and ASR-derived WER/CER. Main results (averaged over the tested $\alpha$ values up to $2.0$) are summarized below:
| System | PESQ | STOI | DNSMOS | WER | CER |
|---|---|---|---|---|---|
| STFT-HiFiGAN | 2.034 | 0.894 | 2.978 | 0.112 | 0.066 |
| WavLM-HiFiGAN | 1.924 | 0.891 | 2.986 | 0.103 | 0.055 |
| Whisper-HiFiGAN | 1.200 | 0.761 | 2.897 | 0.198 | 0.332 |
| EnCodec | 1.067 | 0.574 | 2.244 | 1.156 | 1.338 |
| TSM-Net (ref) | 1.417 | 0.741 | 2.287 | 0.443 | 0.222 |
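The WER and CER values above 1 in the EnCodec row are possible because both metrics divide the edit distance by the reference length, so a transcript with many inserted tokens can exceed 100%. The sketch below assumes the standard Levenshtein-based definitions; the source does not state which ASR model or text normalization was used, and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute/insert/delete cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference word count."""
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / len(ref)

def cer(reference, hypothesis):
    """Character error rate, ignoring spaces (a simplifying assumption)."""
    ref = reference.replace(" ", "")
    return edit_distance(ref, hypothesis.replace(" ", "")) / len(ref)
```

For instance, `wer("the cat", "a the big cat ran fast")` is 2.0: four inserted words against a two-word reference.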
Robustness to $\alpha$ is reflected in flat PESQ/STOI curves for STFT- and WavLM-HiFiGAN across the tested scaling range. In subjective MOS tests (four systems, six speeds, rated by listeners):
| System | MOS |
|---|---|
| WSOLA | 4.33 |
| TSM-Net | 1.89 |
| STFT-HiFiGAN | 3.57 |
| WavLM-HiFiGAN | 4.40 |
Ablation on FiLM shows that enabling FiLM yields improvements of +0.5–0.6 PESQ and +0.02–0.03 STOI at extreme values of $\alpha$, confirming the stabilizing role of FiLM at non-stationary or unusual stretch factors.
FiLM conditioning enables smooth generalization across $\alpha$ without retraining, in contrast to models lacking continuous conditioning.
7. Significance and Distinctions from Prior Art
STSM-FiLM demonstrates that direct FiLM-based conditioning on continuous playback rates enhances the generalization and flexibility of neural TSM architectures beyond what is achievable with discrete class-based or architectural approaches. By explicitly leveraging high-level learned representations and WSOLA-based supervision, STSM-FiLM achieves robust perceptual quality and intelligibility over a wide range of speed factors.
The comparative evaluation illustrates that choice of encoder–decoder stack governs the trade-off between spectral fidelity (STFT-HiFiGAN) and contextual naturalness/ASR performance (WavLM-HiFiGAN). The architecture’s reliance on feature-level resampling, FiLM-mediated adaptation, and adversarial training represents a significant convergence of classical TSM alignment principles and modern deep generative modeling (Wisnu et al., 3 Oct 2025).