STSM-FiLM Neural Time-Scale Modification
- STSM-FiLM is a neural time-scale modification system that employs FiLM conditioning to enable continuous control over speech playback speed without altering pitch.
- The architecture integrates various encoder-decoder variants, including STFT-HiFiGAN and WavLM-HiFiGAN, to produce high perceptual consistency across a range of speed factors.
- Training leverages WSOLA-generated pseudo-targets along with adversarial and feature matching losses to achieve robust performance even under extreme time-stretching conditions.
STSM-FiLM is a fully neural architecture for time-scale modification (TSM) of speech, which aims to alter the playback rate of an audio signal while preserving its pitch. Classical systems such as Waveform Similarity-based Overlap-Add (WSOLA) provide baseline solutions but tend to introduce artifacts, especially under non-stationary or extreme stretching conditions. STSM-FiLM incorporates Feature-Wise Linear Modulation (FiLM) to condition the model on a continuous speed factor, enabling controllable and flexible time-scaling. The model is trained to mimic the alignment and synthesis behaviors of WSOLA by using its outputs for supervision, while benefiting from deep feature representations learned through neural networks. Four encoder–decoder configurations are evaluated: STFT-HiFiGAN, WavLM-HiFiGAN, Whisper-HiFiGAN, and EnCodec, each demonstrating high perceptual consistency over a wide range of scale factors (Wisnu et al., 3 Oct 2025).
1. Principles and Motivation of Time-Scale Modification via Neural Architectures
Time-Scale Modification (TSM) is the task of adjusting the duration of speech signals (compressing or stretching) without distorting pitch. Traditional algorithms (WSOLA, PSOLA) are based on local alignment heuristics and signal-level manipulations, which often fail under rapid or non-stationary changes due to local misalignment, phase incoherence, and discontinuities. Neural TSM methods are designed to learn representations and transformations that more robustly preserve signal structure, naturalness, and intelligibility. However, most previously published neural TSM models generalize poorly and lack explicit control over the playback rate, as they are trained only on discrete speed factors.
STSM-FiLM addresses this gap by introducing feature-wise conditioning on a continuous speed factor $\alpha$. By employing FiLM layers, the model allows direct, differentiable control over the time-scaling variable and learns to generate outputs that are both perceptually and contextually consistent over a broad range of modification rates, supervised by WSOLA as a teacher.
2. STSM-FiLM Architecture and FiLM Conditioning
STSM-FiLM follows an encoder–FiLM–decoder scheme:
- Encoder: The input waveform is transformed to a latent space, either via STFT, neural feature models (WavLM, Whisper), or EnCodec’s own encoder.
- FiLM Conditioning: A multilayer perceptron (MLP) maps the continuous speed factor $\alpha$ to affine parameters $\gamma(\alpha)$ and $\beta(\alpha)$, which are applied as:

$$\tilde{h}_t = (1 + \gamma(\alpha)) \odot h_t + \beta(\alpha),$$

where $h_t$ is the latent feature at time $t$ and the $1+$ offset ensures identity mapping at initialization.
- Decoder: The modulated features are synthesized back to waveform. HiFi-GAN is used for STFT, WavLM, and Whisper feature sets; EnCodec uses its quantizer and decoder for direct waveform reconstruction.
FiLM layers are integrated at multiple points (typically the decoder, or pre-quantization in EnCodec), ensuring dynamic modulation of internal representations without explicit sequence length alteration. Temporal duration adjustment is achieved by linear interpolation on the feature or latent code sequence proportional to $\alpha$, as sketched below.
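A minimal sketch of these two operations, FiLM modulation of a latent sequence followed by $\alpha$-proportional interpolation, assuming PyTorch; the module name `SpeedFiLM`, the MLP sizes, and the convention that output length scales with $\alpha$ are illustrative assumptions rather than specifics of the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeedFiLM(nn.Module):
    """Maps a scalar speed factor alpha to per-channel affine parameters."""
    def __init__(self, feat_dim: int, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * feat_dim),
        )
        # Zero-init the last layer so gamma = beta = 0 at the start,
        # making (1 + gamma) * h + beta an identity map at initialization.
        nn.init.zeros_(self.mlp[-1].weight)
        nn.init.zeros_(self.mlp[-1].bias)

    def forward(self, h: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
        # h: (batch, time, feat_dim); alpha: (batch,)
        gamma, beta = self.mlp(alpha.unsqueeze(-1)).chunk(2, dim=-1)
        return (1.0 + gamma.unsqueeze(1)) * h + beta.unsqueeze(1)

def time_rescale(h: torch.Tensor, alpha: float) -> torch.Tensor:
    # Linear interpolation along the time axis; here the output length is
    # taken to scale with alpha (the stretch-vs-speed convention is assumed).
    new_len = max(1, round(h.size(1) * alpha))
    return F.interpolate(h.transpose(1, 2), size=new_len,
                         mode="linear", align_corners=False).transpose(1, 2)
```

The zero-initialized final MLP layer is one way to realize the identity-at-initialization property noted above; the sequence length itself is changed only by the interpolation step, not by FiLM.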
3. Encoder–Decoder Variants in STSM-FiLM
The framework incorporates four encoder–decoder designs:
| Variant | Encoder Input | Decoder Architecture |
|---|---|---|
| STFT-HiFiGAN | Log-magnitude STFT | HiFi-GAN |
| WavLM-HiFiGAN | 1024-D WavLM features | HiFi-GAN |
| Whisper-HiFiGAN | Whisper encoder output | HiFi-GAN |
| EnCodec | EnCodec latent codes | EnCodec quantizer + decoder |
- STFT-HiFiGAN uses spectral features from STFT.
- WavLM-HiFiGAN utilizes self-supervised transformer features trained on speech data, with demonstrated improvements in naturalness and intelligibility.
- Whisper-HiFiGAN leverages multilingual speech features, though it is susceptible to padding artifacts.
- EnCodec employs latent codes and quantization with FiLM-conditioned latent manipulation.
Each variant applies FiLM conditioning appropriate to its latent space, followed by temporal interpolation; a feature-extraction sketch for the WavLM variant is given below.
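For concreteness, here is a minimal sketch of extracting the 1024-dimensional WavLM features consumed by the WavLM-HiFiGAN variant, assuming the HuggingFace `transformers` package and the `microsoft/wavlm-large` checkpoint (the specific checkpoint and input normalization are assumptions; the text above only states that 1024-D WavLM features are used):

```python
import torch
from transformers import WavLMModel

model = WavLMModel.from_pretrained("microsoft/wavlm-large").eval()

wav = torch.randn(1, 16000)                    # (batch, samples): 1 s of 16 kHz audio stand-in
wav = (wav - wav.mean()) / (wav.std() + 1e-7)  # zero-mean/unit-variance input normalization
with torch.no_grad():
    feats = model(wav).last_hidden_state       # (1, T_frames, 1024), roughly 50 frames/s
```

These frame-level features would then pass through FiLM modulation and temporal interpolation before HiFi-GAN vocoding.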
4. Training Regime and Loss Functions
- Supervision: WSOLA outputs are generated for each $(x, \alpha)$ pair and used as pseudo-targets during training. WSOLA's reliability on clean speech makes it a suitable teacher, although it can suffer under extreme stretching.
- Objective: the overall generator loss is

$$\mathcal{L} = \mathcal{L}_{\mathrm{adv}} + \lambda_{\mathrm{fm}} \, \mathcal{L}_{\mathrm{fm}},$$

where $\mathcal{L}_{\mathrm{adv}}$ is the adversarial HiFi-GAN loss and $\mathcal{L}_{\mathrm{fm}}$ is a feature matching loss. Adam is used for optimization, sampling $\alpha$ from a continuous range to expose the model to the full spectrum of speed factors (see the training-step sketch below).
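The following is a minimal sketch of one generator update under this regime, assuming PyTorch. The `wsola` teacher call, the discriminator interface returning (score, features), the weight `lambda_fm`, and the sampling range for $\alpha$ are illustrative assumptions; only the loss structure, adversarial plus feature matching against WSOLA pseudo-targets, follows the description above.

```python
import random
import torch
import torch.nn.functional as F

lambda_fm = 2.0  # feature-matching weight (illustrative value, not from the paper)

def generator_step(model, disc, wsola, x, opt):
    alpha = random.uniform(0.5, 2.0)          # continuous speed factor (range assumed)
    y_ref = wsola(x, alpha)                   # WSOLA pseudo-target for the (x, alpha) pair
    y_hat = model(x, alpha)                   # neural TSM output

    score_fake, feats_fake = disc(y_hat)      # discriminator logits + intermediate features
    _, feats_real = disc(y_ref)

    # Least-squares adversarial loss on the generator (HiFi-GAN style).
    loss_adv = torch.mean((score_fake - 1.0) ** 2)
    # Feature matching: L1 between discriminator features of output and pseudo-target.
    loss_fm = sum(F.l1_loss(f, r.detach()) for f, r in zip(feats_fake, feats_real))

    loss = loss_adv + lambda_fm * loss_fm
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```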
5. Evaluation: Perceptual Quality and Generalization
- Objective Metrics: STFT-HiFiGAN yields the highest PESQ and STOI, while WavLM-HiFiGAN excels in DNSMOS and ASR error rates. FiLM conditioning yields a consistent PESQ improvement of up to 0.6 points, especially at extreme $\alpha$ values (a metric-computation sketch follows this list).
- Intelligibility and Naturalness: WavLM-HiFiGAN with FiLM conditioning leads to lower WER and CER, indicating more robust preservation of linguistic content under speed modification.
- Subjective Results: In MOS tests, WavLM-HiFiGAN earns scores as high as, or slightly higher than, WSOLA on English and Mandarin, confirming the utility of FiLM-based neural TSM for perceptual quality.
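As a reference point for the objective metrics, here is a minimal sketch using the third-party `pesq` and `pystoi` packages; these are common open-source implementations, and their use here is an assumption rather than the paper's stated toolchain. Both metrics require time-aligned, equal-length reference and degraded signals (e.g., the model output versus the WSOLA pseudo-target at the same $\alpha$).

```python
import numpy as np
from pesq import pesq    # pip install pesq
from pystoi import stoi  # pip install pystoi

fs = 16000
ref = np.random.randn(3 * fs).astype(np.float32)               # stand-in reference signal
deg = ref + 0.01 * np.random.randn(3 * fs).astype(np.float32)  # stand-in system output

pesq_wb = pesq(fs, ref, deg, "wb")             # wideband PESQ at 16 kHz
stoi_val = stoi(ref, deg, fs, extended=False)  # classical (non-extended) STOI
print(f"PESQ: {pesq_wb:.2f}  STOI: {stoi_val:.3f}")
```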
6. Implications, Strengths, and Limitations
- FiLM conditioning enables smooth, granular control over $\alpha$, providing dynamic adjustment without retraining or architectural change.
- HiFi-GAN and WavLM variants demonstrate strong generalization across unseen speed factors, outperforming previous neural and classical methods especially under challenging conditions.
- Whisper-HiFiGAN struggles with padding-induced artifacts, and EnCodec can introduce quantization artifacts at high compression rates.
- A plausible implication is that richer contextual features (e.g., from WavLM) paired with FiLM conditioning are critical for robust neural TSM across languages and acoustic conditions.
- Extension to tasks like voice conversion and expressive synthesis is suggested, as is further exploration of alternative conditioning mechanisms for enhanced artifact removal and control.
7. Future Research Directions
- Integration of more robust contextual encoders to reduce artifacts, particularly in cross-lingual or noisy scenarios.
- Refinement of FiLM-based modulation to mitigate issues specific to neural codec quantization.
- Investigation of multi-factor conditioning (e.g., controlling both speed and pitch simultaneously), as well as self-supervised learning for TSM without reliance on classical teacher outputs.
STSM-FiLM establishes FiLM-conditioned neural architectures as competitive solutions for flexible, high-quality, and generalizable time-scale modification of speech. The system leverages classical supervision and deep learning, providing smooth control over playback speed and consistent output quality across a wide parameter space (Wisnu et al., 3 Oct 2025).