STSM-FiLM: Neural Time-Scale Modification
- STSM-FiLM is a neural time-scale modification technique that adjusts the duration of audio signals using FiLM-based modulation while preserving pitch and clarity.
- It employs an encoder–FiLM–decoder pipeline with temporal interpolation and adversarial training, outperforming classical methods like PSOLA and WSOLA in metrics such as PESQ and STOI.
- Despite minor artifacts under extreme scaling, STSM-FiLM supports real-time processing and robust handling of non-stationary inputs, offering enhanced audio synthesis quality.
STSM-FiLM encompasses a collection of neural and mathematical techniques for modifying the duration of audio signals, most notably speech, without altering their pitch or degrading perceptual intelligibility. The term covers both classical algorithmic approaches and recent deep learning architectures that employ Feature-Wise Linear Modulation (FiLM) for continuous time-scaling in a fully differentiable, end-to-end pipeline. STSM-FiLM systems are evaluated on standard metrics (PESQ, STOI, DNSMOS, WER/CER) and outperform traditional methods such as PSOLA and WSOLA, particularly under extreme time-stretching or non-stationary inputs (Wisnu et al., 3 Oct 2025, 0911.5171, Chu et al., 2022).
1. Time-Scale Modification: Definition and Classical Constraints
Time-Scale Modification (TSM) refers to the operation that, for a given signal $x(t)$ and scale factor $\alpha > 0$, generates an output $y(t)$ whose duration is $\alpha$ times that of the input while maintaining pitch and intelligibility. Classical time-domain methods, PSOLA and WSOLA, manipulate short, overlapping frames and apply either pitch-synchronous slicing (PSOLA) or waveform-similarity overlap-adding (WSOLA) for alignment and stretching.
| Method | Principle | Key Limitation |
|---|---|---|
| PSOLA | Pitch-synchronous frame OLA | Formant drift, phasing under non-integer harmonics |
| WSOLA | Frame overlap by cross-correlation | Transient smearing, extreme factor artifacts |
| Phase-Vocoder | Spectral bin phase/magnitude modification | "Phasiness," poor transient reproduction |
These classical approaches rely on hand-crafted heuristics for segment detection and alignment, which do not generalize smoothly to arbitrary scale factors $\alpha$, especially in non-stationary or highly transient speech (Wisnu et al., 3 Oct 2025).
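To make the classical baseline concrete, a minimal NumPy sketch of the WSOLA principle is shown below; the frame length, hop, search tolerance, and coarse search step are illustrative choices rather than the settings of any cited implementation, and the similarity criterion is simplified to correlation against the already-synthesized overlap region.

```python
import numpy as np

def wsola(x: np.ndarray, alpha: float, frame: int = 1024,
          hop: int = 256, tol: int = 256) -> np.ndarray:
    """Simplified WSOLA time-stretch: output duration ~ alpha * input duration."""
    win = np.hanning(frame)
    n_out = int(len(x) * alpha)
    y = np.zeros(n_out + frame)
    norm = np.zeros_like(y)

    read = 0.0                                  # fractional analysis position in the input
    for write in range(0, n_out, hop):
        center = int(read)
        lo = max(0, center - tol)               # search window around the nominal position
        hi = min(len(x) - frame, center + tol)
        if hi <= lo:
            break
        # pick the candidate frame that best matches the previously
        # overlap-added output region (maximum cross-correlation)
        target = y[write:write + frame]
        best, best_score = lo, -np.inf
        for cand in range(lo, hi, 16):          # coarse search step for brevity
            score = np.dot(target, win * x[cand:cand + frame])
            if score > best_score:
                best, best_score = cand, score
        y[write:write + frame] += win * x[best:best + frame]
        norm[write:write + frame] += win
        read += hop / alpha                     # input advances by hop / alpha per output hop

    return y[:n_out] / np.maximum(norm[:n_out], 1e-8)
```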
2. The STSM-FiLM Neural Architecture
STSM-FiLM introduces a deep neural framework conditioned on the stretching/compression factor $\alpha$ via FiLM. The pipeline comprises:
- Encoder $E$: projects the waveform $x$ to a frame sequence $h_{1:T}$, e.g., STFT spectrograms, transformer latents, or codec codebooks.
- FiLM Module: a small MLP generates affine modulation parameters $\gamma(\alpha)$ and $\beta(\alpha)$ specific to the scale factor $\alpha$. For each frame: $\tilde{h}_t = \gamma(\alpha) \odot h_t + \beta(\alpha)$.
- Temporal Interpolation: linearly resamples the modulated feature sequence to the target length determined by $\alpha$, yielding the desired output duration.
- Decoder $D$: reconstructs the output waveform $\hat{y}$ from the modulated, interpolated features.
Several encoder–decoder variants are implemented:
- STFT-HiFiGAN: 1024-dim log-mel spectrograms subjected to FiLM injection within HiFi-GAN.
- WavLM-HiFiGAN: 1024-dim latent features from WavLM transformer encoding, processed by HiFi-GAN.
- Whisper-HiFiGAN: Whisper-Medium encoder final-layer latents as FiLM input.
- EnCodec: EnCodec’s quantizer and decoder, with FiLM after the encoder and temporal interpolation performed on latent codes.
FiLM conditioning enables channel-wise adaptive modulation responsive to $\alpha$, supporting arbitrary continuous speed factors without retraining and outperforming naïve concatenation strategies by dynamically scaling and shifting signal representation channels (Wisnu et al., 3 Oct 2025).
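A minimal PyTorch sketch of the FiLM conditioning and temporal interpolation steps is given below; the layer sizes, the `FiLMLayer` name, and the convention that $\alpha > 1$ stretches are assumptions for illustration rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FiLMLayer(nn.Module):
    """Generates per-channel scale (gamma) and shift (beta) from the speed factor alpha."""
    def __init__(self, channels: int, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * channels),          # -> gamma and beta
        )

    def forward(self, h: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
        # h: (batch, channels, frames); alpha: (batch, 1)
        gamma, beta = self.mlp(alpha).chunk(2, dim=-1)
        return gamma.unsqueeze(-1) * h + beta.unsqueeze(-1)

def time_scale(h: torch.Tensor, alpha: float) -> torch.Tensor:
    """Linear temporal interpolation of a feature sequence to the target length."""
    target_len = max(1, int(round(h.shape[-1] * alpha)))   # assumes alpha > 1 stretches
    return F.interpolate(h, size=target_len, mode="linear", align_corners=False)

# Usage: modulate encoder features with alpha, then resample to the new duration.
film = FiLMLayer(channels=512)
feats = torch.randn(2, 512, 200)                  # dummy encoder output
alpha = torch.full((2, 1), 1.5)                   # stretch by 1.5x
stretched = time_scale(film(feats, alpha), 1.5)   # shape (2, 512, 300)
```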
3. Training Procedures and Loss Functions
STSM-FiLM is supervised using outputs generated by classical WSOLA for each input/speed-factor pair, thereby inheriting WSOLA's segmental alignment properties. The multi-component training objective is
$$\mathcal{L} = \lambda_{\mathrm{rec}}\,\mathcal{L}_{\mathrm{rec}} + \lambda_{\mathrm{adv}}\,\mathcal{L}_{\mathrm{adv}} + \lambda_{\mathrm{fm}}\,\mathcal{L}_{\mathrm{fm}},$$
where $\mathcal{L}_{\mathrm{rec}}$ penalizes sample-wise reconstruction error, $\mathcal{L}_{\mathrm{adv}}$ applies multi-scale adversarial discrimination, and $\mathcal{L}_{\mathrm{fm}}$ matches discriminator feature activations; the weights $\lambda_{\mathrm{rec}}$, $\lambda_{\mathrm{adv}}$, and $\lambda_{\mathrm{fm}}$ follow (Wisnu et al., 3 Oct 2025).
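The composite objective can be sketched as follows; the discriminator interface, the L1 reconstruction choice, and the default weights are placeholders, with the actual formulation and weights given in (Wisnu et al., 3 Oct 2025).

```python
import torch
import torch.nn.functional as F

def stsm_film_loss(y_hat, y_ref, discriminators,
                   w_rec=1.0, w_adv=1.0, w_fm=1.0):
    """Weighted sum of reconstruction, adversarial, and feature-matching terms.
    y_ref is the WSOLA-generated teacher output; weights here are placeholders."""
    # Sample-wise reconstruction against the WSOLA teacher signal.
    loss_rec = F.l1_loss(y_hat, y_ref)

    loss_adv, loss_fm = 0.0, 0.0
    for disc in discriminators:                      # multi-scale discriminators
        score_fake, feats_fake = disc(y_hat)         # assumed (score, features) interface
        _, feats_real = disc(y_ref)
        # Least-squares adversarial term for the generator.
        loss_adv = loss_adv + torch.mean((score_fake - 1.0) ** 2)
        # Match intermediate discriminator activations (feature matching).
        for f_fake, f_real in zip(feats_fake, feats_real):
            loss_fm = loss_fm + F.l1_loss(f_fake, f_real.detach())

    return w_rec * loss_rec + w_adv * loss_adv + w_fm * loss_fm
```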
4. Objective and Subjective Evaluation
Performance is rigorously quantified using metrics including PESQ (Perceptual Evaluation of Speech Quality), STOI (Short-Time Objective Intelligibility), DNSMOS, WER (Word Error Rate), and CER (Character Error Rate) across diverse scale factors $\alpha$. Average results for the principal variants are:
| System | PESQ | STOI | DNSMOS | WER | CER |
|---|---|---|---|---|---|
| STFT-HiFiGAN | 2.034 | 0.894 | 2.978 | 0.112 | 0.066 |
| WavLM-HiFiGAN | 1.924 | 0.891 | 2.986 | 0.103 | 0.055 |
| Whisper-HiFiGAN | 1.200 | 0.761 | 2.897 | 0.198 | 0.332 |
| EnCodec | 1.067 | 0.574 | 2.244 | 1.156 | 1.338 |
| TSM-Net (baseline) | 1.417 | 0.741 | 2.287 | 0.443 | 0.222 |
FiLM ablation demonstrates substantial perceptual improvement (ΔPESQ ≈ 0.5) at challenging stretch/compression values, particularly on WavLM-HiFiGAN (Wisnu et al., 3 Oct 2025). Subjective MOS scores also indicate listener preference for FiLM-based models over the classical baseline, with WavLM-HiFiGAN marginally exceeding WSOLA despite the latter serving as training teacher.
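Objective scores of this kind can be reproduced with open-source tools; the sketch below uses the `pesq` and `pystoi` Python packages, which is an assumption about tooling rather than the authors' exact evaluation stack.

```python
import numpy as np
from pesq import pesq          # ITU-T P.862 implementation (pip install pesq)
from pystoi import stoi        # short-time objective intelligibility (pip install pystoi)

def score_pair(reference: np.ndarray, degraded: np.ndarray, fs: int = 16000) -> dict:
    """PESQ and STOI for one reference/processed utterance pair at 16 kHz."""
    n = min(len(reference), len(degraded))          # crude length alignment
    ref, deg = reference[:n], degraded[:n]
    return {
        "pesq": pesq(fs, ref, deg, "wb"),           # wideband mode
        "stoi": stoi(ref, deg, fs, extended=False),
    }
```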
5. Mathematical Model and Theoretical Properties
Classical STSM can be formalized via a cylinder embedding that separates a signal's slowly varying envelope time from its cyclic phase. The core operators are:
- Cylinder Embedding: lifts the one-dimensional waveform onto a cylinder by means of an analysis kernel, with one coordinate tracking envelope time and the other cyclic phase.
- Resampling Operator: rescales the envelope-time coordinate by the target factor while leaving the cyclic-phase coordinate intact, realizing the duration change.
Guarantees include time-invariance, linearity, envelope preservation (for suitable kernels), robustness to non-harmonic frequencies, and exact recovery of plain resampling as a special case. Streaming discrete implementations require only constant per-sample work and minimal memory, lending themselves to efficient audio synthesis and real-time transformation (0911.5171).
6. Algorithmic and Computational Aspects
Discrete streaming algorithms implement two 1D interpolations along phase and time, facilitating rapid real-time synthesis with minimal latency. The computational cost is dominated by simple floating-point operations, typically three lerps and four memory fetches per output sample, with buffering of only a small window of input samples.
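The per-sample streaming pattern can be illustrated with a single linear interpolation into a small ring buffer, as in the sketch below; this shows only the constant-work access pattern (plain resampling) and not the full two-axis phase/time interpolation of (0911.5171).

```python
import numpy as np

def stream_resample(samples, rate: float, buf_len: int = 4096):
    """Constant work per output sample: one fractional read position,
    one lerp, and a small ring buffer of recent input samples.
    Assumes the read position stays within the last buf_len inputs."""
    buf = np.zeros(buf_len)
    filled = 0
    pos = 0.0                                   # fractional read position
    for s in samples:
        buf[filled % buf_len] = s
        filled += 1
        # emit every output sample whose read position is already covered
        while pos + 1 < filled:
            i = int(pos)
            frac = pos - i
            a = buf[i % buf_len]
            b = buf[(i + 1) % buf_len]
            yield (1.0 - frac) * a + frac * b   # single lerp per output sample
            pos += rate                          # input advance per output sample

# Usage sketch: rate=0.5 roughly doubles the number of output samples.
# out = np.fromiter(stream_resample(np.sin(0.1 * np.arange(1000)), rate=0.5), float)
```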
FiLM-based neural architectures retain low computational overhead in the modulation layer, with most computational cost in encoder–decoder forward passes. Classical implementations outperform STFT-based vocoders in latency and memory use, while learned variants surpass these in flexibility and continuous generalization (Wisnu et al., 3 Oct 2025, Chu et al., 2022).
7. Limitations and Future Opportunities
FiLM-based STSM systems exhibit artifacts associated with linear temporal interpolation, particularly under abrupt or extreme time-scaling. Specific observed limitations include quantization artifacts in the EnCodec variant and padding-induced artifacts in Whisper-HiFiGAN. All models currently depend on fixed linear feature-length resampling, which can introduce minor glitches.
Potential research directions include:
- Development of learnable time-alignment modules (e.g., attention-based warping).
- Joint multi-task losses targeting pitch preservation and prosody consistency.
- Investigation of alternative FiLM injection schedules (encoder vs. decoder).
- Robust pre-training for noisy or reverberant speech conditions.
Advanced mathematical models support theoretically exact envelope and phase manipulation, envelope-preservation under up to 50× scaling/compression, and seamless wave-shape looping, with established superiority over PSOLA/WSOLA in artifact reduction for non-speech signals and steady-state tones (0911.5171).
8. Relation to Contemporary Neural Time-Scale Modification
Recent neural TSM systems such as TSM-Net apply a high-compression autoencoder ("Neuralgram") for frame-free, latent space temporal stretching, using cubic interpolation and GAN-based learning objectives. TSM-Net achieves comparable MOS and objective scores to classical methods, with reduced computational cost and real-time operation (Chu et al., 2022). The approach is distinct from FiLM-conditioned STSM as it relies on fixed spatial interpolation of compressed latent vectors, with future work oriented toward learned super-resolution and adaptive bandwise compression.
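In code, TSM-Net's latent-stretching step can be illustrated as follows; the encoder and decoder are placeholders, and SciPy's cubic interpolation stands in for the system's exact interpolation scheme.

```python
import numpy as np
from scipy.interpolate import interp1d

def stretch_latents(z: np.ndarray, alpha: float) -> np.ndarray:
    """Cubic interpolation of a (channels, frames) latent sequence along time.
    Assumes at least 4 frames, as required by cubic interpolation."""
    n = z.shape[-1]
    new_n = max(4, int(round(n * alpha)))
    src = np.linspace(0.0, 1.0, n)
    dst = np.linspace(0.0, 1.0, new_n)
    return interp1d(src, z, kind="cubic", axis=-1)(dst)

# Usage sketch (encoder/decoder are hypothetical placeholders):
# z = encoder(waveform); y = decoder(stretch_latents(z, alpha=1.25))
```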
In sum, STSM-FiLM synthesizes the strengths of classical alignment algorithms with deep representational learning and adaptive continuous modulation, establishing a new standard for fine-grained, artifact-robust time-scale modification in speech and audio processing (Wisnu et al., 3 Oct 2025, 0911.5171, Chu et al., 2022).