
STSM-FiLM Neural Time-Scale Modification

Updated 7 October 2025
  • STSM-FiLM is a neural time-scale modification system that employs FiLM conditioning to enable continuous control over speech playback speed without altering pitch.
  • The framework supports several encoder–decoder variants, including STFT-HiFiGAN and WavLM-HiFiGAN, achieving high perceptual consistency across a range of speed factors.
  • Training leverages WSOLA-generated pseudo-targets along with adversarial and feature matching losses to achieve robust performance even under extreme time-stretching conditions.

STSM-FiLM is a fully neural architecture for time-scale modification (TSM) of speech, which aims to alter the playback rate of an audio signal while preserving its pitch. Classical systems such as Waveform Similarity-based Overlap-Add (WSOLA) provide baseline solutions but tend to introduce artifacts, especially under non-stationary or extreme stretching conditions. STSM-FiLM incorporates Feature-Wise Linear Modulation (FiLM) to condition the model on a continuous speed factor, enabling controllable and flexible time-scaling. The model is trained to mimic the alignment and synthesis behavior of WSOLA by using its outputs for supervision, while benefiting from deep feature representations learned through neural networks. Four encoder–decoder configurations are evaluated: STFT-HiFiGAN, WavLM-HiFiGAN, Whisper-HiFiGAN, and EnCodec, each demonstrating high perceptual consistency over a wide range of scale factors (Wisnu et al., 3 Oct 2025).

1. Principles and Motivation of Time-Scale Modification via Neural Architectures

Time-Scale Modification (TSM) is the task of adjusting the duration of speech signals (compressing or stretching them) without distorting pitch. Traditional algorithms (WSOLA, PSOLA) rely on local alignment heuristics and signal-level manipulations, which often fail under rapid or non-stationary changes due to local misalignment, phase incoherence, and discontinuities. Neural TSM methods are designed to learn representations and transformations that more robustly preserve signal structure, naturalness, and intelligibility. However, most previously published neural TSM models generalize poorly and lack explicit control over the playback rate, since they are trained only on discrete speed factors.
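To make the classical baseline concrete, the snippet below time-stretches speech with librosa's phase vocoder. This is a minimal sketch: librosa implements a phase-vocoder method rather than WSOLA, and the input filename is a placeholder, but the pitch-preserving goal is the same.

```python
# Classical TSM baseline via librosa's phase vocoder (not WSOLA, but the
# same goal: change duration without changing pitch).
import librosa

y, sr = librosa.load("speech.wav", sr=16000)  # placeholder input file

# In librosa's convention, rate > 1 shortens the signal (faster playback)
# and rate < 1 lengthens it (slower playback).
y_fast = librosa.effects.time_stretch(y, rate=2.0)
y_slow = librosa.effects.time_stretch(y, rate=0.5)

print(len(y) / sr, len(y_fast) / sr, len(y_slow) / sr)  # durations in seconds
```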

STSM-FiLM addresses this gap by introducing feature-wise conditioning on a continuous speed factor $\alpha$. By employing FiLM layers, the model allows direct, differentiable control over the time-scaling variable and learns to generate outputs that are both perceptually and contextually consistent over a broad range of modification rates, supervised by WSOLA as a teacher.

2. STSM-FiLM Architecture and FiLM Conditioning

STSM-FiLM follows an encoder–FiLM–decoder scheme:

  • Encoder: The input waveform is transformed to a latent space, either via STFT, neural feature models (WavLM, Whisper), or EnCodec’s own encoder.
  • FiLM Conditioning: A multilayer perceptron (MLP) maps the continuous speed factor $\alpha$ to affine parameters $(\gamma_\alpha, \beta_\alpha)$, which are applied as:

$$\hat{f}_t = (1 + \gamma_\alpha) \cdot f_t + \beta_\alpha$$

where $f_t$ is the latent feature at time $t$ and the $1 +$ offset ensures an identity mapping at initialization.

  • Decoder: The modulated features are synthesized back to waveform. HiFi-GAN is used for STFT, WavLM, and Whisper feature sets; EnCodec uses its quantizer and decoder for direct waveform reconstruction.

FiLM layers are integrated at multiple points (typically in the decoder, or pre-quantization in EnCodec), ensuring dynamic modulation of internal representations without explicit sequence-length alteration. Temporal duration adjustment is achieved by linear interpolation of the feature or latent code sequence in proportion to $\alpha$, as sketched below.
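The following PyTorch sketch illustrates the FiLM mechanism and the interpolation step under stated assumptions: module names, dimensions, and hidden sizes are illustrative choices, not taken from the paper's implementation.

```python
# Minimal PyTorch sketch of FiLM conditioning on a continuous speed factor
# alpha, plus linear temporal interpolation of the feature sequence.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FiLM(nn.Module):
    def __init__(self, feat_dim: int, hidden: int = 64):
        super().__init__()
        # MLP maps scalar alpha to per-channel affine parameters (gamma, beta).
        self.mlp = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * feat_dim),
        )

    def forward(self, feats: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, feat_dim); alpha: (batch,)
        gamma, beta = self.mlp(alpha.unsqueeze(-1)).chunk(2, dim=-1)
        # The (1 + gamma) offset keeps the layer near identity at initialization.
        return (1 + gamma.unsqueeze(1)) * feats + beta.unsqueeze(1)

def time_scale(feats: torch.Tensor, alpha: float) -> torch.Tensor:
    # Resample the sequence so its length is proportional to alpha,
    # matching the interpolation step described above.
    new_len = max(1, round(feats.shape[1] * alpha))
    return F.interpolate(
        feats.transpose(1, 2), size=new_len, mode="linear", align_corners=False
    ).transpose(1, 2)

film = FiLM(feat_dim=1024)                 # e.g., WavLM-sized features
x = torch.randn(2, 100, 1024)              # (batch, frames, channels)
alpha = torch.tensor([0.5, 2.0])
y = time_scale(film(x, alpha), alpha=2.0)  # modulate, then resample in time
print(y.shape)                             # torch.Size([2, 200, 1024])
```

Because $\gamma_\alpha$ and $\beta_\alpha$ are produced by an MLP from a scalar, any $\alpha$ in the training range can be requested at inference time, which is what gives the model its continuous speed control.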

3. Encoder–Decoder Variants in STSM-FiLM

The framework incorporates four encoder–decoder designs:

| Variant | Encoder Input | Decoder Architecture |
|---|---|---|
| STFT-HiFiGAN | Log-magnitude STFT | HiFi-GAN |
| WavLM-HiFiGAN | 1024-D WavLM features | HiFi-GAN |
| Whisper-HiFiGAN | Whisper encoder output | HiFi-GAN |
| EnCodec | EnCodec latent codes | EnCodec quantizer + decoder |
  • STFT-HiFiGAN uses spectral features from STFT.
  • WavLM-HiFiGAN utilizes self-supervised transformer features trained on speech data, with demonstrated improvement in naturalness and intelligibility.
  • Whisper-HiFiGAN leverages multilingual speech features, though it is susceptible to padding artifacts.
  • EnCodec employs latent codes and quantization with FiLM-conditioned latent manipulation.

Each variant applies FiLM conditioning appropriate to its latent space, followed by temporal interpolation.
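As a concrete example of one encoder, the sketch below extracts 1024-D WavLM features with Hugging Face transformers. The checkpoint (microsoft/wavlm-large) and the use of the final hidden layer are assumptions; the paper may use a different checkpoint or layer.

```python
# Hedged sketch: extracting 1024-D WavLM features as encoder input for the
# WavLM-HiFiGAN variant. Checkpoint and layer choice are assumptions.
import torch
from transformers import AutoFeatureExtractor, WavLMModel

extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-large")
wavlm = WavLMModel.from_pretrained("microsoft/wavlm-large").eval()

waveform = torch.randn(16000)  # 1 s of 16 kHz audio as a stand-in
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    feats = wavlm(**inputs).last_hidden_state  # (1, frames, 1024)
print(feats.shape)
```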

4. Training Regime and Loss Functions

  • Supervision: WSOLA outputs are generated for each $(x, \alpha)$ pair and used as pseudo-targets during training. WSOLA's reliability on clean speech makes it a suitable teacher, although it can suffer under extreme stretching.
  • Objective:

$$\mathcal{L} = \lambda_{L1} \, \| \hat{x}_\alpha - x_\alpha^{\mathrm{WSOLA}} \|_1 + \lambda_{\mathrm{adv}} \, \mathcal{L}_{\mathrm{GAN}} + \lambda_{\mathrm{fm}} \, \mathcal{L}_{\mathrm{FM}}$$

where $\mathcal{L}_{\mathrm{GAN}}$ is the adversarial HiFi-GAN loss and $\mathcal{L}_{\mathrm{FM}}$ is a feature-matching loss. Adam is used for optimization, with $\alpha$ sampled from $[0.5, 2.0]$ to expose the model to the full range of speed factors.
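A schematic PyTorch training step under this objective is sketched below. The model, discriminator, and wsola callables, the loss weights, and the discriminator's (logits, feature-maps) return structure are all placeholders, and the discriminator's own update is omitted for brevity.

```python
# Schematic generator update: L1 against the WSOLA pseudo-target plus
# adversarial and feature-matching terms. All names and weights are
# placeholders, not the paper's actual values.
import random
import torch
import torch.nn.functional as F

lambda_l1, lambda_adv, lambda_fm = 45.0, 1.0, 2.0  # assumed weights

def training_step(model, discriminator, wsola, x, optimizer):
    alpha = random.uniform(0.5, 2.0)   # continuous speed factor
    target = wsola(x, alpha)           # pseudo-target from the WSOLA teacher
    x_hat = model(x, alpha)

    # Assumed HiFi-GAN-style discriminator: per-scale logits plus
    # intermediate feature maps for feature matching.
    logits_fake, feats_fake = discriminator(x_hat)
    _, feats_real = discriminator(target)

    loss_l1 = F.l1_loss(x_hat, target)
    loss_adv = sum(((lf - 1) ** 2).mean() for lf in logits_fake)  # LS-GAN
    loss_fm = sum(F.l1_loss(ff, fr.detach())
                  for ff, fr in zip(feats_fake, feats_real))

    loss = lambda_l1 * loss_l1 + lambda_adv * loss_adv + lambda_fm * loss_fm
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```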

5. Evaluation: Perceptual Quality and Generalization

  • Objective Metrics: STFT-HiFiGAN yields the highest PESQ and STOI, while WavLM-HiFiGAN excels in DNSMOS and ASR error rates. FiLM conditioning produces a consistent PESQ improvement of up to 0.6 points, especially at extreme $\alpha$ values (see the metric sketch after this list).
  • Intelligibility and Naturalness: WavLM-HiFiGAN with FiLM conditioning leads to lower WER and CER, indicating more robust preservation of linguistic content under speed modification.
  • Subjective Results: In MOS tests, WavLM-HiFiGAN earns scores as high as, or slightly higher than, WSOLA on both English and Mandarin, confirming the utility of FiLM-based neural TSM for perceptual quality.
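For reference, the objective scores above can be computed with the common pesq and pystoi packages, as in the minimal sketch below. The file names are placeholders, and the choice of reference signal for $\alpha \neq 1$ (e.g., the WSOLA output at the same $\alpha$) follows the paper's protocol, which is not reproduced here.

```python
# Objective evaluation with standard packages:
#   pip install pesq pystoi soundfile
import soundfile as sf
from pesq import pesq
from pystoi import stoi

ref, sr = sf.read("reference.wav")  # reference speech at 16 kHz (placeholder)
deg, _ = sf.read("processed.wav")   # system output (placeholder)

n = min(len(ref), len(deg))         # trim to a common length before scoring
ref, deg = ref[:n], deg[:n]

print("PESQ (wideband):", pesq(sr, ref, deg, "wb"))
print("STOI:", stoi(ref, deg, sr, extended=False))
```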

6. Implications, Strengths, and Limitations

  • FiLM conditioning enables smooth, granular control over $\alpha$, providing dynamic adjustment without retraining or architectural change.
  • The HiFi-GAN and WavLM variants demonstrate strong generalization across unseen speed factors, outperforming previous neural and classical methods, especially under challenging conditions.
  • Whisper-HiFiGAN struggles with padding-induced artifacts, and EnCodec can introduce quantization artifacts at high compression rates.
  • A plausible implication is that richer contextual features (e.g., from WavLM) paired with FiLM conditioning are critical for robust neural TSM across languages and acoustic conditions.
  • Extension to tasks like voice conversion and expressive synthesis is suggested, as is further exploration of alternative conditioning mechanisms for enhanced artifact removal and control.

7. Future Research Directions

  • Integration of more robust contextual encoders to reduce artifacts, particularly in cross-lingual or noisy scenarios.
  • Refinement of FiLM-based modulation to mitigate issues specific to neural codec quantization.
  • Investigation of multi-factor conditioning (e.g., controlling both speed and pitch simultaneously), as well as self-supervised learning for TSM without reliance on classical teacher outputs.

STSM-FiLM establishes FiLM-conditioned neural architectures as competitive solutions for flexible, high-quality, and generalizable time-scale modification of speech. The system leverages classical supervision and deep learning, providing smooth control over playback speed and consistent output quality across a wide parameter space (Wisnu et al., 3 Oct 2025).
