STSM-FiLM: Neural TSM with FiLM Conditioning

Updated 3 January 2026
  • The paper introduces a neural architecture leveraging FiLM conditioning to achieve robust time-scale modification of speech while preserving pitch.
  • It employs an encoder–FiLM–decoder paradigm with four encoder–decoder variants, validated by metrics such as PESQ and STOI for perceptual quality.
  • FiLM conditioning enables continuous control of playback speed across a wide range of α, outperforming traditional WSOLA methods under non-stationary conditions.

STSM-FiLM is a fully neural architecture for time-scale modification (TSM) of speech, designed to alter the playback rate of audio without affecting pitch. It leverages Feature-Wise Linear Modulation (FiLM) as a continuous conditioning mechanism on the speed factor, aiming to outperform classical approaches such as Waveform Similarity-based Overlap-Add (WSOLA), especially under non-stationary or extreme time-scaling. STSM-FiLM employs an encoder–FiLM–decoder paradigm, supervised on WSOLA-generated pseudo-ground truth, and supports four distinct encoder–decoder variants, enabling robust generalization and high perceptual consistency across a wide range of time-scaling factors (Wisnu et al., 3 Oct 2025).

1. Neural Architecture and Workflow

The STSM-FiLM architecture follows an explicit encoder–conditioning–decoder workflow:

$$x(t) \longrightarrow \text{Encoder} \longrightarrow \{f_t\} \xrightarrow{\text{FiLM}(\cdot ; \alpha)} \{f'_t\} \xrightarrow{\text{temporal interpolation}} \{\tilde{f}_t\} \longrightarrow \text{Decoder} \longrightarrow \hat{y}^\alpha(t)$$

where $x(t)$ is the input waveform, $\alpha > 0$ is the speed factor, $\{f_t\}$ are intermediate features, and $\hat{y}^\alpha(t)$ is the TSM output.

The process comprises:

  • Encoding: Input waveforms are projected to high-level features. The choice of encoder and decoder varies by system variant.
  • FiLM Conditioning: Features are modulated via FiLM layers parametrized by $\alpha$.
  • Temporal Interpolation: The conditioned feature sequence is interpolated in time to achieve the target length $T^\alpha \approx \lceil\alpha \cdot T\rceil$.
  • Decoding: The modified sequence is reconstructed into a time-scaled waveform.

Pseudo-ground-truth targets for the output $\hat{y}^\alpha(t)$ are constructed by applying WSOLA to the input: $x^{\text{wsola}, \alpha}(t) = \operatorname{WSOLA}[x(t), \alpha]$.
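The four stages above can be sketched end to end with toy stand-ins. This is only a shape-level illustration, not the paper's implementation: `toy_encoder`, `toy_film`, and `toy_decoder` are hypothetical placeholders (simple framing, a hand-written scale, and un-framing), and the endpoint-aligned interpolation grid is an assumption.

```python
import math
import numpy as np

def toy_encoder(x, hop=4):
    """Hypothetical encoder: frame the waveform into features of shape (C, T)."""
    T = len(x) // hop
    return x[: T * hop].reshape(T, hop).T

def toy_film(f, alpha):
    """Hypothetical FiLM: per-channel scale and shift derived from alpha."""
    C = f.shape[0]
    gamma = 0.1 * (alpha - 1.0) * np.ones((C, 1))  # zero when alpha == 1
    beta = np.zeros((C, 1))
    return (1.0 + gamma) * f + beta

def interpolate_time(f, target_len):
    """Linearly interpolate each channel from T to target_len frames."""
    C, T = f.shape
    src = np.arange(T)
    dst = np.linspace(0.0, T - 1, target_len)
    return np.stack([np.interp(dst, src, f[c]) for c in range(C)])

def toy_decoder(f):
    """Hypothetical decoder: un-frame the (C, T) features into a waveform."""
    return f.T.reshape(-1)

def tsm(x, alpha, hop=4):
    f = toy_encoder(x, hop)                    # encoding
    f_mod = toy_film(f, alpha)                 # FiLM conditioning
    T_alpha = math.ceil(alpha * f.shape[1])    # target length
    f_tilde = interpolate_time(f_mod, T_alpha) # temporal interpolation
    return toy_decoder(f_tilde)                # decoding

x = np.sin(np.linspace(0.0, 20.0, 400))
y = tsm(x, alpha=1.5)
print(len(x), len(y))  # 400 600
```

With these placeholders the output length scales with $\alpha$ as expected; a real system would replace the framing and un-framing steps with one of the learned encoder–decoder pairs described in the next section.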

2. Encoder–Decoder Variants

Four encoder–decoder configurations are supported, each reflecting different feature priors and reconstruction strategies:

| System | Encoder Features | Decoder |
|---|---|---|
| STFT-HiFiGAN | $f_t = \log\lvert\operatorname{STFT}(x)\rvert$ ($1024$-dim log-mel) | HiFi-GAN vocoder |
| WavLM-HiFiGAN | Layer 6 WavLM-Large ($1024$-dim) | HiFi-GAN |
| Whisper-HiFiGAN | Last layer Whisper-Medium ($1024$-dim) | HiFi-GAN |
| EnCodec | EnCodec quantized codes $h_t$ | EnCodec decoder |

  • In STFT-, WavLM-, and Whisper-based systems, FiLM layers are applied after each convolutional block in the HiFi-GAN generator.
  • In EnCodec, FiLM is injected prior to quantization, and interpolation operates in latent space before decoding.

These designs enable trade-offs in fidelity and robustness, with spectral encoders (STFT) providing highest SNR and contextual encoders (WavLM) maximizing perceptual naturalness and ASR performance.

3. FiLM Conditioning Mechanism

FiLM conditioning enables continuous control over the time-scaling parameter:

  • The speed factor $\alpha$ is mapped to affine modulation parameters $(\gamma_\alpha, \beta_\alpha) \in \mathbb{R}^{C \times 1}$ via a small MLP.
  • For each channel $c$, feature-wise linear modulation is performed:

$$f'_t(c) = (1 + \gamma_\alpha(c)) \cdot f_t(c) + \beta_\alpha(c)$$

  • The additive identity (“$+1$” on $\gamma_\alpha$) ensures stable training and preserves the initial representation: when $\gamma_\alpha$ and $\beta_\alpha$ are zero, the layer reduces to the identity map.

This conditioning is realized across all FiLM layers in the generator. In EnCodec, FiLM is applied directly to $h_t$ (pre-quantization). The ability to interpolate $\alpha$ facilitates seamless generalization over a continuous range of speed factors without requiring separate models.
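A minimal NumPy sketch of one such FiLM layer follows. The MLP width, the `tanh` activation, and the zero initialization of the output layer are assumptions made for illustration; the source only states that a small MLP maps $\alpha$ to $(\gamma_\alpha, \beta_\alpha)$.

```python
import numpy as np

rng = np.random.default_rng(0)

class FiLM:
    """FiLM layer sketch: scalar alpha -> per-channel (gamma, beta)."""

    def __init__(self, channels, hidden=16):
        self.channels = channels
        self.W1 = rng.normal(0.0, 0.1, size=(1, hidden))
        self.b1 = np.zeros(hidden)
        # Assumed zero init of the output layer: gamma = beta = 0 at the start,
        # so the "+1" term makes the layer begin as the identity map.
        self.W2 = np.zeros((hidden, 2 * channels))
        self.b2 = np.zeros(2 * channels)

    def params(self, alpha):
        h = np.tanh(np.array([[alpha]]) @ self.W1 + self.b1)  # (1, hidden)
        out = (h @ self.W2 + self.b2)[0]                      # (2C,)
        return out[: self.channels], out[self.channels:]

    def __call__(self, f, alpha):
        # f has shape (C, T); gamma and beta broadcast over time as (C, 1).
        gamma, beta = self.params(alpha)
        return (1.0 + gamma[:, None]) * f + beta[:, None]

layer = FiLM(channels=4)
f = rng.normal(size=(4, 10))
f_init = layer(f, alpha=1.3)   # equals f under the zero initialization

layer.b2[:4] = 0.5             # pretend training moved gamma to 0.5
f_scaled = layer(f, alpha=1.3) # equals 1.5 * f
```

Because $\alpha$ enters as a real-valued input to the MLP, any value in the training range can be queried without changing the model.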

4. Training Objectives and Supervision

Supervision is entirely pseudo-grounded on WSOLA targets. The overall loss comprises:

$$\mathcal{L} = \lambda_1 \|\hat{y}^\alpha - x^{\text{wsola}, \alpha}\|_1 + \lambda_{\text{ADV}} \mathcal{L}_{\text{GAN}}(\hat{y}^{\alpha}) + \lambda_{\text{FM}} \mathcal{L}_{\text{FM}}(\hat{y}^\alpha, x^{\text{wsola}, \alpha})$$

  • $\mathcal{L}_{\text{GAN}}$ is a HiFi-GAN style adversarial loss:

$$\mathcal{L}_{\text{GAN}} = \mathbb{E}_x[\log D(x^{\text{wsola}, \alpha})] + \mathbb{E}_x[\log(1 - D(\hat{y}^{\alpha}))]$$

  • $\mathcal{L}_{\text{FM}}$ is a feature-matching loss over internal discriminator features:

$$\mathcal{L}_{\text{FM}} = \mathbb{E}_x\left[ \sum_l \|D_l(x^{\text{wsola}, \alpha}) - D_l(\hat{y}^{\alpha})\|_1 \right]$$

No explicit alignment or regularization terms are required beyond these objectives.
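The sketch below simply evaluates the three terms as written for a single example. The weights default to `1.0` as placeholders because the source does not report $\lambda_1$, $\lambda_{\text{ADV}}$, or $\lambda_{\text{FM}}$; the discriminator outputs and feature maps are passed in as plain arrays, and how the adversarial term is split between generator and discriminator updates is not specified in the source and is not modeled here.

```python
import numpy as np

def l1_term(y_hat, target):
    """L1 norm between the network output and the WSOLA target."""
    return np.abs(y_hat - target).sum()

def gan_term(d_real, d_fake, eps=1e-7):
    """log D(target) + log(1 - D(output)), with D outputs in (0, 1)."""
    d_real = np.clip(d_real, eps, 1.0 - eps)
    d_fake = np.clip(d_fake, eps, 1.0 - eps)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

def fm_term(feats_real, feats_fake):
    """Sum over discriminator layers of the L1 distance between feature maps."""
    return sum(np.abs(fr - ff).sum() for fr, ff in zip(feats_real, feats_fake))

def total_loss(y_hat, target, d_real, d_fake, feats_real, feats_fake,
               lam_1=1.0, lam_adv=1.0, lam_fm=1.0):
    # Placeholder weights: the actual lambda values are not given in the source.
    return (lam_1 * l1_term(y_hat, target)
            + lam_adv * gan_term(d_real, d_fake)
            + lam_fm * fm_term(feats_real, feats_fake))

# Toy check: identical output and target, and an undecided discriminator.
target = np.linspace(-1.0, 1.0, 8)
feats = [np.ones((2, 4)), np.ones((3, 2))]
loss = total_loss(target.copy(), target,
                  d_real=np.array([0.5]), d_fake=np.array([0.5]),
                  feats_real=feats, feats_fake=feats)
```

In this toy case the L1 and feature-matching terms vanish, so the total reduces to the adversarial term, $2\log 0.5$.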

5. Time-Scale Modification and Signal Processing

Time-scale modification is parameterized by $\alpha$:

  • $\alpha > 1$ induces time-expansion (slower playback).
  • $\alpha < 1$ induces time-compression (faster playback).

Resampling in the neural feature domain proceeds as linear interpolation from $T$ to $T^\alpha$ frames along the time axis:

$$\tilde{f} = \operatorname{Interp}(\{f'_t\}, T^\alpha)$$

This resampled sequence is decoded back to a waveform using the respective decoder. The pipeline thereby mimics classical WSOLA alignment and synthesis with deep, learned representations.
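As a small worked example of the resampling step, the sketch below stretches a one-channel ramp of $T = 5$ frames by $\alpha = 2$. The function name `resample_features` and the endpoint-aligned sampling grid are assumptions; the source specifies only that the interpolation is linear along time.

```python
import math
import numpy as np

def resample_features(f, alpha):
    """Linearly interpolate f of shape (C, T) to ceil(alpha * T) frames."""
    C, T = f.shape
    T_alpha = math.ceil(alpha * T)
    src = np.arange(T)                       # original frame positions
    dst = np.linspace(0.0, T - 1, T_alpha)   # assumed endpoint-aligned grid
    return np.stack([np.interp(dst, src, f[c]) for c in range(C)])

ramp = np.arange(5.0)[None, :]               # shape (1, 5): values 0, 1, 2, 3, 4
stretched = resample_features(ramp, alpha=2.0)
print(stretched.shape)                       # (1, 10)
```

The first and last frames are preserved and the intermediate frames are evenly spaced between them, so the feature trajectory is traversed more slowly for $\alpha > 1$ and more quickly for $\alpha < 1$.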

6. Empirical Evaluation and Generalization

STSM-FiLM is trained on VCTK (English) and TMHINT-QI (Mandarin) at 16 kHz, with $\alpha$ sampled from $[0.5, 2.0]$. Key evaluation metrics include PESQ, STOI, DNSMOS, and ASR-derived WER/CER. Main results (averaged over $\alpha = 0.5$–$2.0$) are summarized below:

| System | PESQ | STOI | DNSMOS | WER | CER |
|---|---|---|---|---|---|
| STFT-HiFiGAN | 2.034 | 0.894 | 2.978 | 0.112 | 0.066 |
| WavLM-HiFiGAN | 1.924 | 0.891 | 2.986 | 0.103 | 0.055 |
| Whisper-HiFiGAN | 1.200 | 0.761 | 2.897 | 0.198 | 0.332 |
| EnCodec | 1.067 | 0.574 | 2.244 | 1.156 | 1.338 |
| TSM-Net (ref) | 1.417 | 0.741 | 2.287 | 0.443 | 0.222 |
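The WER and CER columns can exceed 1, as in the EnCodec row. Assuming the standard edit-distance definition (the source does not describe the ASR system or any text normalization), the error rate is the number of substitutions, deletions, and insertions divided by the reference length, so a transcript with many inserted tokens yields a ratio above 1. A minimal sketch with made-up strings:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def wer(ref, hyp):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)

def cer(ref, hyp):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)

# Illustrative strings only, not data from the paper.
print(wer("the cat sat", "the cat sat"))   # 0.0
print(wer("the cat", "a big dog ran off")) # 2.5
```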

Robustness to $\alpha$ is reflected in flat PESQ/STOI curves for STFT- and WavLM-HiFiGAN across the tested scaling range. In subjective MOS tests (four systems, six speeds, $n = 25$ listeners):

| System | MOS |
|---|---|
| WSOLA | 4.33 |
| TSM-Net | 1.89 |
| STFT-HiFiGAN | 3.57 |
| WavLM-HiFiGAN | 4.40 |

Ablation on FiLM shows that enabling FiLM yields +0.5–0.6 PESQ and +0.02–0.03 STOI improvements at extreme $\alpha$ (e.g., $\alpha = 0.7, 1.5$), confirming the stabilizing role of FiLM at non-stationary or unusual stretch factors.

FiLM conditioning enables smooth generalization across $\alpha$ without retraining, in contrast to models lacking continuous conditioning.

7. Significance and Distinctions from Prior Art

STSM-FiLM demonstrates that direct FiLM-based conditioning on continuous playback rates enhances the generalization and flexibility of neural TSM architectures beyond what is achievable with discrete class-based or architectural approaches. By explicitly leveraging high-level learned representations and WSOLA-based supervision, STSM-FiLM achieves robust perceptual quality and intelligibility over a wide range of $\alpha$.

The comparative evaluation illustrates that the choice of encoder–decoder stack governs the trade-off between spectral fidelity (STFT-HiFiGAN) and contextual naturalness/ASR performance (WavLM-HiFiGAN). The architecture’s reliance on feature-level resampling, FiLM-mediated adaptation, and adversarial training combines classical TSM alignment principles with modern deep generative modeling (Wisnu et al., 3 Oct 2025).
