STSM-FiLM: Neural TSM with FiLM Conditioning

Updated 3 January 2026

The paper introduces a neural architecture leveraging FiLM conditioning to achieve robust time-scale modification of speech while preserving pitch.
It employs an encoder–FiLM–decoder paradigm with four encoder–decoder variants, evaluated with objective metrics such as PESQ (perceptual quality) and STOI (intelligibility).
FiLM conditioning enables continuous control of playback speed across a wide range of α, outperforming traditional WSOLA methods under non-stationary conditions.

STSM-FiLM is a fully neural architecture for time-scale modification (TSM) of speech, designed to alter the playback rate of audio without affecting pitch. It leverages Feature-Wise Linear Modulation (FiLM) as a continuous conditioning mechanism on the speed factor, aiming to outperform classical approaches such as Waveform Similarity-based Overlap-Add (WSOLA), especially under non-stationary or extreme time-scaling. STSM-FiLM employs an encoder–FiLM–decoder paradigm, supervised on WSOLA-generated pseudo-ground truth, and supports four distinct encoder–decoder variants, enabling robust generalization and high perceptual consistency across a wide range of time-scaling factors (Wisnu et al., 3 Oct 2025).

1. Neural Architecture and Workflow

The STSM-FiLM architecture follows an explicit encoder–conditioning–decoder workflow:

$x(t) \longrightarrow \text{Encoder} \longrightarrow \{f_t\} \xrightarrow{\text{FiLM}(\cdot ; \alpha)} \{f'_t\} \xrightarrow{\text{temporal interpolation}} \{\tilde{f}_t\} \longrightarrow \text{Decoder} \longrightarrow \hat{y}^\alpha(t)$

where $x(t)$ is the input waveform, $\alpha > 0$ is the speed factor, $\{f_t\}$ are intermediate features, and $\hat{y}^\alpha(t)$ is the TSM output.

The process comprises:

Encoding: Input waveforms are projected to high-level features. The choice of encoder and decoder varies by system variant.
FiLM Conditioning: Features are modulated via FiLM layers parametrized by $\alpha$.
Temporal Interpolation: The conditioned feature sequence is interpolated in time to achieve the target length $T^\alpha \approx \lceil\alpha \cdot T\rceil$.
Decoding: The modified sequence is reconstructed into a time-scaled waveform.

Pseudo-ground-truth targets for the output $\hat{y}^\alpha(t)$ are constructed by applying WSOLA to the input: $x^{\text{wsola}, \alpha}(t) = \operatorname{WSOLA}[x(t), \alpha]$.
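The four stages above can be traced end to end in a toy sketch. Everything here is an illustrative assumption rather than the paper's implementation: `toy_encoder`, `toy_film`, and `toy_decoder` are hypothetical stand-ins (simple framing, an identity map, and frame concatenation), and only the data flow and the target length $\lceil\alpha \cdot T\rceil$ follow the description.

```python
import math
import numpy as np

def toy_encoder(x, hop=4):
    """Hypothetical stand-in encoder: split the waveform into non-overlapping frames."""
    n_frames = len(x) // hop
    return x[: n_frames * hop].reshape(n_frames, hop)  # shape (T, C)

def toy_film(f, alpha):
    """Placeholder for FiLM conditioning; an identity map keeps the flow easy to follow."""
    return f

def interpolate_time(f, target_len):
    """Linearly resample a (T, C) feature sequence to target_len frames."""
    src = np.linspace(0.0, 1.0, num=f.shape[0])
    dst = np.linspace(0.0, 1.0, num=target_len)
    return np.stack([np.interp(dst, src, f[:, c]) for c in range(f.shape[1])], axis=1)

def toy_decoder(f):
    """Hypothetical stand-in decoder: concatenate frames back into a waveform."""
    return f.reshape(-1)

def tsm_forward(x, alpha):
    f = toy_encoder(x)                           # encoding
    f_mod = toy_film(f, alpha)                   # FiLM conditioning on alpha
    target_len = math.ceil(alpha * f.shape[0])   # T^alpha ~ ceil(alpha * T)
    f_tilde = interpolate_time(f_mod, target_len)
    return toy_decoder(f_tilde)                  # decoding

x = np.sin(np.linspace(0.0, 20.0, 160))          # 160 samples -> 40 frames
y = tsm_forward(x, alpha=1.5)                    # 60 frames -> 240 samples
print(len(x), len(y))
```

With these stand-ins the output length scales with $\alpha$ while each frame keeps its channel dimension, which is the property the real encoders and decoders are expected to preserve.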

2. Encoder–Decoder Variants

Four encoder–decoder configurations are supported, each reflecting different feature priors and reconstruction strategies:

System	Encoder Features	Decoder
STFT-HiFiGAN	$f_t = \log\lvert\operatorname{STFT}(x)\rvert$ ($1024$-dim log-mel)	HiFi-GAN vocoder
WavLM-HiFiGAN	Layer 6 WavLM-Large ($1024$-dim)	HiFi-GAN
Whisper-HiFiGAN	Last layer Whisper-Medium ($1024$-dim)	HiFi-GAN
EnCodec	EnCodec quantized codes $h_t$	EnCodec decoder

In STFT-, WavLM-, and Whisper-based systems, FiLM layers are applied after each convolutional block in the HiFi-GAN generator.
In EnCodec, FiLM is injected prior to quantization, and interpolation operates in latent space before decoding.

These designs enable trade-offs in fidelity and robustness: the spectral encoder (STFT) attains the highest PESQ and STOI, while the contextual encoder (WavLM) gives the best DNSMOS, subjective MOS, and ASR error rates.

3. FiLM Conditioning Mechanism

FiLM conditioning enables continuous control over the time-scaling parameter:

The speed factor $\alpha$ is mapped to affine modulation parameters $\gamma_\alpha, \beta_\alpha \in \mathbb{R}^{C \times 1}$ via a small MLP.
For each channel $c$, feature-wise linear modulation is performed:

$f'_t(c) = (1 + \gamma_\alpha(c)) \cdot f_t(c) + \beta_\alpha(c)$

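The identity offset (the “$1 +$” applied to $\gamma_\alpha$) keeps each FiLM layer close to an identity map whenever $\gamma_\alpha \approx 0$ and $\beta_\alpha \approx 0$, which stabilizes training and preserves the initial representation.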

This conditioning is realized across all FiLM layers in the generator. In EnCodec, FiLM is applied directly to $h_t$ (pre-quantization). The ability to interpolate $\alpha$ facilitates seamless generalization over a continuous range of speed factors without requiring separate models.
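As a minimal sketch of the modulation formula above, the following NumPy class maps $\alpha$ to $(\gamma_\alpha, \beta_\alpha)$ with a one-hidden-layer MLP and applies $(1 + \gamma_\alpha) \cdot f_t + \beta_\alpha$. The class name `FiLM`, the hidden size, the `tanh` activation, and the zero-initialised output layer are assumptions made for illustration; the source only states that a small MLP is used.

```python
import numpy as np

class FiLM:
    """Minimal FiLM layer: a one-hidden-layer MLP maps alpha to (gamma, beta)."""

    def __init__(self, channels, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.channels = channels
        self.w1 = rng.normal(scale=0.1, size=(1, hidden))
        self.b1 = np.zeros(hidden)
        # Assumption: zero-initialised output layer, so gamma = beta = 0 at the
        # start and the "1 + gamma" form makes the layer an identity map.
        self.w2 = np.zeros((hidden, 2 * channels))
        self.b2 = np.zeros(2 * channels)

    def params(self, alpha):
        h = np.tanh(np.array([[alpha]]) @ self.w1 + self.b1)  # (1, hidden)
        out = (h @ self.w2 + self.b2)[0]                      # (2 * C,)
        return out[: self.channels], out[self.channels :]

    def __call__(self, f, alpha):
        """f has shape (T, C); gamma and beta broadcast over the time axis."""
        gamma, beta = self.params(alpha)
        return (1.0 + gamma) * f + beta

film = FiLM(channels=4)
f = np.arange(12, dtype=float).reshape(3, 4)
out_init = film(f, alpha=1.5)      # identity at initialisation

# Stand-in for trained weights: force gamma = 0.5 and beta = -1 on every channel.
film.b2[:4] = 0.5
film.b2[4:] = -1.0
out_mod = film(f, alpha=1.5)       # equals 1.5 * f - 1
```

Because $\alpha$ enters as a real-valued input rather than a class index, the same layer can be queried at any speed factor without changing its parameters.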

4. Training Objectives and Supervision

Supervision relies entirely on WSOLA-generated pseudo-ground-truth targets. The overall loss comprises:

$\mathcal{L} = \lambda_1 \|\hat{y}^\alpha - x^{\text{wsola}, \alpha}\|_1 + \lambda_{\text{ADV}} \mathcal{L}_{\text{GAN}}(\hat{y}^{\alpha}) + \lambda_{\text{FM}} \mathcal{L}_{\text{FM}}(\hat{y}^\alpha, x^{\text{wsola}, \alpha})$

$\mathcal{L}_{\text{GAN}}$ is a HiFi-GAN style adversarial loss:

$\mathcal{L}_{\text{GAN}} = \mathbb{E}_x[\log D(x^{\text{wsola}, \alpha})] + \mathbb{E}_x[\log(1 - D(\hat{y}^{\alpha}))]$

$\mathcal{L}_{\text{FM}}$ is a feature-matching loss over internal discriminator features:

$\mathcal{L}_{\text{FM}} = \mathbb{E}_x\left[ \sum_l \|D_l(x^{\text{wsola}, \alpha}) - D_l(\hat{y}^{\alpha})\|_1 \right]$

No explicit alignment or regularization terms are required beyond these objectives.
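The three terms can be evaluated numerically as written. The sketch below is an assumption-laden illustration, not the training code: the discriminator outputs and intermediate features are passed in as plain arrays, the $\ell_1$ terms use a mean rather than a sum, the weights default to 1.0 because the source gives no values, and nothing here settles how generator and discriminator updates alternate.

```python
import numpy as np

def l1(a, b):
    """Mean absolute error (a length-normalised form of the L1 norm)."""
    return float(np.mean(np.abs(a - b)))

def gan_term(d_real, d_fake, eps=1e-7):
    """log D(target) + log(1 - D(output)), with D outputs as probabilities."""
    d_real = np.clip(d_real, eps, 1.0 - eps)
    d_fake = np.clip(d_fake, eps, 1.0 - eps)
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

def feature_matching(feats_real, feats_fake):
    """Sum over discriminator layers of the L1 distance between feature maps."""
    return sum(l1(r, k) for r, k in zip(feats_real, feats_fake))

def total_loss(y_hat, target, d_real, d_fake, feats_real, feats_fake,
               lam_1=1.0, lam_adv=1.0, lam_fm=1.0):
    """Weighted sum of the three terms; the weights are placeholders."""
    return (lam_1 * l1(y_hat, target)
            + lam_adv * gan_term(d_real, d_fake)
            + lam_fm * feature_matching(feats_real, feats_fake))

# Toy values standing in for a WSOLA target, a model output, and discriminator activity.
target = np.array([0.0, 1.0, 0.0, -1.0])
y_hat = np.array([0.0, 0.5, 0.0, -0.5])
d_real = np.array([0.5])
d_fake = np.array([0.5])
feats_real = [np.ones((2, 3)), np.zeros(4)]
feats_fake = [np.zeros((2, 3)), np.zeros(4)]

loss = total_loss(y_hat, target, d_real, d_fake, feats_real, feats_fake)
print(round(loss, 4))
```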

5. Time-Scale Modification and Signal Processing

Time-scale modification is parameterized by $\alpha$, which scales the output length $T^\alpha \approx \lceil\alpha \cdot T\rceil$:

$\alpha > 1$ induces time-expansion (slower playback).
$\alpha < 1$ induces time-compression (faster playback).

Resampling in the neural feature domain proceeds as linear interpolation from $T$ to $T^\alpha$ frames along the time axis: $\tilde{f} = \operatorname{Interp}(\{f'_t\}, T^\alpha)$. The resampled sequence is decoded back to a waveform by the respective decoder. The pipeline thereby mirrors the align-and-synthesize structure of classical WSOLA while operating on deep, learned representations.
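The $\operatorname{Interp}$ step can be written out explicitly: each output frame blends its two nearest source frames with a weight given by its fractional position. This is a sketch under assumptions; in particular, the endpoint-aligned sampling grid (first and last frames map onto each other) is one common convention and may differ from the paper's choice, and `interp_time` is a hypothetical helper name.

```python
import math
import numpy as np

def interp_time(f, alpha):
    """Linearly resample f of shape (T, C) to ceil(alpha * T) frames."""
    T = f.shape[0]
    T_out = max(1, math.ceil(alpha * T))
    # Fractional source position of every output frame (endpoint-aligned grid).
    pos = np.linspace(0.0, T - 1, num=T_out)
    lo = np.floor(pos).astype(int)        # left neighbour index
    hi = np.minimum(lo + 1, T - 1)        # right neighbour index, clamped
    w = (pos - lo)[:, None]               # blend weight in [0, 1)
    return (1.0 - w) * f[lo] + w * f[hi]

# A two-channel ramp makes the result easy to inspect.
f = np.stack([np.arange(4.0), 10.0 * np.arange(4.0)], axis=1)  # shape (4, 2)
slow = interp_time(f, alpha=2.0)   # 8 frames: time-expansion
fast = interp_time(f, alpha=0.5)   # 2 frames: time-compression
print(slow.shape, fast.shape)
```

On a linear ramp the interpolated values coincide with the fractional positions themselves, which gives a quick sanity check of the indexing.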

6. Empirical Evaluation and Generalization

STSM-FiLM is trained on VCTK (English) and TMHINT-QI (Mandarin) at 16 kHz, with $\alpha$ sampled from $[0.5, 2.0]$. Key evaluation metrics include PESQ, STOI, DNSMOS, and ASR-derived WER/CER. Main results (averaged over $\alpha = 0.5$–$2.0$) are summarized below:

System	PESQ	STOI	DNSMOS	WER	CER
STFT-HiFiGAN	2.034	0.894	2.978	0.112	0.066
WavLM-HiFiGAN	1.924	0.891	2.986	0.103	0.055
Whisper-HiFiGAN	1.200	0.761	2.897	0.198	0.332
EnCodec	1.067	0.574	2.244	1.156	1.338
TSM-Net (ref)	1.417	0.741	2.287	0.443	0.222
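The WER and CER columns are edit-distance ratios: the number of substitutions, deletions, and insertions needed to turn the ASR transcript into the reference, divided by the reference length. Because insertions count as errors, the ratio can exceed 1, as in the EnCodec row. The sketch below illustrates the standard definition only; the ASR model and any text normalisation used in the evaluation are not specified here, and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hypothesis.split()) / len(ref)

def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

print(wer("the cat sat", "the cat sat"))        # 0.0
print(wer("the cat sat", "the bat sat down"))   # one substitution + one insertion
print(wer("hello", "oh hello there you"))       # 3.0, i.e. above 1
print(cer("abc", "abd"))                        # one substitution out of three characters
```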

Robustness to $\alpha$ is reflected in flat PESQ/STOI curves for STFT- and WavLM-HiFiGAN across the tested scaling range. In subjective MOS tests (four systems, six speeds, $n=25$ listeners):

System	MOS
WSOLA	4.33
TSM-Net	1.89
STFT-HiFiGAN	3.57
WavLM-HiFiGAN	4.40

Ablation on FiLM shows that enabling FiLM yields +0.5–0.6 PESQ and +0.02–0.03 STOI improvements at speed factors away from $\alpha = 1$ (e.g., $\alpha = 0.7$ and $1.5$), supporting the stabilizing role of FiLM under stronger time-scaling.

FiLM-conditioning enables smooth generalization across $\alpha$ without retraining, in contrast to models lacking continuous conditioning.

7. Significance and Distinctions from Prior Art

STSM-FiLM demonstrates that direct FiLM-based conditioning on continuous playback rates enhances the generalization and flexibility of neural TSM architectures beyond what is achievable with discrete class-based or architectural approaches. By explicitly leveraging high-level learned representations and WSOLA-based supervision, STSM-FiLM achieves robust perceptual quality and intelligibility over a wide $\alpha$ domain.

The comparative evaluation illustrates that choice of encoder–decoder stack governs the trade-off between spectral fidelity (STFT-HiFiGAN) and contextual naturalness/ASR performance (WavLM-HiFiGAN). The architecture’s reliance on feature-level resampling, FiLM-mediated adaptation, and adversarial training represents a significant convergence of classical TSM alignment principles and modern deep generative modeling (Wisnu et al., 3 Oct 2025).


References (1)

Wisnu et al. (2025). STSM-FiLM: A FiLM-Conditioned Neural Architecture for Time-Scale Modification of Speech.
