Recurrent FiLM Generators for Sequence Modeling

Updated 12 March 2026

Recurrent FiLM generators are architectural modules that dynamically modulate CNN activations using RNN-produced scaling and shifting parameters, efficiently capturing long-range dependencies in sequential data such as text, audio, or genomic sequences.
They integrate a convolutional backbone with a recurrent network to generate adaptive FiLM parameters, offering improved performance over static modulations and deep pure-CNNs, while maintaining computational efficiency.
Empirical evaluations demonstrate that recurrent FiLM generators enhance accuracy in text classification, boost audio super-resolution quality, and reduce perplexity in language modeling with modest computational overhead.

Recurrent FiLM generators are architectural components designed to dynamically modulate convolutional neural network (CNN) activations through feature-wise linear modulation (FiLM) parameters produced by a recurrent neural network (RNN). This construction, exemplified by the Temporal FiLM (TFiLM) module, enables efficient capture of long-range dependencies in sequential data such as text, audio, or genomic sequences by allowing information from prior time steps to influence the current convolutional activations via learned, adaptive scaling and shifting coefficients (Birnbaum et al., 2019).

1. High-Level Data Flow and Architectural Overview

A recurrent FiLM generator processes a sequence of inputs $\{x_1,\dots,x_T\}$. A convolutional backbone (typically 1D convolutions with dilation or pooling) ingests a windowed subset of recent inputs at each time step $t$, producing a feature map $h_t \in \mathbb{R}^{C \times L}$, where $C$ is the number of feature channels and $L$ is the spatial or temporal extent. In parallel, an RNN (e.g., gated recurrent unit (GRU) or long short-term memory (LSTM)) maintains a hidden state $s_t \in \mathbb{R}^{H}$ that evolves over time.

At each time step, the RNN consumes a summary statistic of the convolutional output (such as global average pooling over $h_t$) or a direct embedding of $x_t$ (or both) as input $z_t$, updating its hidden state:

$s_t = \mathrm{GRU}(s_{t-1}, z_t)$

The RNN then predicts per-channel FiLM scale and shift parameters $(\gamma_t,\beta_t) \in \mathbb{R}^{C} \times \mathbb{R}^{C}$ via a linear projection:

$(\gamma_t, \beta_t) = W_{\phi} s_t + b_{\phi},\quad W_{\phi} \in \mathbb{R}^{2C \times H},\ b_{\phi} \in \mathbb{R}^{2C}$

These coefficients modulate the convolutional map as:

$\forall\, c \in \{1,\dots,C\},\ i\in\{1,\dots,L\}: \quad \hat h_{t,c,i} = \gamma_{t,c} h_{t,c,i} + \beta_{t,c}$

This modulated map $\hat h_t$ is forwarded to subsequent convolutional layers, classifiers, or decoders. Because the RNN state persists across time steps, dependencies of in-principle unbounded length can be encoded into the feature-wise modulations of the CNN. Pure feed-forward convolutions, by contrast, have bounded receptive fields, and the combined design is reported to be substantially cheaper to compute than deep recurrent stacks.
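The per-channel modulation above amounts to a broadcast multiply-add. The following is a minimal PyTorch sketch, assuming a batched layout of $(B, C, L)$ for $h_t$ and $(B, C)$ for $\gamma_t$ and $\beta_t$; the function name `film_modulate` and the toy values are illustrative and not taken from the cited work.

```python
import torch

def film_modulate(h, gamma, beta):
    """Per-channel FiLM: h_hat[b, c, i] = gamma[b, c] * h[b, c, i] + beta[b, c]."""
    # h: (B, C, L); gamma, beta: (B, C). unsqueeze(-1) adds a trailing axis so
    # each channel's coefficients broadcast across all L positions.
    return gamma.unsqueeze(-1) * h + beta.unsqueeze(-1)

# Toy example (assumed values): B=2, C=3, L=4, all activations equal to 1.
h = torch.ones(2, 3, 4)
gamma = torch.tensor([[1.0, 2.0, 3.0]]).expand(2, 3)
beta = torch.tensor([[0.0, -1.0, 0.5]]).expand(2, 3)
h_hat = film_modulate(h, gamma, beta)
print(h_hat[0, :, 0])  # one value per channel: 1*1+0, 2*1-1, 3*1+0.5
```

Every position within a channel receives the same scale and shift, which is what makes the modulation feature-wise rather than position-wise.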

2. Mathematical Formulation

At time $t$, the system can be formalized as:

Feature extraction via convolution:

$h_t = \mathrm{conv\_block}(x_{t-k+1:t}) \in \mathbb{R}^{C \times L}$

RNN update (with $z_t$ a function of $h_t$ or $x_t$):

$s_t = f_{\mathrm{RNN}}(s_{t-1}, z_t)$

where $z_t = \mathrm{mean}_i\, h_{t,:,i} \in \mathbb{R}^C$ is the average of $h_t$ over the $L$ positions, which is then projected to $\mathbb{R}^H$ before entering the RNN.

FiLM parameter generation:

$(\gamma_t, \beta_t) = W_\phi s_t + b_\phi$

Feature-wise modulation:

$\hat h_{t, c, i} = \gamma_{t,c} h_{t,c,i} + \beta_{t,c}$

Optionally, the modulated feature map is further processed (e.g., by passing through additional convolutions or non-linearity) or used for prediction.
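The four steps of this formulation can be collected into a single module. The sketch below is one possible PyTorch rendering under stated assumptions: the class name `RecurrentFiLMStep`, the use of `nn.GRUCell` for $f_{\mathrm{RNN}}$, the toy dimensions, and the random tensors standing in for the convolutional block's output are all illustrative choices, not the reference implementation.

```python
import torch
import torch.nn as nn

class RecurrentFiLMStep(nn.Module):
    """One time step: pool h_t, update the RNN state, emit (gamma, beta), modulate."""

    def __init__(self, channels, hidden):
        super().__init__()
        self.proj = nn.Linear(channels, hidden)          # z_t: R^C -> R^H
        self.cell = nn.GRUCell(hidden, hidden)           # f_RNN (assumed to be a GRU)
        self.to_film = nn.Linear(hidden, 2 * channels)   # W_phi, b_phi

    def forward(self, h_t, s_prev):
        z_t = self.proj(h_t.mean(dim=2))                 # mean over positions i -> (B, H)
        s_t = self.cell(z_t, s_prev)                     # s_t = f_RNN(s_{t-1}, z_t)
        gamma_t, beta_t = self.to_film(s_t).chunk(2, dim=1)   # each (B, C)
        h_hat = gamma_t.unsqueeze(-1) * h_t + beta_t.unsqueeze(-1)
        return h_hat, s_t

torch.manual_seed(0)
B, C, L, H, T = 2, 8, 16, 32, 5          # toy sizes chosen for illustration
step = RecurrentFiLMStep(C, H)
s = torch.zeros(B, H)                    # initial state s_0
for _ in range(T):
    h_t = torch.randn(B, C, L)           # stand-in for conv_block(x_{t-k+1:t})
    h_hat, s = step(h_t, s)
print(h_hat.shape, s.shape)
```

Carrying `s` across iterations is what lets earlier windows influence the scale and shift applied to the current one.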

3. Implementation Considerations

Key architectural decisions and optimizations include:

RNN Choices: A single-layer GRU with $H=256$ hidden units is typical; an LSTM with $H=128$–$512$ is also viable. The input $z_t$ may concatenate global-pooled $h_t$ (dimension $C$) and embeddings of $x_t$, projected via affine layers.
Integration Points: Commonly, a TFiLM layer follows every convolutional block; for lightweight variants, only the terminal block is modulated.
Computational Cost: The combined cost of recurrent and linear projections scales as $\mathcal{O}(T H^2 + T H C)$, yielding modest overhead for $H\sim 256$, $C\sim 64$, and sequence length $T$ in the thousands. The unbounded effective receptive field provided by recurrence contrasts sharply with the depth-limited field of pure CNNs.
Stability and Optimization: Training employs Adam (learning rate $10^{-3}$) or SGD with momentum. RNN gradients are clipped ($\|g\|_2 \leq 5$), and stability is improved via weight normalization on $W_\phi$ and layer normalization inside the RNN. Dropout ($p=0.1$–$0.3$) is applied to RNN inputs and feature maps.
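To make the $\mathcal{O}(T H^2 + T H C)$ scaling concrete, the following back-of-the-envelope sketch counts multiply-accumulate operations for the generator alone. The constants are assumptions: a single-layer GRU with three gates whose input has already been projected to $H$ dimensions, a $C \to H$ input projection, an $H \to 2C$ FiLM projection, and an illustrative $T = 4000$. The helper `film_generator_macs` is hypothetical, and bias and activation costs are ignored.

```python
def film_generator_macs(T, H, C):
    """Rough multiply-accumulate count for the recurrent FiLM generator.

    Assumes a single-layer GRU with H-dimensional input and state:
    3 gates x (input-to-hidden + hidden-to-hidden) = 6 * H * H per step,
    plus a C -> H input projection and an H -> 2C FiLM projection.
    """
    gru = 6 * H * H          # recurrent term, O(H^2) per step
    proj_in = C * H          # pooled features -> RNN input, O(HC)
    proj_out = 2 * C * H     # RNN state -> (gamma, beta), O(HC)
    return T * (gru + proj_in + proj_out)

macs = film_generator_macs(T=4000, H=256, C=64)
print(f"{macs:,} MACs")  # 1,769,472,000
```

With these values the $H^2$ term dominates, so the hidden size is the main lever on generator cost, while the total grows linearly in $T$.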

4. Empirical Evaluation

Performance of recurrent FiLM generators was assessed on classification, regression, and sequence modeling tasks:

Text classification (Yelp, AG News, DBpedia): A 4-block dilated CNN baseline achieves $\sim$88% accuracy. Static FiLM (parameters predicted once from the first token) yields $\sim$89%, whereas TFiLM with a GRU generator attains $\sim$90.5%, matching or exceeding much deeper pure-CNN or pure-RNN networks. Freezing the FiLM parameters reduces accuracy by $\sim$1.2 percentage points.
Audio super-resolution (×4 upsampling at 16 kHz): A pure CNN achieves 19 dB SNR; static FiLM improves this to 19.3 dB. TFiLM further raises SNR to 21 dB and exhibits improved high-frequency synthesis.
Language modeling (Penn Treebank): TFiLM-enhanced CNNs outperform comparable 1D-CNNs by $\sim$0.5 perplexity, closely matching a 2-layer LSTM but with a reduced parameter count. Using more than two RNN layers yields negligible gains.

For all tasks, TFiLM induces a computational slowdown of $\sim$ 1.1× relative to the base CNN, but remains $2$–$3$ times faster than deep RNNs processing the full sequence.

5. Advantages, Limitations, and Extensions

Advantages:

Conveys long-range temporal dependencies without necessitating very deep CNNs or unrolling extensive RNNs.
Channel-selective modulation by the RNN is parameter-efficient.
Modular and compatible with a range of convolutional architectures for audio, text, or vision.

Limitations:

Introduces the need to unroll an RNN over $T$ steps, albeit with a small hidden state.
Modulation is coarse (per-channel shift and scale), potentially less effective for tasks requiring precise intra-window timing.

Potential Extensions:

Substitution of the RNN with a self-attention mechanism (yielding an “attention-based FiLM generator”) for longer-range interactions.
Stacking recurrent FiLM generators at various depths, enabling “deep temporal modulation.”
Multi-modal fusion by learning $z_t$ jointly from diverse sources (e.g., language and vision).
Integration with conditional normalization layers for further gains.

6. Pseudocode and Workflow Summary

A compact pseudocode representation (PyTorch-like) is as follows:

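import torch

# Assumes x has shape (B, C_in, T); conv_backbone, proj, rnn_cell (e.g. nn.GRUCell),
# linear, and classify_or_decode are modules defined elsewhere.
s = torch.zeros(B, H)                                  # initial hidden state s_0
outputs = []
for t in range(k - 1, T):                              # first step with a full k-sample window
    h_t = conv_backbone(x[:, :, t - k + 1 : t + 1])    # window x_{t-k+1..t} -> (B, C, L)
    z_t = proj(h_t.mean(dim=2))                        # global avg-pool (B, C), then project -> (B, H)
    s = rnn_cell(z_t, s)                               # s_t from (z_t, s_{t-1}) -> (B, H)
    gamma_t, beta_t = linear(s).chunk(2, dim=1)        # each (B, C)
    h_mod = gamma_t[:, :, None] * h_t + beta_t[:, :, None]   # broadcast over L
    outputs.append(classify_or_decode(h_mod))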

All components—recurrent state evolution, per-channel linear modulation, and efficient convolutional feature extraction—together define the Temporal FiLM paradigm and its role as a recurrent FiLM generator for sequence modeling (Birnbaum et al., 2019).


References

Birnbaum et al. (2019). Temporal FiLM: Capturing Long-Range Sequence Dependencies with Feature-Wise Modulations.
