Recurrent FiLM Generators for Sequence Modeling
- Recurrent FiLM generators are architectural modules that dynamically modulate CNN activations using RNN-produced scaling and shifting parameters, efficiently capturing long-range dependencies in sequential data such as text, audio, or genomic sequences.
- They integrate a convolutional backbone with a recurrent network to generate adaptive FiLM parameters, offering improved performance over static modulations and deep pure-CNNs, while maintaining computational efficiency.
- Empirical evaluations demonstrate that recurrent FiLM generators enhance accuracy in text classification, boost audio super-resolution quality, and reduce perplexity in language modeling with modest computational overhead.
Recurrent FiLM generators are architectural components designed to dynamically modulate convolutional neural network (CNN) activations through feature-wise linear modulation (FiLM) parameters produced by a recurrent neural network (RNN). This construction, exemplified by the Temporal FiLM (TFiLM) module, enables efficient capture of long-range dependencies in sequential data such as text, audio, or genomic sequences by allowing information from prior time steps to influence the current convolutional activations via learned, adaptive scaling and shifting coefficients (Birnbaum et al., 2019).
1. High-Level Data Flow and Architectural Overview
A recurrent FiLM generator processes a sequence of inputs $x_1, \dots, x_T$. A convolutional backbone (typically 1D convolutions with dilation or pooling) ingests a windowed subset of recent inputs at each time step $t$, producing a feature map $h_t \in \mathbb{R}^{C \times L}$, where $C$ is the number of feature channels and $L$ is the spatial or temporal extent. In parallel, an RNN (e.g., gated recurrent unit (GRU) or long short-term memory (LSTM)) maintains a hidden state $s_t$ that evolves over time.
At each time step, the RNN consumes a summary statistic of the convolutional output (such as global average pooling over $L$) or a direct embedding of $x_t$ (or both) as input $z_t$, updating its hidden state:
$$s_t = \mathrm{RNN}(z_t, s_{t-1})$$
The RNN then predicts per-channel FiLM scale and shift parameters via a linear projection:
$$[\gamma_t; \beta_t] = W s_t + b, \qquad \gamma_t, \beta_t \in \mathbb{R}^{C}$$
These coefficients modulate the convolutional map as:
$$\hat{h}_t = \gamma_t \odot h_t + \beta_t$$
where $\gamma_t$ and $\beta_t$ are broadcast across the $L$ positions of each channel.
This modulated map is forwarded to subsequent convolutional layers, classifiers, or decoders. The RNN’s temporal dynamics allow arbitrarily long-range dependencies to be encoded into the feature-wise modulations of the CNN, outperforming pure feed-forward convolutions (with bounded receptive fields) and offering substantial computational advantages compared to deep recurrent stacks.
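To make the bounded receptive field of a pure convolutional stack concrete, the following sketch computes it for stride-1 dilated 1D convolutions using the standard formula $1 + \sum_i (k_i - 1) d_i$. The function name and the example kernel sizes and dilations are illustrative assumptions, not settings reported in the source.

```python
def receptive_field(layers):
    """Receptive field (in input steps) of stacked stride-1 1D convolutions.

    `layers` is a sequence of (kernel_size, dilation) pairs; each layer
    widens the field by (kernel_size - 1) * dilation.
    """
    rf = 1
    for kernel_size, dilation in layers:
        rf += (kernel_size - 1) * dilation
    return rf


# Hypothetical 4-block dilated CNN: kernel size 3, dilations 1, 2, 4, 8.
blocks = [(3, 1), (3, 2), (3, 4), (3, 8)]
print(receptive_field(blocks))  # 31
```

However deep the stack, this number is fixed at design time, whereas the RNN state in a recurrent FiLM generator can in principle carry information forward from any earlier step.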
2. Mathematical Formulation
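At time $t$, the system can be formalized as:
- Feature extraction via convolution over the $k$ most recent inputs:
$$h_t = \mathrm{Conv}(x_{t-k+1:t}) \in \mathbb{R}^{C \times L}$$
- RNN update (with $z_t$ a function of $h_t$ or $x_t$):
$$s_t = \mathrm{RNN}(z_t, s_{t-1})$$
where $z_t \in \mathbb{R}^{C}$ is the global average of $h_t$ over its $L$ positions, then projected to $\mathbb{R}^{H}$.
- FiLM parameter generation:
$$[\gamma_t; \beta_t] = W s_t + b$$
- Feature-wise modulation:
$$\hat{h}_t = \gamma_t \odot h_t + \beta_t$$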
Optionally, the modulated feature map is further processed (e.g., by passing through additional convolutions or non-linearity) or used for prediction.
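As a small worked example of the feature-wise modulation step, the sketch below applies per-channel scale and shift values to a tiny feature map stored as nested lists (channels by positions). The helper name `film_modulate` and the numbers are illustrative assumptions; a practical implementation would rely on tensor broadcasting rather than Python lists.

```python
def film_modulate(h, gamma, beta):
    """Apply FiLM to a feature map h of shape (C, L).

    gamma and beta hold one value per channel; each is broadcast
    across all L positions of its channel.
    """
    if not (len(h) == len(gamma) == len(beta)):
        raise ValueError("h, gamma, and beta must agree on the channel count")
    return [[g * v + b for v in row] for row, g, b in zip(h, gamma, beta)]


# Two channels, three positions (illustrative values).
h_t = [[1.0, 2.0, 3.0],
       [4.0, 5.0, 6.0]]
gamma_t = [2.0, 0.5]
beta_t = [1.0, -1.0]

h_hat = film_modulate(h_t, gamma_t, beta_t)
print(h_hat)  # [[3.0, 5.0, 7.0], [1.0, 1.5, 2.0]]
```

Every position within a channel receives the same scale and shift, so the modulation can re-weight channels over time but cannot single out individual positions inside a window.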
3. Implementation Considerations
Key architectural decisions and optimizations include:
- RNN Choices: A single-layer GRU with $H$ hidden units is typical; an LSTM with up to $512$ hidden units is also viable. The input $z_t$ may concatenate the global-pooled $h_t$ (dimension $C$) and embeddings of $x_t$, projected via affine layers.
- Integration Points: Commonly, a TFiLM layer follows every convolutional block; for lightweight variants, only the terminal block is modulated.
- Computational Cost: The combined cost of the recurrent update and linear projections scales as $O(T(H^2 + HC))$, yielding modest overhead for typical values of $H$ and $C$ and sequence lengths in the thousands. The unbounded effective receptive field provided by recurrence contrasts sharply with the depth-limited field of pure CNNs.
- Stability and Optimization: Training employs Adam or SGD with momentum. RNN gradients are clipped, and stabilization is enhanced via weight normalization on the FiLM projection weights and layer normalization inside the RNN. Dropout (rates up to $0.3$) is applied to RNN inputs and feature maps.
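The computational-cost point can be illustrated with a rough count of multiply-accumulates. The sketch below assumes a generator made of a $C \to H$ input projection, a GRU cell with $H$-dimensional input and state, and an $H \to 2C$ output projection, and compares it with a single 1D convolution. All function names and sizes are assumptions chosen for illustration, not values from the source.

```python
def generator_macs_per_step(channels, hidden):
    """Approximate multiply-accumulates for one step of the FiLM generator.

    Counts only matrix products: the C -> H input projection, a GRU cell
    with H-dimensional input and state (three gates, each with an
    input-to-hidden and a hidden-to-hidden product), and the H -> 2C
    projection to (gamma, beta). Biases and elementwise ops are ignored.
    """
    proj = channels * hidden
    gru = 3 * (hidden * hidden + hidden * hidden)
    film = hidden * 2 * channels
    return proj + gru + film


def conv_macs_per_step(c_in, c_out, kernel_size, length):
    """Multiply-accumulates of one 1D convolution over `length` output positions."""
    return c_in * c_out * kernel_size * length


# Illustrative sizes (assumptions, not values from the source).
C, H, K, L, T = 128, 128, 3, 32, 4000
gen = generator_macs_per_step(C, H)
conv = conv_macs_per_step(C, C, K, L)
print(gen, conv, gen / conv)  # 147456 1572864 0.09375
print(gen * T)                # generator cost over T steps: 589824000
```

Under these assumed sizes the generator costs a small fraction of one convolution per step, and its total cost grows linearly with the sequence length.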
4. Empirical Evaluation
Performance of recurrent FiLM generators was assessed on classification, regression, and sequence modeling tasks:
- Text classification (Yelp, AG News, DBpedia): A 4-block dilated CNN baseline achieves 88% accuracy. Static FiLM (parameters predicted once from the first token) yields 89%, whereas TFiLM with a GRU generator attains 90.5%, matching or exceeding much deeper pure-CNN or pure-RNN networks. Freezing the FiLM parameters reduces accuracy by 1.2% absolute.
- Audio super-resolution (×4 upsampling at 16 kHz): A pure CNN achieves 19 dB SNR; static FiLM improves this to 19.3 dB. TFiLM further raises SNR to 21 dB and exhibits improved high-frequency synthesis.
- Language modeling (Penn Treebank): TFiLM-enhanced CNNs outperform comparable 1D-CNNs by 0.5 perplexity, closely matching a 2-layer LSTM but with a reduced parameter count. Using more than two RNN layers yields negligible gains.
For all tasks, TFiLM induces a computational slowdown of 1.1× relative to the base CNN, but remains $2$–$3$ times faster than deep RNNs processing the full sequence.
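For readers unfamiliar with the two numeric metrics quoted above, the sketch below assumes their usual definitions: SNR in decibels between a reference and a reconstructed signal, and perplexity as the exponentiated mean negative log-likelihood per token. The source does not spell out its evaluation code, so the function names and toy values are assumptions.

```python
import math


def snr_db(reference, estimate):
    """Signal-to-noise ratio in dB: 10 * log10(||ref||^2 / ||ref - est||^2).

    Undefined (division by zero) when the estimate equals the reference.
    """
    signal = sum(r * r for r in reference)
    noise = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10(signal / noise)


def perplexity(token_log_probs):
    """Perplexity: exp of the mean negative natural-log likelihood per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Toy values for illustration only.
ref = [1.0, -1.0, 1.0, -1.0]
est = [0.9, -1.1, 1.1, -0.9]
print(round(snr_db(ref, est), 2))                   # 20.0
print(round(perplexity([math.log(0.25)] * 4), 6))   # 4.0
```

Because SNR is logarithmic, a gain of 2 dB corresponds to a reduction in residual error energy by a factor of about $10^{0.2} \approx 1.58$.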
5. Advantages, Limitations, and Extensions
Advantages:
- Conveys long-range temporal dependencies without necessitating very deep CNNs or unrolling extensive RNNs.
- Channel-selective modulation by the RNN is parameter-efficient.
- Modular and compatible with a range of convolutional architectures for audio, text, or vision.
Limitations:
- Introduces the need to unroll an RNN over $T$ steps, albeit with a small hidden state.
- Modulation is coarse (per-channel shift and scale), potentially less effective for tasks requiring precise intra-window timing.
Potential Extensions:
- Substitution of the RNN with a self-attention mechanism (yielding an “attention-based FiLM generator”) for longer-range interactions.
- Stacking recurrent FiLM generators at various depths, enabling “deep temporal modulation.”
- Multi-modal fusion by learning jointly from diverse sources (e.g., language and vision).
- Integration with conditional normalization layers for further gains.
6. Pseudocode and Workflow Summary
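A compact PyTorch-style sketch is as follows (the callables passed in stand for the backbone, projections, recurrent cell, and output head):

```python
import torch

def tfilm_forward(x: torch.Tensor, conv_backbone, proj, rnn_cell, linear, classify_or_decode, k: int):
    """x: (B, C_in, T). rnn_cell is assumed to be GRU-style (returns only the new state)."""
    B, _, T = x.shape
    s = x.new_zeros(B, rnn_cell.hidden_size)              # initial RNN state
    outputs = []
    for t in range(T):
        window = x[:, :, max(0, t - k + 1): t + 1]        # last k inputs, ending at step t
        h = conv_backbone(window)                         # feature map, shape (B, C, L)
        z = proj(h.mean(dim=2))                           # global avg-pool (B, C) -> (B, H)
        s = rnn_cell(z, s)                                # recurrent update -> (B, H)
        gamma, beta = linear(s).chunk(2, dim=1)           # FiLM parameters, each (B, C)
        h_mod = gamma[:, :, None] * h + beta[:, :, None]  # per-channel scale and shift
        outputs.append(classify_or_decode(h_mod))
    return outputs
```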
Together, recurrent state evolution, per-channel linear modulation, and efficient convolutional feature extraction define the Temporal FiLM paradigm and its role as a recurrent FiLM generator for sequence modeling (Birnbaum et al., 2019).