
FiLM Layer: Conditional Feature Modulation

Updated 8 February 2026
  • FiLM layers are conditional modulation units that apply per-feature affine transformations to neural activations based on external inputs.
  • They integrate linguistic embeddings with visual features in convolutional networks, enabling adaptive, multi-step reasoning.
  • Empirical results on CLEVR benchmarks show that FiLM significantly improves accuracy and supports zero-shot learning through parameter arithmetic.

FiLM (Feature-wise Linear Modulation) is a general-purpose conditioning layer for neural networks, introduced to integrate external information by applying simple, per-feature affine transformations to activations within a network. FiLM enables adaptive modulation of intermediate feature maps conditioned on auxiliary input, such as a linguistic embedding, supporting complex visual reasoning tasks. In the context of vision-and-language models, FiLM layers allow a question embedding to influence the computation of visual feature maps, removing the need for hand-crafted reasoning modules and supporting multi-step, high-level compositional reasoning (Perez et al., 2017).

1. Definition and Mathematical Formulation

A FiLM layer operates by allowing a “FiLM generator” (commonly a subnetwork that processes conditioning information) to modulate a “FiLM-ed network” through feature-wise affine transformations. Given the activation $\mathbf{F}_{i,c}$ for the $c$th feature map of the $i$th sample, FiLM applies:

$$\mathrm{FiLM}\bigl(\mathbf{F}_{i,c} \mid \gamma_{i,c}, \beta_{i,c}\bigr) = \gamma_{i,c}\,\mathbf{F}_{i,c} + \beta_{i,c}$$

Here, $\gamma_{i,c}$ (scaling) and $\beta_{i,c}$ (shifting) are modulating parameters, generated as functions of the conditioning input $\mathbf{x}_i$:

$$\gamma_{i,c} = f_c(\mathbf{x}_i), \quad \beta_{i,c} = h_c(\mathbf{x}_i)$$

A single FiLM generator network typically produces the full vectors $\bm\gamma_i$ and $\bm\beta_i$ for all features, allowing downstream vision-network layers to be dynamically adjusted based on higher-level context.
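As a concrete sketch, the transformation above amounts to a broadcasted multiply-and-add per feature map. The NumPy snippet below illustrates this; the shapes (128 feature maps of size 14×14) are taken from the architecture described later, and the constant values are purely illustrative:

```python
import numpy as np

def film(F, gamma, beta):
    """Feature-wise linear modulation for one sample.

    F     : activations, shape (C, H, W) -- C feature maps
    gamma : per-feature scale, shape (C,)
    beta  : per-feature shift, shape (C,)
    """
    # Broadcast (C,) -> (C, 1, 1) so each feature map gets a single
    # scale and shift applied uniformly across spatial locations.
    return gamma[:, None, None] * F + beta[:, None, None]

F = np.ones((128, 14, 14))       # e.g. 128 feature maps of size 14x14
gamma = np.full(128, 2.0)        # scaling parameters
beta = np.full(128, -1.0)        # shifting parameters
out = film(F, gamma, beta)       # every activation becomes 2*1 - 1 = 1
```

Note that the modulation is spatially uniform: a given $(\gamma_{i,c}, \beta_{i,c})$ pair affects every location of feature map $c$ identically, which is what makes FiLM so cheap relative to full attention.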

2. Architecture and Implementation Details

In a typical vision-and-language application, the FiLM framework comprises two core components:

  • FiLM generator (linguistic pipeline): Word tokens are transformed into embeddings and processed by a GRU, yielding a question embedding $\mathbf{q}_i \in \mathbb{R}^{4096}$. Each FiLM layer $n$ has two per-layer, learned linear projections mapping $\mathbf{q}_i$ to $\bm\gamma_i^n \in \mathbb{R}^C$ and $\bm\beta_i^n \in \mathbb{R}^C$, where $C$ is the number of feature maps:

$$\bm\gamma_i^n = W_\gamma^n\,\mathbf{q}_i + b_\gamma^n, \quad \bm\beta_i^n = W_\beta^n\,\mathbf{q}_i + b_\beta^n$$

  • FiLM-ed network (visual pipeline): The image is processed by a CNN, such as four layers of $4 \times 4$ convolutions or a ResNet backbone, producing 128 feature maps of size $14 \times 14$. These features pass through four or more residual blocks, each containing:
    1. $1 \times 1$ convolution → ReLU → $3 \times 3$ convolution → (optional BatchNorm) → FiLM → ReLU → residual addition.
    2. $(x, y)$ coordinate feature maps, appended to facilitate spatial reasoning.

The final head consists of a $1 \times 1$ convolution to 512 channels, global max pooling, and a two-layer MLP ending in a softmax over answers.
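A minimal NumPy sketch of the two pipelines meeting inside one residual block follows. The shapes (4096-dimensional question embedding, $C = 128$ feature maps) follow the text above, while the random projection weights and the identity stand-in for the convolutions are placeholders for learned layers, not the actual trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
C, Q = 128, 4096                  # feature maps, question-embedding size

# FiLM generator: per-layer learned linear projections of the question
# embedding (random weights here as placeholders).
W_gamma, b_gamma = rng.normal(0, 0.01, (C, Q)), np.ones(C)
W_beta,  b_beta  = rng.normal(0, 0.01, (C, Q)), np.zeros(C)

def film_generator(q):
    return W_gamma @ q + b_gamma, W_beta @ q + b_beta

def conv_placeholder(x):
    # Stand-in for the 1x1 / 3x3 convolutions (identity, shape-preserving).
    return x

def film_res_block(F, gamma, beta):
    h = np.maximum(conv_placeholder(F), 0)               # 1x1 conv -> ReLU
    h = conv_placeholder(h)                              # 3x3 conv
    h = gamma[:, None, None] * h + beta[:, None, None]   # FiLM
    h = np.maximum(h, 0)                                 # ReLU
    return F + h                                         # residual addition

q = rng.normal(size=Q)            # question embedding (from the GRU)
F = rng.normal(size=(C, 14, 14))  # visual feature maps (from the CNN)
gamma, beta = film_generator(q)
F_out = film_res_block(F, gamma, beta)
```

Because the FiLM-ed branch ends in a ReLU before the residual addition, each block can only add non-negative, question-dependent evidence on top of the incoming features.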

FiLM layers can be placed after the normalization affine transform or elsewhere within the residual block; empirical results indicate performance is robust to the exact insertion point.

3. Empirical Performance and Ablation Results

FiLM layers were evaluated on the CLEVR visual reasoning benchmark (700K questions, 96K images). Direct comparison highlights:

Model                  Accuracy (%)   Error (%)
CNN+LSTM+RN            95.5           4.5
FiLM (raw pixels)      97.7           2.3
FiLM (ResNet feats)    97.6           2.4

FiLM approximately halves the state-of-the-art error rate. Performance breakdown by question type demonstrates superiority across diverse reasoning demands, such as counting, attribute comparison, and existential queries. Key ablations reveal:

  • Using only $\gamma$ ($\beta = 0$) yields 96.9% accuracy; using only $\beta$ ($\gamma = 1$) yields 95.9%. Scaling is thus more critical than shifting.
  • Constraining $\gamma$ to $(0,1)$ or $(-1,1)$ degrades accuracy to approximately 96.3%; flexibility in scaling (including negative and larger values) is essential.
  • Removing all FiLM layers decreases accuracy to the random baseline (21.4%), showing that at least one FiLM layer is vital.
  • The number of FiLM-ed residual blocks influences accuracy: 1 block achieves 93.5%, 2 blocks 97.1%, and 4 blocks 97.4% ± 0.4%, with 6 blocks reaching 97.7%.

FiLM's insertion point within a residual block has minor effects on overall accuracy.

4. Feature Modulation and Qualitative Analysis

FiLM layers modulate activations in a manner that reflects the structure of visual reasoning:

  • Visualizations at the network's global max pooling locations indicate that FiLM-modulated features are highly focused on image regions relevant to the queried object or answer, effectively imparting implicit spatial attention.
  • Examining the same feature map before and after FiLM reveals that answering attribute-specific questions (e.g., color) selectively activates regions corresponding to the sought attribute, while leaving activations unchanged for unrelated queries.
  • t-SNE analysis of $(\gamma, \beta)$ pairs across layers shows early FiLM layers cluster parameters by low-level functions (such as color or shape queries), while later layers correlate with higher-level reasoning functions (such as comparing numbers or materials). This separation demonstrates that FiLM supports emergent functional modularity conditioned on task requirements.

5. Generalization and Zero-shot/Few-shot Capabilities

FiLM exhibits robust generalization to novel linguistic and compositional inputs:

  • On CLEVR-Humans, with no fine-tuning FiLM attains 56.6% (versus 54.0% for PG+EE). Fine-tuning only the FiLM generator on 18K examples increases this to 75.9% (versus 66.6% for PG+EE), indicating greater efficiency in adapting to new concepts and vocabulary.
  • On CLEVR-CoGenT (compositional split), FiLM yields 98.3% on Condition A, 75.6% on zero-shot Condition B, and, after fine-tuning on 30K examples from B, achieves 96.9%. FiLM requires approximately one-third as much fine-tuning data as previous state-of-the-art models to reach competitive performance—though catastrophic forgetting is still observed post-adaptation.
  • Zero-shot capability is demonstrated by FiLM-parameter analogies, inspired by vector arithmetic in word embeddings. For example:

$$\gamma_{\text{cyan cube}} = \gamma_{\text{cyan sphere}} + \gamma_{\text{brown cube}} - \gamma_{\text{brown sphere}}$$

(and similarly for $\beta$), enabling correct answers to previously unseen “cyan cube” queries. This procedure improves accuracy on such queries from 71.5% to 80.7%, demonstrating a concrete method for achieving zero-shot visual reasoning by exploiting the structure of FiLM parameter space.
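The analogy can be computed directly on stored FiLM parameter vectors. The sketch below uses short, made-up vectors purely for illustration; in practice the vectors come from the trained FiLM generator evaluated on the corresponding questions:

```python
import numpy as np

# Hypothetical stored FiLM scaling vectors for known attribute combinations
# (illustrative 3-dimensional values; real vectors are C-dimensional).
params = {
    "cyan sphere":  np.array([1.2, 0.3, -0.5]),
    "brown cube":   np.array([0.8, 1.1,  0.4]),
    "brown sphere": np.array([0.9, 0.2, -0.6]),
}

# Vector arithmetic yields parameters for the unseen "cyan cube" query:
# keep the "cube" structure, swap "brown" for "cyan".
gamma_cyan_cube = (params["cyan sphere"]
                   + params["brown cube"]
                   - params["brown sphere"])
```

The same arithmetic is applied to the $\beta$ vectors, and the resulting parameters are simply plugged into the FiLM-ed network at inference time.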

6. Algorithmic Description

The FiLM mechanism is operationalized as follows:

q_i = GRU(x_i)                  # final hidden state

for n in 1..N_layers:
    gamma_i[n], beta_i[n] = LinearProj_n(q_i)

F = CNN(I_i)                    # e.g. 128 × 14 × 14 feature maps

for n in 1..N_layers:
    # a) Residual branch
    h = Conv1x1(F)
    h = ReLU(h)
    h = Conv3x3(h)
    # (optional BatchNorm here)
    # b) FiLM modulation
    h = gamma_i[n] * h + beta_i[n]
    h = ReLU(h)
    # c) Residual add
    F = F + h

out = Conv1x1_to_512(F)
pooled = GlobalMaxPool(out)
answer_logits = MLP(pooled)
probabilities = softmax(answer_logits)

Training employs end-to-end cross-entropy loss on ground-truth answers, stochastic optimization with Adam (learning rate $3 \times 10^{-4}$), weight decay $1 \times 10^{-5}$, and early stopping based on validation accuracy (Perez et al., 2017).
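The cross-entropy objective on the answer softmax can be sketched in a few lines of NumPy (a minimal, numerically stabilized version; the optimizer and weight-decay settings above would be handled by the training framework):

```python
import numpy as np

def cross_entropy(logits, target_idx):
    # Numerically stabilized softmax cross-entropy on answer logits:
    # subtract the max before exponentiating to avoid overflow.
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target_idx]

# Example: three candidate answers, ground truth is answer 0.
loss = cross_entropy(np.array([2.0, 0.5, -1.0]), target_idx=0)
```

The loss is low when the logit of the ground-truth answer dominates and grows as probability mass shifts to wrong answers, which is what drives both the FiLM generator and the FiLM-ed network during end-to-end training.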

7. Significance and Context

FiLM layers represent a general and highly effective method for conditional feature modulation in neural architectures, particularly in domains requiring intricate information transfer between heterogeneous modalities. FiLM achieves multi-step, high-level visual reasoning without the need for explicit, hand-crafted modules, exhibiting strong performance, modularity, and sample efficiency. Its ability to support zero-shot reasoning through parameter arithmetic underscores the learned structure's compositionality and flexibility. These results position FiLM as a foundational approach for modeling relational and compositional structure in vision-and-language tasks (Perez et al., 2017).
