FiLM Layer: Conditional Feature Modulation

Updated 8 February 2026
  • FiLM layers are conditional modulation units that apply per-feature affine transformations to neural activations based on external inputs.
  • They integrate linguistic embeddings with visual features in convolutional networks, enabling adaptive, multi-step reasoning.
  • Empirical results on CLEVR benchmarks show that FiLM significantly improves accuracy and supports zero-shot learning through parameter arithmetic.

FiLM (Feature-wise Linear Modulation) is a general-purpose conditioning layer for neural networks. It integrates external information by applying simple, per-feature affine transformations to activations within a network. FiLM adaptively modulates intermediate feature maps conditioned on an auxiliary input, such as a linguistic embedding, which supports complex visual reasoning tasks. In vision-and-language models, FiLM layers allow a question embedding to influence the computation of visual feature maps, removing the need for hand-crafted reasoning modules and supporting multi-step and high-level compositional reasoning (Perez et al., 2017).

1. Definition and Mathematical Formulation

A FiLM layer operates by allowing a “FiLM generator” (commonly a subnetwork that processes conditioning information) to modulate a “FiLM-ed network” through feature-wise affine transformations. Given the activation $\mathbf{F}_{i,c}$ for the $c$th feature map and $i$th sample, FiLM applies:

$$\mathrm{FiLM}\bigl(\mathbf{F}_{i,c}\mid \gamma_{i,c}, \beta_{i,c}\bigr) = \gamma_{i,c} \mathbf{F}_{i,c} + \beta_{i,c}$$

Here, $\gamma_{i,c}$ (scaling) and $\beta_{i,c}$ (shifting) are modulating parameters, generated as functions of the conditioning input $\mathbf{x}_i$:

$$\gamma_{i,c} = f_c(\mathbf{x}_i), \quad \beta_{i,c} = h_c(\mathbf{x}_i)$$

A single FiLM generator network typically produces the full vectors $\bm\gamma_i$ and $\bm\beta_i$ for all features, allowing downstream vision network layers to be dynamically adjusted based on higher-level context.
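The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming activations are stored as an array of shape (N, C, H, W) and the modulation parameters as arrays of shape (N, C); the function name `film` and the toy shapes are choices made here, not part of the original description.

```python
import numpy as np

def film(features, gamma, beta):
    """Feature-wise affine modulation.

    features: (N, C, H, W) activations; gamma, beta: (N, C) parameters.
    """
    # Each (sample, channel) scalar is broadcast over the spatial dimensions.
    return gamma[:, :, None, None] * features + beta[:, :, None, None]

rng = np.random.default_rng(0)
F = rng.standard_normal((2, 3, 4, 4))   # toy activations (assumed shape)
gamma = rng.standard_normal((2, 3))     # scaling, one value per sample and feature map
beta = rng.standard_normal((2, 3))      # shifting, one value per sample and feature map
out = film(F, gamma, beta)
```

Setting `gamma` to ones and `beta` to zeros leaves the activations unchanged, which makes the identity case easy to check.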

2. Architecture and Implementation Details

In a typical vision-and-language application, the FiLM framework comprises two core components:

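  • FiLM generator (linguistic pipeline): Word tokens are transformed into embeddings and processed by a GRU, resulting in a question embedding $\mathbf{q}_i$. Each FiLM layer $\ell$ receives two per-layer, learned linear projections mapping $\mathbf{q}_i$ to $\bm\gamma_i^{\ell} \in \mathbb{R}^{C}$ and $\bm\beta_i^{\ell} \in \mathbb{R}^{C}$, where $C$ is the number of feature maps:

$$\bm\gamma_i^{\ell} = \mathbf{W}_{\gamma}^{\ell}\,\mathbf{q}_i, \quad \bm\beta_i^{\ell} = \mathbf{W}_{\beta}^{\ell}\,\mathbf{q}_i$$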

  • FiLM-ed network (visual pipeline): The image is processed by a CNN, such as four layers of $4 \times 4$ convolutions or a ResNet backbone, producing 128 feature maps of size $14 \times 14$. These features pass through four or more residual blocks, each containing:
    1. $1 \times 1$ convolution → ReLU → $3 \times 3$ convolution → (optional BatchNorm) → FiLM → ReLU → residual addition.
    2. (x, y) coordinate feature maps appended to facilitate spatial reasoning.

The final head consists of a $1 \times 1$ convolution to 512 channels, global max pooling, and a two-layer MLP ending in a softmax over answers.
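The generator side of the two components above can be sketched as follows. This is a hedged illustration only: the class name `FiLMGenerator`, the random initialization, the bias-free projections, and the embedding size of 16 are assumptions, and a random vector stands in for the GRU output.

```python
import numpy as np

class FiLMGenerator:
    """Maps a question embedding to (gamma, beta) for every FiLM layer."""

    def __init__(self, embed_dim, num_layers, num_features, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(embed_dim)
        # Two projection matrices per FiLM layer: one for gamma, one for beta.
        self.W_gamma = rng.standard_normal((num_layers, num_features, embed_dim)) * scale
        self.W_beta = rng.standard_normal((num_layers, num_features, embed_dim)) * scale

    def __call__(self, q):
        # q: (N, D) question embeddings -> gamma, beta: (L, N, C)
        gamma = np.einsum("lcd,nd->lnc", self.W_gamma, q)
        beta = np.einsum("lcd,nd->lnc", self.W_beta, q)
        return gamma, beta

rng = np.random.default_rng(1)
q = rng.standard_normal((2, 16))   # stand-in for the GRU question embedding
generator = FiLMGenerator(embed_dim=16, num_layers=4, num_features=128)
gammas, betas = generator(q)       # gammas[l] modulates the l-th FiLM layer
```

Indexing the first axis of `gammas` and `betas` yields the (N, C) parameters consumed by one FiLM layer.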

FiLM layers can be placed after the normalization affine transform or elsewhere within residual blocks; empirical results indicate that accuracy is robust to this choice.
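One possible arrangement of a FiLM-ed residual block, following the ordering listed above, is sketched below in NumPy. Several details are assumptions made for this sketch rather than facts from the text: the skip connection is taken after the first ReLU, coordinates are scaled to [-1, 1], the normalization has no learned affine parameters, and all helper names are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def append_coords(x):
    # Append x and y coordinate maps (assumed range [-1, 1]) as two extra channels.
    n, _, h, w = x.shape
    ys = np.linspace(-1.0, 1.0, h)[None, None, :, None] * np.ones((n, 1, h, w))
    xs = np.linspace(-1.0, 1.0, w)[None, None, None, :] * np.ones((n, 1, h, w))
    return np.concatenate([x, xs, ys], axis=1)

def conv1x1(x, w):
    # x: (N, C_in, H, W); w: (C_out, C_in)
    return np.einsum("oc,nchw->nohw", w, x)

def conv3x3(x, w):
    # x: (N, C_in, H, W); w: (C_out, C_in, 3, 3); zero padding keeps H and W.
    n, _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (0, 0), (1, 1), (1, 1)))
    out = np.zeros((n, w.shape[0], h, wd))
    for dy in range(3):
        for dx in range(3):
            patch = xp[:, :, dy:dy + h, dx:dx + wd]
            out += np.einsum("oc,nchw->nohw", w[:, :, dy, dx], patch)
    return out

def batchnorm_no_affine(x, eps=1e-5):
    # Normalize each channel over batch and spatial dims; FiLM supplies the affine part.
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def film_res_block(x, w1, w3, gamma, beta, use_bn=True):
    h = relu(conv1x1(append_coords(x), w1))
    skip = h                      # assumption: residual branch starts here
    h = conv3x3(h, w3)
    if use_bn:
        h = batchnorm_no_affine(h)
    h = gamma[:, :, None, None] * h + beta[:, :, None, None]   # FiLM
    return skip + relu(h)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8, 5, 5))
w1 = rng.standard_normal((8, 10)) * 0.1        # 8 input channels + 2 coordinate maps
w3 = rng.standard_normal((8, 8, 3, 3)) * 0.1
gamma = rng.standard_normal((2, 8))
beta = rng.standard_normal((2, 8))
out = film_res_block(x, w1, w3, gamma, beta)
```

Moving the FiLM line before or after the normalization step is a one-line change here, which is the kind of placement variation the paragraph above refers to.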

3. Empirical Performance and Ablation Results

FiLM layers were evaluated on the CLEVR visual reasoning benchmark (700K questions, 96K images). Direct comparison highlights:

| Model | Accuracy (%) | Error (%) |
|---|---|---|
| CNN+LSTM+RN | 95.5 | 4.5 |
| FiLM (raw pixels) | 97.7 | 2.3 |
| FiLM (ResNet feats) | 97.6 | 2.4 |
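The error column can be turned into a relative error reduction with a short calculation. The snippet below uses only the figures listed in the table; the helper name `relative_reduction` is introduced here for illustration.

```python
def relative_reduction(baseline_error, new_error):
    """Percentage by which new_error is lower than baseline_error."""
    return 100.0 * (baseline_error - new_error) / baseline_error

# Error rates (%) taken from the table above.
errors = {
    "CNN+LSTM+RN": 4.5,
    "FiLM (raw pixels)": 2.3,
    "FiLM (ResNet feats)": 2.4,
}
baseline = errors["CNN+LSTM+RN"]
reductions = {name: relative_reduction(baseline, err) for name, err in errors.items()}
for name, value in reductions.items():
    print(f"{name}: {value:.1f}% lower error than CNN+LSTM+RN")
```

Both FiLM variants come out at a little under a 50% relative reduction in error against the CNN+LSTM+RN row.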

FiLM approximately halves the previous state-of-the-art error rate, from 4.5% to 2.3%. The breakdown by question type shows gains across diverse reasoning demands, such as counting, attribute comparison, and existential queries. Key ablations reveal:

  • Using only $\gamma$ ($\beta = 0$) yields 96.9% accuracy; only $\beta$ ($\gamma = 1$) yields 95.9%. Scaling is more critical than shifting.
  • Constraining $\gamma$ to (0,1) or (−1,1) degrades accuracy to approximately 96.3%; thus, flexibility in scaling (including negative and larger values) is essential.
  • Removing all FiLM layers decreases accuracy to the random baseline (21.4%), showing at least one FiLM layer is vital.
  • The number of FiLM-ed residual blocks influences accuracy: 1 block achieves 93.5%, 2 blocks 97.1%, and 4 blocks 97.4% ± 0.4%, with 6 blocks reaching 97.7%.

FiLM's insertion point within a residual block has minor effects on overall accuracy.
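The scaling and shifting ablations listed above can be expressed as small variations of the FiLM operation. The sketch below is illustrative: using a sigmoid for the (0,1) constraint and tanh for the (−1,1) constraint is an assumption about how such constraints could be imposed, and the names `GAMMA_VARIANTS` and `film_ablated` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Ways of restricting the scaling parameter (assumed implementations).
GAMMA_VARIANTS = {
    "full": lambda g: g,                            # unconstrained scaling
    "gamma_fixed_to_1": lambda g: np.ones_like(g),  # shifting only
    "sigmoid": sigmoid,                             # values in (0, 1)
    "tanh": np.tanh,                                # values in (-1, 1)
}

def film_ablated(features, gamma, beta, gamma_mode="full", use_beta=True):
    """FiLM with optional restrictions on gamma and beta."""
    g = GAMMA_VARIANTS[gamma_mode](gamma)
    b = beta if use_beta else np.zeros_like(beta)   # beta = 0 gives scaling only
    return g[:, :, None, None] * features + b[:, :, None, None]

features = np.ones((1, 2, 2, 2))
gamma = np.array([[-3.0, 3.0]])
beta = np.array([[0.5, -0.5]])
scaling_only = film_ablated(features, gamma, beta, use_beta=False)
shifting_only = film_ablated(features, gamma, beta, gamma_mode="gamma_fixed_to_1")
```

The constrained variants cannot produce negative or large scaling values, which is the flexibility the ablation above identifies as important.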

4. Feature Modulation and Qualitative Analysis

FiLM layers modulate activations in a manner that reflects the structure of visual reasoning:

  • Visualizations at the network's global max pooling locations indicate that FiLM-modulated features are highly focused on image regions relevant to the queried object or answer, effectively imparting implicit spatial attention.
  • Examining the same feature map before and after FiLM reveals that answering attribute-specific questions (e.g., color) selectively activates regions corresponding to the sought attribute, while leaving activations unchanged for unrelated queries.
  • t-SNE analysis of $(\gamma, \beta)$ pairs across layers shows early FiLM layers cluster parameters by low-level functions (such as color or shape queries), while later layers correlate with higher-level reasoning functions (such as comparing numbers or materials). This separation demonstrates that FiLM supports emergent functional modularity conditioned on task requirements.
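The first kind of visualization mentioned above, locating where globally max-pooled features originate, can be approximated with a short NumPy routine. This is a sketch under assumptions: features for one image are stored as a (C, H, W) array, and the function names are introduced here for illustration.

```python
import numpy as np

def max_pool_locations(features):
    """features: (C, H, W). Returns pooled values (C,) and argmax (row, col) pairs (C, 2)."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    idx = flat.argmax(axis=1)
    rows, cols = np.unravel_index(idx, (h, w))
    return flat.max(axis=1), np.stack([rows, cols], axis=1)

def location_histogram(features):
    """Count how many channels take their global maximum at each spatial position."""
    _, locs = max_pool_locations(features)
    hist = np.zeros(features.shape[1:], dtype=int)
    np.add.at(hist, (locs[:, 0], locs[:, 1]), 1)   # handles repeated positions
    return hist

# Toy feature maps: two channels peak at (1, 2), one at (3, 0).
features = np.zeros((3, 4, 4))
features[0, 1, 2] = 5.0
features[1, 3, 0] = 2.0
features[2, 1, 2] = 7.0
pooled, locations = max_pool_locations(features)
hist = location_histogram(features)
```

Overlaying such a histogram on the input image gives a rough picture of which regions contribute to the pooled representation.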

5. Generalization and Zero-shot/Few-shot Capabilities

FiLM exhibits robust generalization to novel linguistic and compositional inputs:

  • On CLEVR-Humans, with no fine-tuning FiLM attains 56.6% (versus 54.0% for PG+EE). Fine-tuning only the FiLM generator on 18K examples increases this to 75.9% (versus 66.6% for PG+EE), indicating greater efficiency in adapting to new concepts and vocabulary.
  • On CLEVR-CoGenT (compositional split), FiLM yields 98.3% on Condition A, 75.6% on zero-shot Condition B, and, after fine-tuning on 30K examples from B, achieves 96.9%. FiLM requires approximately one-third as much fine-tuning data as previous state-of-the-art models to reach competitive performance—though catastrophic forgetting is still observed post-adaptation.
  • Zero-shot capability is demonstrated by FiLM-parameter analogies, inspired by vector arithmetic in word embeddings. For example:

$$\bm\gamma(\text{“cyan cube”}) \approx \bm\gamma(\text{“cyan sphere”}) + \bm\gamma(\text{“brown cube”}) - \bm\gamma(\text{“brown sphere”})$$

(and similarly for $\bm\beta$), enabling correct answers to previously unseen “cyan cube” queries. This procedure improves accuracy on such queries from 71.5% to 80.7%, demonstrating a concrete method for achieving zero-shot visual reasoning by exploiting the structure of FiLM parameter space.
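The parameter analogy can be illustrated with a toy example. Everything below is a hypothetical stand-in: the "generator" is a purely additive toy in which parameters are the sum of a color vector and a shape vector, so the analogy holds exactly, whereas a trained FiLM generator would satisfy it only approximately. The phrases and the function names are chosen for illustration.

```python
import numpy as np

def film_analogy(a, b, c):
    """Return a + b - c, applied separately to gamma and beta.

    Each argument is a (gamma, beta) pair of equal-shaped arrays.
    """
    return a[0] + b[0] - c[0], a[1] + b[1] - c[1]

# Toy, purely additive parameter space (an assumption, not a trained model).
rng = np.random.default_rng(0)
color = {name: rng.standard_normal((2, 8)) for name in ("cyan", "brown")}
shape = {name: rng.standard_normal((2, 8)) for name in ("cube", "sphere")}

def toy_params(color_name, shape_name):
    # Row 0 plays the role of gamma, row 1 the role of beta.
    v = color[color_name] + shape[shape_name]
    return v[0], v[1]

# Estimate parameters for the unseen combination from three seen ones.
gamma_hat, beta_hat = film_analogy(
    toy_params("cyan", "sphere"),
    toy_params("brown", "cube"),
    toy_params("brown", "sphere"),
)
gamma_true, beta_true = toy_params("cyan", "cube")
```

In this toy setting `gamma_hat` and `beta_hat` match the held-out parameters exactly; the reported accuracy gain suggests real FiLM parameters are structured enough for the same arithmetic to help.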

6. Algorithmic Description

The FiLM mechanism is operationalized as follows:

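  1. Embed the question words and encode them with the GRU to obtain the question embedding $\mathbf{q}_i$.
  2. For each FiLM layer, compute $\bm\gamma_i$ and $\bm\beta_i$ from $\mathbf{q}_i$ with that layer's two learned linear projections.
  3. Extract visual feature maps from the image with the CNN.
  4. In each residual block, apply $\mathrm{FiLM}\bigl(\mathbf{F}_{i,c}\mid \gamma_{i,c}, \beta_{i,c}\bigr) = \gamma_{i,c} \mathbf{F}_{i,c} + \beta_{i,c}$ to every feature map $c$.
  5. Pass the final feature maps through the classifier head to obtain a softmax distribution over answers.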

Training employs end-to-end cross-entropy loss on ground-truth answers, stochastic optimization with Adam (learning rate $3 \times 10^{-4}$), weight decay $10^{-5}$, and early stopping based on validation accuracy (Perez et al., 2017).
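The loss and optimizer named above can be sketched in NumPy as follows. This is a generic illustration, not the authors' training code: treating weight decay as an L2 term added to the gradient is an assumption, the Adam constants are the commonly used defaults, and the hyperparameter values in the usage lines are illustrative.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy; logits: (N, K), labels: (N,) integer answer indices."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def adam_step(param, grad, state, lr, weight_decay, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update; weight decay is folded into the gradient (assumption)."""
    grad = grad + weight_decay * param
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * grad
    state["v"] = b2 * state["v"] + (1 - b2) * grad ** 2
    m_hat = state["m"] / (1 - b1 ** state["t"])             # bias correction
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return param - lr * m_hat / (np.sqrt(v_hat) + eps)

# Uniform logits over 4 answers give a loss of log(4).
loss = softmax_cross_entropy(np.zeros((3, 4)), np.array([0, 1, 2]))

# A single update on a toy parameter with illustrative hyperparameters.
param = np.ones(5)
state = {"t": 0, "m": np.zeros(5), "v": np.zeros(5)}
new_param = adam_step(param, np.full(5, 0.5), state, lr=3e-4, weight_decay=1e-5)
```

Early stopping would wrap such updates in a loop that tracks validation accuracy and keeps the best-performing parameters.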

7. Significance and Context

FiLM layers represent a general and highly effective method for conditional feature modulation in neural architectures, particularly in domains requiring intricate information transfer between heterogeneous modalities. FiLM achieves multi-step, high-level visual reasoning without the need for explicit, hand-crafted modules, exhibiting strong performance, modularity, and sample efficiency. Its ability to support zero-shot reasoning through parameter arithmetic underscores the learned structure's compositionality and flexibility. These results position FiLM as a foundational approach for modeling relational and compositional structure in vision-and-language tasks (Perez et al., 2017).
