FiLM Layer: Conditional Feature Modulation
- FiLM layers are conditional modulation units that apply per-feature affine transformations to neural activations based on external inputs.
- They integrate linguistic embeddings with visual features in convolutional networks, enabling adaptive, multi-step reasoning.
- Empirical results on CLEVR benchmarks show that FiLM significantly improves accuracy and supports zero-shot learning through parameter arithmetic.
FiLM (Feature-wise Linear Modulation) is a general-purpose conditioning layer for neural networks, introduced to facilitate the integration of external information by applying simple, per-feature affine transformations to activations within a network. FiLM enables adaptive modulation of intermediate feature maps conditioned on auxiliary input, such as a linguistic embedding, supporting complex visual reasoning tasks. In the context of vision-and-language models, FiLM layers allow a question embedding to influence the computation of visual feature maps, removing the need for hand-crafted reasoning modules and supporting multi-step, high-level compositional reasoning (Perez et al., 2017).
1. Definition and Mathematical Formulation
A FiLM layer operates by allowing a “FiLM generator” (commonly a subnetwork that processes conditioning information) to modulate a “FiLM-ed network” through feature-wise affine transformations. Given the activation $F_{i,c}$ for the $c$-th feature map and $i$-th sample, FiLM applies:

$$\mathrm{FiLM}(F_{i,c} \mid \gamma_{i,c}, \beta_{i,c}) = \gamma_{i,c} \, F_{i,c} + \beta_{i,c}$$

Here, $\gamma_{i,c}$ (scaling) and $\beta_{i,c}$ (shifting) are modulating parameters, generated as functions of the conditioning input $x_i$:

$$\gamma_{i,c} = f_c(x_i), \qquad \beta_{i,c} = h_c(x_i)$$

A single FiLM generator network typically produces the full vectors $\gamma_i$ and $\beta_i$ for all features, allowing downstream vision network layers to be dynamically adjusted based on higher-level context.
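The feature-wise affine transform above can be sketched in a few lines of NumPy. Shapes and names here are illustrative, not taken from the paper's code; the key point is that one scalar pair $(\gamma_c, \beta_c)$ is broadcast over all spatial positions of feature map $c$:

```python
import numpy as np

def film(F, gamma, beta):
    """Feature-wise linear modulation.

    F     : activations, shape (batch, C, H, W)
    gamma : per-feature scales, shape (batch, C)
    beta  : per-feature shifts, shape (batch, C)
    Each of the C feature maps is scaled and shifted by a single
    scalar pair, broadcast over the H x W spatial grid.
    """
    return gamma[:, :, None, None] * F + beta[:, :, None, None]

F = np.ones((2, 128, 14, 14))      # e.g. 128 feature maps of size 14x14
gamma = np.full((2, 128), 2.0)
beta = np.full((2, 128), -1.0)
out = film(F, gamma, beta)         # every activation becomes 2*1 - 1 = 1
```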
2. Architecture and Implementation Details
In a typical vision-and-language application, the FiLM framework comprises two core components:
- FiLM generator (linguistic pipeline): Word tokens are transformed into embeddings and processed by a GRU, whose final hidden state serves as the question embedding $q$. Each FiLM layer $n$ has two per-layer, learned linear projections mapping $q$ to $\gamma^{(n)}$ and $\beta^{(n)} \in \mathbb{R}^{C}$, where $C$ is the number of feature maps:
  $$\gamma^{(n)} = W_\gamma^{(n)} q + b_\gamma^{(n)}, \qquad \beta^{(n)} = W_\beta^{(n)} q + b_\beta^{(n)}$$
- FiLM-ed network (visual pipeline): The image is processed by a CNN, such as four layers of convolutions or a ResNet backbone, producing 128 feature maps of size 14×14. These features pass through four or more residual blocks, each containing a 1×1 convolution, a ReLU, a 3×3 convolution with normalization, FiLM modulation, a second ReLU, and a residual connection.
FiLM layers can be placed after the normalization affine transform or elsewhere within residual blocks; empirical results indicate accuracy is robust to this choice of placement.
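Putting the two pipelines together, one FiLM-ed residual block can be sketched in pure NumPy. Convolutions are replaced by 1×1 channel mixes so the data flow stays visible; all dimensions, weight names, and initializations are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
C, d = 128, 4096                       # feature maps per block, GRU state size

# FiLM generator: per-layer linear projections from the question embedding q.
W_gamma, b_gamma = rng.normal(size=(C, d)) * 0.01, np.ones(C)
W_beta,  b_beta  = rng.normal(size=(C, d)) * 0.01, np.zeros(C)

def film_params(q):
    return W_gamma @ q + b_gamma, W_beta @ q + b_beta

def conv1x1(F, W):
    # A 1x1 convolution is a per-pixel linear map over channels.
    return np.einsum('oc,chw->ohw', W, F)

W1 = rng.normal(size=(C, C)) * 0.01
W3 = rng.normal(size=(C, C)) * 0.01    # stand-in for the 3x3 convolution

def filmed_resblock(F, q):
    gamma, beta = film_params(q)
    h = np.maximum(conv1x1(F, W1), 0)                   # conv -> ReLU
    h = conv1x1(h, W3)                                  # (3x3 conv in the paper)
    h = gamma[:, None, None] * h + beta[:, None, None]  # FiLM modulation
    h = np.maximum(h, 0)                                # ReLU
    return F + h                                        # residual connection

F = rng.normal(size=(C, 14, 14))       # 128 feature maps of size 14x14
q = rng.normal(size=d)                 # question embedding
out = filmed_resblock(F, q)
```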
3. Empirical Performance and Ablation Results
FiLM layers were evaluated on the CLEVR visual reasoning benchmark (700K questions, 96K images). Direct comparison highlights:
| Model | Accuracy (%) | Error (%) |
|---|---|---|
| CNN+LSTM+RN | 95.5 | 4.5 |
| FiLM (raw pixels) | 97.7 | 2.3 |
| FiLM (ResNet feats) | 97.6 | 2.4 |
FiLM approximately halves the state-of-the-art error rate. Performance breakdown by question type demonstrates superiority across diverse reasoning demands, such as counting, attribute comparison, and existential queries. Key ablations reveal:
- Using only scaling (setting $\beta := 0$) yields 96.9% accuracy; using only shifting (setting $\gamma := 1$) yields 95.9%. Scaling is thus more critical than shifting.
- Constraining $\gamma$ to $(0,1)$ or $(-1,1)$ degrades accuracy to approximately 96.3%; thus, flexibility in scaling (including negative and larger values) is essential.
- Removing all FiLM layers decreases accuracy to the random baseline (21.4%), showing at least one FiLM layer is vital.
- The number of FiLM-ed residual blocks influences accuracy: 1 block achieves 93.5%, 2 blocks 97.1%, and 4 blocks 97.4% ± 0.4%, with 6 blocks reaching 97.7%.
FiLM's insertion point within a residual block has minor effects on overall accuracy.
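The ablated variants above amount to small changes in how the scaling parameters are produced or used. A hedged NumPy sketch of the three settings, on toy feature maps (the accuracy figures in the comments refer to the ablation results quoted above, not to anything this snippet computes):

```python
import numpy as np

def film(F, gamma, beta):
    # Per-feature-map affine modulation, broadcast over the H x W grid.
    return gamma[:, None, None] * F + beta[:, None, None]

F = np.random.default_rng(1).normal(size=(8, 14, 14))   # 8 toy feature maps
gamma_raw = np.linspace(-2.0, 3.0, 8)   # unconstrained generator output
beta_raw = np.linspace(-1.0, 1.0, 8)

full = film(F, gamma_raw, beta_raw)            # standard FiLM
scale_only = film(F, gamma_raw, np.zeros(8))   # beta := 0  (96.9% in the text)
shift_only = film(F, np.ones(8), beta_raw)     # gamma := 1 (95.9% in the text)

# Squashing gamma into (0, 1) with a sigmoid (or (-1, 1) with tanh)
# forbids negative and large scales -- ~96.3% in the ablation above.
constrained = film(F, 1.0 / (1.0 + np.exp(-gamma_raw)), beta_raw)
```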
4. Feature Modulation and Qualitative Analysis
FiLM layers modulate activations in a manner that reflects the structure of visual reasoning:
- Visualizations at the network's global max pooling locations indicate that FiLM-modulated features are highly focused on image regions relevant to the queried object or answer, effectively imparting implicit spatial attention.
- Examining the same feature map before and after FiLM reveals that answering attribute-specific questions (e.g., color) selectively activates regions corresponding to the sought attribute, while leaving activations unchanged for unrelated queries.
- t-SNE analysis of $(\gamma, \beta)$ pairs across layers shows that early FiLM layers cluster parameters by low-level functions (such as color or shape queries), while later layers correlate with higher-level reasoning functions (such as comparing numbers or materials). This separation demonstrates that FiLM supports emergent functional modularity conditioned on task requirements.
5. Generalization and Zero-shot/Few-shot Capabilities
FiLM exhibits robust generalization to novel linguistic and compositional inputs:
- On CLEVR-Humans, with no fine-tuning FiLM attains 56.6% (versus 54.0% for PG+EE). Fine-tuning only the FiLM generator on 18K examples increases this to 75.9% (versus 66.6% for PG+EE), indicating greater efficiency in adapting to new concepts and vocabulary.
- On CLEVR-CoGenT (compositional split), FiLM yields 98.3% on Condition A, 75.6% on zero-shot Condition B, and, after fine-tuning on 30K examples from B, achieves 96.9%. FiLM requires approximately one-third as much fine-tuning data as previous state-of-the-art models to reach competitive performance—though catastrophic forgetting is still observed post-adaptation.
- Zero-shot capability is demonstrated by FiLM-parameter analogies, inspired by vector arithmetic in word embeddings. The parameters for an unseen attribute combination are composed from seen ones, for example (with an illustrative reference color and shape):
  $$\gamma(\text{cyan cube}) = \gamma(\text{cyan sphere}) + \gamma(\text{gray cube}) - \gamma(\text{gray sphere})$$
  (and similarly for $\beta$), enabling correct answers to previously unseen “cyan cube” queries. This procedure improves accuracy on such queries from 71.5% to 80.7%, demonstrating a concrete method for achieving zero-shot visual reasoning by exploiting the structure of FiLM parameter space.
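The analogy is plain vector arithmetic on the generated FiLM parameters. In the sketch below the vectors are random placeholders standing in for actual generator outputs, and the reference color/shape pair is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
C = 128
# FiLM scaling vectors produced by the generator for three seen phrasings
# (random placeholders here, not actual model outputs).
g_cyan_sphere = rng.normal(size=C)
g_gray_cube   = rng.normal(size=C)
g_gray_sphere = rng.normal(size=C)

# Analogy: swap the shape attribute while keeping the color attribute.
g_cyan_cube = g_cyan_sphere + g_gray_cube - g_gray_sphere
# The same arithmetic is applied to the beta vectors, and the resulting
# parameters are substituted for those of the unseen "cyan cube" query.
```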
6. Algorithmic Description
The FiLM mechanism is operationalized as follows:
```
# FiLM generator (linguistic pipeline)
q_i = GRU(x_i)                      # final hidden state
for n in 1..N_layers:
    gamma_i[n], beta_i[n] = LinearProj_n(q_i)

# FiLM-ed network (visual pipeline)
F = CNN(I_i)                        # e.g. 128 x 14 x 14 feature maps
for n in 1..N_layers:
    # a) Residual branch
    h = Conv1x1(F)
    h = ReLU(h)
    h = Conv3x3(h)
    # b) FiLM modulation
    h = gamma_i[n] * h + beta_i[n]
    h = ReLU(h)
    # c) Residual add
    F = F + h

# Classifier head
out = Conv1x1_to_512(F)
pooled = GlobalMaxPool(out)
answer_logits = MLP(pooled)
probabilities = softmax(answer_logits)
```
Training employs an end-to-end cross-entropy loss on ground-truth answers, stochastic optimization with Adam (with weight decay), and early stopping based on validation accuracy (Perez et al., 2017).
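The loss at the end of the pipeline is ordinary softmax cross-entropy over the answer logits. A minimal NumPy sketch of that computation (optimizer and weight-decay machinery omitted):

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy between softmax(logits) and integer labels.

    logits : (batch, n_answers), labels : (batch,)
    """
    z = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])
labels = np.array([0, 2])
loss = softmax_cross_entropy(logits, labels)
```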
7. Significance and Context
FiLM layers represent a general and highly effective method for conditional feature modulation in neural architectures, particularly in domains requiring intricate information transfer between heterogeneous modalities. FiLM achieves multi-step, high-level visual reasoning without the need for explicit, hand-crafted modules, exhibiting strong performance, modularity, and sample efficiency. Its ability to support zero-shot reasoning through parameter arithmetic underscores the learned structure's compositionality and flexibility. These results position FiLM as a foundational approach for modeling relational and compositional structure in vision-and-language tasks (Perez et al., 2017).