1D U-Net with FiLM Conditioning
- 1D U-Net with FiLM is a convolutional encoder-decoder architecture that integrates per-sample conditioning via affine modulation of feature maps.
- The model applies Feature-wise Linear Modulation via an MLP on decoder blocks, adapting per-channel scales and biases after skip fusion to improve performance in low-data regimes.
- Empirical studies demonstrate enhanced segmentation metrics, such as improved Dice scores, and effective source separation performance with minimal extra complexity.
A 1D U-Net architecture with Feature-wise Linear Modulation (FiLM) is a variant of the U-Net model for signal processing that enables per-sample conditioning using external vectors—typically metadata or control signals—by modulating feature maps via affine transformations at selected stages of the network. Originally studied in 2D contexts for image segmentation and spectrogram-based audio source separation, the application to 1D U-Nets leverages the same architectural principles using 1D convolutions, batch normalization layers, and appropriate gating of feature flow. Modulation is achieved by generating per-channel scaling and bias parameters via a small multilayer perceptron (MLP) conditioned on the external data; these parameters are then broadcast and applied to the intermediate activations. This methodology is particularly advantageous in settings with limited training data, where it demonstrably improves generalization by adapting network responses based on auxiliary information (Jacenków et al., 2019, Meseguer-Brocal et al., 2019).
1. Mathematical Formulation of Feature-wise Linear Modulation
Let $x_{b,l,c}$ denote the activation for batch index $b$, sequence position (length index) $l$, and channel $c$ in a 1D U-Net. Let $z \in \mathbb{R}^d$ represent the conditioning vector (e.g., metadata, control signals). FiLM computes scaling and bias per channel by applying a small MLP to $z$:

$$\mathrm{FiLM}(x_{b,l,c}) = \gamma_c(z)\, x_{b,l,c} + \beta_c(z),$$

where $\gamma(z), \beta(z) \in \mathbb{R}^C$. For 1D signals, the $\gamma$ and $\beta$ vectors are reshaped to $(B, C, 1)$ and broadcast along the sequence dimension ($L$).
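This affine modulation can be sketched in NumPy (a minimal illustration; the `film` helper name and its shapes are assumptions, not code from the cited papers):

```python
import numpy as np

def film(x, gamma, beta):
    """Feature-wise Linear Modulation for 1D feature maps (illustrative helper).

    x     : (B, C, L) activations
    gamma : (B, C) per-channel scales from the conditioning MLP
    beta  : (B, C) per-channel shifts from the conditioning MLP
    """
    # Reshape the affine parameters to (B, C, 1) so they broadcast along L.
    return gamma[:, :, None] * x + beta[:, :, None]

# Toy check: gamma=2, beta=1 doubles and shifts every position of every channel.
x = np.ones((1, 3, 5))            # B=1, C=3, L=5
gamma = np.full((1, 3), 2.0)
beta = np.full((1, 3), 1.0)
y = film(x, gamma, beta)          # every entry becomes 2*1 + 1 = 3
```

The key point is that $\gamma$ and $\beta$ vary per sample and per channel, but are constant along the sequence axis.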
2. 1D U-Net Backbone and FiLM Integration Strategy
The 1D U-Net retains the canonical encoder-decoder structure, adapted as follows:
- Encoder path: resolution reduction via Conv1D+stride or pooling per block, typically with channel progression 64, 128, 256, 512, 1024. Each block comprises Conv1D → BatchNorm1D → ReLU.
- Decoder path: resolution increased symmetrically via ConvTranspose1D (or upsample+Conv1D), concatenating skip connections from the corresponding encoder blocks, followed by two Conv1D → BN1D → ReLU layers.
- FiLM insertion: In the configuration experimentally found optimal, FiLM conditioning is applied exclusively in decoder blocks—specifically, after skip fusion and the convolutional stack, just before the next upsampling. No FiLM is applied in the encoder or latent/bottleneck representations. Optionally, an additional FiLM ("late FiLM") may be inserted at the logits stage, with effects on performance dependent on task (Jacenków et al., 2019).
Layerwise overview in 1D:
| Stage | Input Shape | Operation | Notes |
|---|---|---|---|
| Encoder block | $(B, C_{\text{in}}, L)$ | Conv1D → BN1D → ReLU; downsample ×2 | No FiLM applied |
| Decoder block | $(B, C_{\text{in}}, L/2)$ plus skip | Upsample/TransConv; concat skip; 2×(Conv1D → BN1D → ReLU); FiLM | FiLM $\gamma, \beta$ via MLP, broadcast along $L$ |
| Final layer | $(B, C, L)$ | Conv1D, optional late FiLM | Produces segmentation/output logits |
Skip connections from encoder to decoder are not modulated; the FiLM layer acts on the merged features in the decoder, allowing the model to dynamically gate or re-weight multi-scale features conditioned on $z$. Empirically, "decoder fusion" with FiLM significantly outperforms encoder- or bottleneck-side conditioning.
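The decoder-side ordering described above (upsample, fuse skip, convolve, then modulate) can be sketched as follows, with the Conv1D → BN1D → ReLU stack reduced to a pointwise channel mix for brevity; all names, shapes, and weights here are illustrative, not taken from the cited implementations:

```python
import numpy as np

rng = np.random.default_rng(0)

def pointwise_conv(x, w):
    # Stand-in for a Conv1D -> BN1D -> ReLU stack: a 1x1 channel mix + ReLU.
    # x: (B, C_in, L), w: (C_out, C_in) -> (B, C_out, L)
    return np.maximum(np.einsum('oc,bcl->bol', w, x), 0.0)

def decoder_block(x, skip, w1, w2, gamma, beta):
    """Order of operations in a FiLM-conditioned decoder block:
    upsample -> concatenate skip -> conv stack -> FiLM (post-fusion)."""
    x = np.repeat(x, 2, axis=-1)                  # nearest-neighbour upsample x2
    x = np.concatenate([x, skip], axis=1)         # skip fusion (skip unmodulated)
    x = pointwise_conv(pointwise_conv(x, w1), w2)
    return gamma[:, :, None] * x + beta[:, :, None]   # FiLM applied last

x = rng.standard_normal((2, 8, 16))               # coarse decoder features
skip = rng.standard_normal((2, 4, 32))            # encoder skip at 2x resolution
w1 = rng.standard_normal((8, 12)) * 0.1           # 12 = 8 upsampled + 4 skip ch.
w2 = rng.standard_normal((4, 8)) * 0.1
gamma = np.ones((2, 4)); beta = np.zeros((2, 4))
out = decoder_block(x, skip, w1, w2, gamma, beta) # (2, 4, 32)
```

Placing FiLM after the convolutional stack lets the conditioning re-weight features that already combine both encoder and decoder information.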
3. Conditioning Vector Handling and FiLM Parameter Generation
The conditioning vector $z$ can be any set of non-imaging features (e.g., metadata, control signals, task specifiers, one-hot labels). Practical implementations suggest:
- For low-dimensional $z$ (e.g., a handful of scalars), $z$ is embedded via a small MLP to a modest dimensionality such as $64$.
- Each FiLM MLP per decoder block receives the shared representation and outputs $2C$ values ($C$ scales and $C$ shifts, one pair per channel).
Each FiLM MLP comprises two hidden layers with widths close to the block's channel count $C$, ReLU activations, and produces per-channel $\gamma_c$ and $\beta_c$ parameters per sample. For source separation scenarios where the condition is categorical (e.g., instrument selection), the control input is one-hot and embedded via either a fully-connected net or 1D CNN, with parallel heads generating all block-level or channel-level FiLM parameters (Meseguer-Brocal et al., 2019).
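The shared-embedding-plus-per-block-heads scheme might look like the following in NumPy; the layer widths and the `mlp`/`heads` names are hypothetical choices consistent with the description above:

```python
import numpy as np

rng = np.random.default_rng(1)

def mlp(z, weights):
    # Plain MLP: ReLU on all but the final (linear) layer.
    for i, (w, b) in enumerate(weights):
        z = z @ w + b
        if i < len(weights) - 1:
            z = np.maximum(z, 0.0)
    return z

d, emb_dim = 3, 64                    # raw condition dim, shared embedding dim
block_channels = [512, 256, 128, 64]  # decoder channel counts (illustrative)

# Shared embedding of the raw condition vector.
embed_w = [(rng.standard_normal((d, emb_dim)) * 0.1, np.zeros(emb_dim))]

# One small head per decoder block, each emitting 2*C values (scales + shifts),
# with hidden width close to that block's channel count C.
heads = {
    c: [(rng.standard_normal((emb_dim, c)) * 0.1, np.zeros(c)),
        (rng.standard_normal((c, 2 * c)) * 0.1, np.zeros(2 * c))]
    for c in block_channels
}

z = rng.standard_normal((5, d))       # batch of 5 conditioning vectors
e = mlp(z, embed_w)                   # (5, 64) shared representation
film_params = {c: np.split(mlp(e, heads[c]), 2, axis=-1)
               for c in block_channels}   # per block: gamma (5, C), beta (5, C)
```

Sharing the embedding while keeping per-block heads keeps the conditioning pathway cheap relative to the convolutional backbone.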
4. Practical Adaptation from 2D to 1D for Signal Processing
To adapt the architecture from 2D to 1D for tasks such as biomedical signal segmentation or audio processing, the required adjustments are:
- Conv2D → Conv1D
- BatchNorm2D → BatchNorm1D
- TransposedConv2D → ConvTranspose1D (stride=2) or Upsample+Conv1D
- Tensor shape: $(B, C, H, W) \to (B, C, L)$
- FiLM parameter broadcast: reshape $\gamma, \beta$ to $(B, C, 1)$
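The shape changes above amount to dropping one spatial axis; a quick NumPy check of the FiLM broadcast in both layouts (shapes here are arbitrary examples):

```python
import numpy as np

# 2D feature map (B, C, H, W) versus its 1D counterpart (B, C, L).
x2d = np.zeros((8, 16, 32, 32))
x1d = np.zeros((8, 16, 256))

# The FiLM broadcast differs only in the number of trailing singleton axes:
gamma = np.ones((8, 16))
y2d = gamma[:, :, None, None] * x2d   # (B, C, 1, 1) broadcasts over H and W
y1d = gamma[:, :, None] * x1d         # (B, C, 1)    broadcasts over L
```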
Hyperparameter choices:
- Channels: 64 → 128 → 256 → 512 → 1024 across the five scales
- Kernel size: 3 or 5
- Depth: 4–5 downsampling/upsampling blocks
- MLP for FiLM: two hidden layers of width close to the block's channel count $C$; output dimensionality $2C$
- For metadata with few scalars: initial embedding to a modest dimensionality (e.g., $64$) via a small MLP
The receptive field of the encoder grows with kernel size, depth, and the cumulative downsampling factor. The use of FiLM is notably effective in small-data regimes, as quantified on the cardiac MRI ACDC dataset, where mean Dice rose from $0.39$ (baseline) to $0.55$ (FiLM-U-Net) at a substantially reduced training-set size (Jacenków et al., 2019). This suggests the benefit is amplified where supervision is scarce.
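The receptive field can be computed from the block count, kernel size, and stride schedule. A small helper sketch, assuming each block applies its convolutions before a stride-2 downsample (exact numbers differ if downsampling uses strided convolutions; the function name and defaults are illustrative):

```python
def receptive_field(num_blocks, kernel=3, convs_per_block=2, stride=2):
    """Receptive field (in input samples) of the encoder path: each block
    applies `convs_per_block` convolutions, then downsamples by `stride`."""
    rf, jump = 1, 1
    for _ in range(num_blocks):
        for _ in range(convs_per_block):
            rf += (kernel - 1) * jump
        jump *= stride   # downsampling multiplies the effective step size
    return rf

# e.g. 5 blocks, kernel 3, two convolutions per block
rf = receptive_field(5)   # 125 input samples under these assumptions
```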
5. Empirical Performance and Application Scenarios
On 2D medical segmentation (ACDC), decoder-side FiLM produced a higher mean Dice than unconditioned U-Nets with full training data, and the relative improvement grew dramatically as training data decreased, implying strong utility in domains where labeled examples are rare. For audio source separation, the Conditioned-U-Net architecture with FiLM equaled or exceeded the performance of specialized, single-task U-Nets across the SDR, SIR, and SAR separation metrics, despite using a single parameter-shared model (Meseguer-Brocal et al., 2019). The results demonstrate that feature-wise modulation enables sample-wise adaptation, increasing model flexibility without increasing base model complexity.
6. Architectural Trade-offs and Limitations
The main findings from direct experiments are:
- Decoder fusion (FiLM on decoder side) is more effective and stable than encoder- or bottleneck-side integration. Applying FiLM before skip concatenation is suboptimal; post-fusion application after convolution is preferred.
- Skip connections are left unmodulated, and FiLM learns to gate the combined encoder-decoder representations.
- Removing skip connections entirely can allow FiLM to outperform baseline U-Nets (in 2D settings with segmentation), but this is task-dependent.
- In high-data regimes, absolute performance benefits of FiLM may be modest or not statistically significant. Conversely, low-data scenarios realize more tangible gains (Jacenków et al., 2019).
- In source separation, no significant difference was observed between the conditioned and single-task U-Nets in global performance metrics (Tukey's test and Pearson correlation analysis).
7. Implementation Guidelines and Typical Hyperparameters
For practical deployment of a 1D U-Net with FiLM for conditional signal segmentation or source separation, key implementation guidelines are as follows:
- Use five encoder/decoder blocks with channels 64→1024, Conv1D kernel size 3 or 5.
- Downsample (stride=2) per block; upsample symmetrically with ConvTranspose1D or upsampling followed by Conv1D.
- After decoder skip fusion and after the convolution stack, apply FiLM:
  - Inputs: the shared conditioning embedding (the raw $z$ passed through a small MLP)
  - Per-block FiLM MLP: 2 layers, width close to the block's channel count $C$, output $2C$.
- Broadcast scales/shifts along the sequence dimension.
- Optionally, a late FiLM may be applied at the output head.
- For training: Adam optimizer, batch size of 16, and a task-tuned learning rate; L1 loss on appropriate outputs.
- For categorical conditions (e.g., instrument selection), one-hot encode and embed through a small net paralleling the main architecture (Meseguer-Brocal et al., 2019).
- Progressive conditioning noise may aid robustness: regularly inject small perturbations into $z$ during training.
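Conditioning-noise injection can be as simple as the following sketch (the `sigma` value and helper name are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)

def perturb_condition(z, sigma=0.05, training=True):
    """Add small Gaussian noise to the conditioning vector during training
    so the FiLM generator does not overfit to exact condition values."""
    if not training:
        return z
    return z + sigma * rng.standard_normal(z.shape)

z = np.array([[0.7, 0.2, 1.0]])
z_train = perturb_condition(z)                  # noisy copy for this step
z_eval = perturb_condition(z, training=False)   # unchanged at inference
```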
In sum, 1D U-Nets with feature-wise linear modulation provide a powerful architectural mechanism for test-time adaptive behavior via auxiliary signals, with demonstrated benefits in both medical and audio signal processing domains, particularly under limited-data and multi-task settings.