
1D U-Net with FiLM Conditioning

Updated 14 November 2025
  • 1D U-Net with FiLM is a convolutional encoder-decoder architecture that integrates per-sample conditioning via affine modulation of feature maps.
  • The model applies Feature-wise Linear Modulation in the decoder blocks, where a small MLP generates per-channel scales and biases that are applied after skip fusion, improving performance in low-data regimes.
  • Empirical studies demonstrate enhanced segmentation metrics, such as improved Dice scores, and effective source separation performance with minimal extra complexity.

A 1D U-Net architecture with Feature-wise Linear Modulation (FiLM) is a variant of the U-Net model for signal processing that enables per-sample conditioning on external vectors (typically metadata or control signals) by modulating feature maps with affine transformations at selected stages of the network. FiLM was originally studied in 2D settings, for medical image segmentation and spectrogram-based audio source separation; its application to 1D U-Nets carries over the same architectural principles using 1D convolutions, batch normalization layers, and gating of the feature flow. Modulation is achieved by generating per-channel scale and bias parameters with a small multilayer perceptron (MLP) conditioned on the external data; these parameters are then broadcast and applied to the intermediate activations. This methodology is particularly advantageous in settings with limited training data, where it demonstrably improves generalization by adapting network responses to auxiliary information (Jacenków et al., 2019, Meseguer-Brocal et al., 2019).

1. Mathematical Formulation of Feature-wise Linear Modulation

Let $x_{n,\ell,c}$ denote the activation for batch index $n$, sequence position (length) $\ell$, and channel $c$ in a 1D U-Net, and let $z \in \mathbb{R}^d$ represent the conditioning vector (e.g., metadata, control signals). FiLM computes a per-channel scale $\gamma_c(z)$ and bias $\beta_c(z)$ by applying a small MLP to $z$:

$$\mathrm{FiLM}(x)_{n,\ell,c} = \gamma_c(z)\, x_{n,\ell,c} + \beta_c(z), \qquad [\gamma(z), \beta(z)] = \mathrm{MLP}(z) \in \mathbb{R}^{2C}.$$

For 1D signals, the $\gamma$ and $\beta$ vectors are reshaped to $(N, C, 1)$ and broadcast along the sequence dimension $L$.
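
As a concrete illustration, a minimal FiLM layer for 1D feature maps might look as follows, assuming PyTorch; the class name `FiLM1d` and the hidden width are arbitrary choices, not taken from the cited papers.

```python
import torch
import torch.nn as nn

class FiLM1d(nn.Module):
    """Predicts per-channel (gamma, beta) from a conditioning vector z
    and applies them to a (N, C, L) activation tensor."""
    def __init__(self, cond_dim: int, num_channels: int, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(cond_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2 * num_channels),  # gamma and beta, stacked
        )

    def forward(self, x: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.mlp(z).chunk(2, dim=-1)  # each (N, C)
        # Reshape to (N, C, 1) so the modulation broadcasts along L.
        return gamma.unsqueeze(-1) * x + beta.unsqueeze(-1)

# Shape check: batch of 4 signals, 128 channels, length 256, 8-dim condition.
x = torch.randn(4, 128, 256)
z = torch.randn(4, 8)
film = FiLM1d(cond_dim=8, num_channels=128)
assert film(x, z).shape == (4, 128, 256)
```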

2. 1D U-Net Backbone and FiLM Integration Strategy

The 1D U-Net retains the canonical encoder-decoder structure, adapted as follows:

  • Encoder path: $L \rightarrow L/2 \rightarrow \dots \rightarrow L/16$ resolution reduction via strided Conv1D or pooling per block, typically with channel progression $C_1=64$, $C_2=128$, $C_3=256$, $C_4=512$, $C_5=1024$. Each block comprises Conv1D → BatchNorm1D → ReLU.
  • Decoder path: resolution is increased symmetrically via ConvTranspose1D (or upsampling followed by Conv1D), concatenating the skip connection from the corresponding encoder block, followed by two Conv1D → BN1D → ReLU layers.
  • FiLM insertion: in the configuration found experimentally optimal, FiLM conditioning is applied exclusively in the decoder blocks, specifically after skip fusion and the convolutional stack, just before the next upsampling. No FiLM is applied to the encoder or the latent/bottleneck representation. Optionally, an additional FiLM layer ("late FiLM") may be inserted at the logits stage, with task-dependent effects on performance (Jacenków et al., 2019). A minimal decoder-block sketch follows this list.
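
To make the insertion point concrete, here is a sketch of a single decoder block with FiLM applied after skip fusion and the convolution stack. It assumes PyTorch and the `FiLM1d` module from the previous snippet; the block structure and names are illustrative.

```python
import torch
import torch.nn as nn

class FiLMDecoderBlock1d(nn.Module):
    """Upsample, concatenate the (unmodulated) encoder skip, run the
    Conv1D -> BN1D -> ReLU stack twice, then apply FiLM last."""
    def __init__(self, in_ch: int, skip_ch: int, out_ch: int, cond_dim: int):
        super().__init__()
        self.up = nn.ConvTranspose1d(in_ch, out_ch, kernel_size=2, stride=2)
        self.convs = nn.Sequential(
            nn.Conv1d(out_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm1d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv1d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm1d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.film = FiLM1d(cond_dim, out_ch)  # from the previous sketch

    def forward(self, x, skip, z):
        x = self.up(x)                   # double the sequence length
        x = torch.cat([x, skip], dim=1)  # skip fusion (skip left unmodulated)
        x = self.convs(x)
        return self.film(x, z)           # FiLM after the conv stack
```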

Layerwise overview in 1D:

| Stage | Input shape | Operation | Notes |
|---|---|---|---|
| Encoder block $i$ | $(N, C_i, L/2^{i-1})$ | Conv1D → BN1D → ReLU | No FiLM applied |
| Decoder block $i$ | $(N, C_i, L/2^{i-1})$ | Upsample/TransConv, concat skip; 2× (Conv1D → BN1D → ReLU); FiLM | FiLM via MLP$(z)$, broadcast |
| Final layer | $(N, C_1, L)$ | $1 \times 1$ Conv1D, optional late FiLM | Produces segmentation/output logits |

Skip connections from encoder to decoder are not modulated; the FiLM layer acts on the merged features in the decoder, allowing the model to dynamically gate or re-weight multi-scale features conditioned on $z$. Empirically, "decoder fusion" with FiLM significantly outperforms encoder- or bottleneck-side conditioning.

3. Conditioning Vector Handling and FiLM Parameter Generation

The conditioning vector $z$ can be any set of non-imaging features (e.g., metadata, control signals, task specifiers, one-hot labels). Practical implementations suggest:

  • For low-dimensional $z$ ($d \leq 8$), $z$ is embedded via a small MLP to dimensionality $d' \approx 32$ or $64$.
  • Each decoder block's FiLM MLP receives the shared $z$ representation and outputs $2C_i$ values (scales and shifts for all $C_i$ channels).

Each FiLM MLP comprises two hidden layers with widths close to $C_i$ and ReLU activations, and produces per-channel ($C_i$) $\gamma$ and $\beta$ parameters per sample. For source-separation scenarios where the condition is categorical (e.g., instrument selection), the control input is one-hot encoded and embedded via either a fully connected network or a 1D CNN, with parallel heads generating all block-level or channel-level FiLM parameters (Meseguer-Brocal et al., 2019).
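
A hedged sketch of this conditioning pathway: a shared embedding MLP for a low-dimensional or one-hot $z$, with one parallel linear head per decoder block emitting the $2C_i$ FiLM parameters. The class name `ConditionEncoder` and the layer widths are illustrative assumptions, not the exact networks from the cited papers.

```python
import torch
import torch.nn as nn

class ConditionEncoder(nn.Module):
    """Embeds raw metadata (or a one-hot label) into a shared representation,
    then emits per-block (gamma, beta) pairs through parallel heads."""
    def __init__(self, cond_dim: int, embed_dim: int, block_channels):
        super().__init__()
        self.embed = nn.Sequential(
            nn.Linear(cond_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim), nn.ReLU(),
        )
        # One head per decoder block, each producing 2 * C_i values.
        self.heads = nn.ModuleList(
            nn.Linear(embed_dim, 2 * c) for c in block_channels
        )

    def forward(self, z):
        h = self.embed(z)
        # A list of (gamma, beta) tuples, one per decoder block.
        return [head(h).chunk(2, dim=-1) for head in self.heads]

# Example: 4-class one-hot condition, decoder channels 512/256/128/64.
enc = ConditionEncoder(cond_dim=4, embed_dim=32,
                       block_channels=[512, 256, 128, 64])
params = enc(torch.eye(4))  # batch holding all four one-hot conditions
# gamma shapes per block: (4, 512), (4, 256), (4, 128), (4, 64)
```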

4. Practical Adaptation from 2D to 1D for Signal Processing

When moving from 2D to 1D for tasks such as biomedical signal segmentation or audio processing, the required adjustments are:

  • Conv2D $(k,k)$ → Conv1D $(k)$
  • BatchNorm2D → BatchNorm1D
  • TransposedConv2D → ConvTranspose1D (stride=2) or Upsample+Conv1D
  • Tensor shape: $(N, C, H, W)$ → $(N, C, L)$
  • FiLM parameter broadcast: reshape $\gamma^{(i)}, \beta^{(i)}$ to $(N, C_i, 1)$
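
As a concrete instance of these substitutions, here is a minimal resolution-preserving 1D convolutional block in PyTorch (the helper name `conv_block_1d` is an assumption); the $L \rightarrow L/2$ reduction is applied separately via pooling or a strided Conv1d, per the encoder description above.

```python
import torch.nn as nn

def conv_block_1d(in_ch: int, out_ch: int, k: int = 3) -> nn.Sequential:
    """Two Conv1D -> BN1D -> ReLU layers at fixed resolution; pair with
    nn.MaxPool1d(2) or a strided Conv1d for the L -> L/2 reduction."""
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, k, padding=k // 2),
        nn.BatchNorm1d(out_ch),
        nn.ReLU(inplace=True),
        nn.Conv1d(out_ch, out_ch, k, padding=k // 2),
        nn.BatchNorm1d(out_ch),
        nn.ReLU(inplace=True),
    )
```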

Hyperparameter choices:

  • Channels: $64 \to 128 \to 256 \to 512 \to 1024$ across the successive $L$ scales
  • Kernel size: 3 or 5
  • Depth: 4–5 downsampling/upsampling blocks
  • MLP for FiLM: two hidden layers of width $C_i$; output $2C_i$
  • For metadata with few scalars: initial embedding to $d=32$ or $64$ via a small MLP

A typical receptive field is on the order of $k \cdot 2^{\text{depth}}$ samples; for example, $k=5$ with depth 5 gives roughly $5 \cdot 2^5 = 160$. The use of FiLM is notably effective in small-data regimes, as quantified on the ACDC cardiac MRI dataset, where mean Dice rose from $0.39$ (baseline) to $0.55$ (FiLM U-Net) at $6\%$ training-set size (Jacenków et al., 2019). This suggests the benefit is amplified where supervision is scarce.
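
Putting the pieces together, the following is a minimal end-to-end wiring sketch under the hyperparameters above (depth 4 shown for brevity, kernel size 3). It reuses the hypothetical `conv_block_1d`, `FiLMDecoderBlock1d`, and `FiLM1d` definitions from the earlier sketches; all names and defaults are illustrative, not a definitive implementation.

```python
import torch
import torch.nn as nn

class ConditionedUNet1d(nn.Module):
    def __init__(self, in_ch=1, out_ch=1, cond_dim=8,
                 channels=(64, 128, 256, 512)):
        super().__init__()
        self.pool = nn.MaxPool1d(2)
        encs, prev = [], in_ch
        for c in channels:                        # encoder path, no FiLM
            encs.append(conv_block_1d(prev, c))
            prev = c
        self.encoders = nn.ModuleList(encs)
        self.bottleneck = conv_block_1d(channels[-1], 2 * channels[-1])
        decs, prev = [], 2 * channels[-1]
        for c in reversed(channels):              # decoder path, FiLM per block
            decs.append(FiLMDecoderBlock1d(prev, c, c, cond_dim))
            prev = c
        self.decoders = nn.ModuleList(decs)
        self.head = nn.Conv1d(channels[0], out_ch, kernel_size=1)

    def forward(self, x, z):
        skips = []
        for enc in self.encoders:
            x = enc(x)
            skips.append(x)       # unmodulated skip at this resolution
            x = self.pool(x)      # L -> L/2
        x = self.bottleneck(x)
        for dec, skip in zip(self.decoders, reversed(skips)):
            x = dec(x, skip, z)   # FiLM applied after skip fusion
        return self.head(x)

# Smoke test: length divisible by 2**depth; 8-dim conditioning vector.
net = ConditionedUNet1d()
out = net(torch.randn(2, 1, 256), torch.randn(2, 8))
assert out.shape == (2, 1, 256)
```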

5. Empirical Performance and Application Scenarios

On 2D medical segmentation (ACDC), decoder-side FiLM produced a mean Dice of $0.91 \pm 0.02$, versus $0.89 \pm 0.04$ for unconditioned U-Nets with the full training data. Dramatic relative improvements emerged as training data decreased, implying strong utility in domains where labeled examples are rare. For audio source separation, the Conditioned-U-Net architecture with FiLM equaled or exceeded the performance of specialized, single-task U-Nets across the SDR, SIR, and SAR separation metrics, despite using a single parameter-shared model (Meseguer-Brocal et al., 2019). These results demonstrate that feature-wise modulation enables sample-wise adaptation, increasing model flexibility without increasing base model complexity.

6. Architectural Trade-offs and Limitations

The main findings from direct experiments are as follows:

  • Decoder fusion (FiLM on decoder side) is more effective and stable than encoder- or bottleneck-side integration. Applying FiLM before skip concatenation is suboptimal; post-fusion application after convolution is preferred.
  • Skip connections are left unmodulated, and FiLM learns to gate the combined encoder-decoder representations.
  • Removing skip connections entirely can allow FiLM to outperform baseline U-Nets (in 2D settings with segmentation), but this is task-dependent.
  • In high-data regimes, absolute performance benefits of FiLM may be modest or not statistically significant. Conversely, low-data scenarios realize more tangible gains (Jacenków et al., 2019).
  • In source separation, no significant difference was observed between the conditioned and single-task U-Nets in global performance metrics (Tukey test; Pearson $r > 0.9$, $p < 10^{-3}$).

7. Implementation Guidelines and Typical Hyperparameters

For practical deployment of a 1D U-Net with FiLM for conditional signal segmentation or source separation, key implementation guidelines are as follows:

  • Use five encoder/decoder blocks with channels 64→1024, Conv1D kernel size 3 or 5.
  • Downsample (stride=2) per block; upsample symmetrically with ConvTranspose1D or upsampling followed by Conv1D.
  • After decoder skip fusion and after the convolution stack, apply FiLM:
    • Inputs: shared embedded $z$ (via MLP, $d = 32$ to $64$)
    • Per-block FiLM MLP: two layers, width $\approx$ the block's channel count, output $2C_i$.
    • Broadcast scales/shifts along the sequence dimension.
  • Optionally, a late FiLM may be applied at the output head.
  • For training: Adam optimizer, batch size 16, learning rate $10^{-3}$; L1 loss on the appropriate outputs.
  • For categorical conditions (e.g., instrument selection), one-hot encode and embed through a small net paralleling the main architecture (Meseguer-Brocal et al., 2019).
  • Progressive conditioning noise may aid robustness: regularly inject small perturbations into $z$ during training (see the training sketch below).
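
The recipe above can be condensed into a short training-step sketch. It reuses the hypothetical `ConditionedUNet1d` from the earlier sketch; the random tensors, `noise_std` value, and helper names are illustrative assumptions, not prescriptions from the cited papers.

```python
import torch
import torch.nn.functional as F

model = ConditionedUNet1d(cond_dim=8)  # from the earlier sketch
optim = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(signal, target, z, noise_std=0.05):
    """One optimization step with L1 loss and conditioning-noise injection."""
    optim.zero_grad()
    z_noisy = z + noise_std * torch.randn_like(z)  # perturb z for robustness
    loss = F.l1_loss(model(signal, z_noisy), target)
    loss.backward()
    optim.step()
    return loss.item()

# One illustrative step on random tensors (batch size 16, length 512).
loss = train_step(torch.randn(16, 1, 512), torch.randn(16, 1, 512),
                  torch.randn(16, 8))
```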

In sum, 1D U-Nets with feature-wise linear modulation provide a powerful architectural mechanism for test-time adaptive behavior driven by auxiliary signals, with demonstrated benefits in both medical and audio signal processing, particularly under limited-data and multi-task settings.
