1D U-Net with FiLM Conditioning
- 1D U-Net with FiLM is a convolutional encoder-decoder architecture that integrates per-sample conditioning via affine modulation of feature maps.
- The model applies Feature-wise Linear Modulation via an MLP on decoder blocks, adapting per-channel scales and biases after skip fusion to improve performance in low-data regimes.
- Empirical studies demonstrate enhanced segmentation metrics, such as improved Dice scores, and effective source separation performance with minimal extra complexity.
A 1D U-Net architecture with Feature-wise Linear Modulation (FiLM) is a variant of the U-Net model for signal processing that enables per-sample conditioning using external vectors—typically metadata or control signals—by modulating feature maps via affine transformations at selected stages of the network. Originally studied in 2D contexts for image segmentation and spectrogram-based audio source separation, the application to 1D U-Nets leverages the same architectural principles using 1D convolutions, batch normalization layers, and appropriate gating of feature flow. Modulation is achieved by generating per-channel scaling and bias parameters via a small multilayer perceptron (MLP) conditioned on the external data; these parameters are then broadcast and applied to the intermediate activations. This methodology is particularly advantageous in settings with limited training data, where it demonstrably improves generalization by adapting network responses based on auxiliary information (Jacenków et al., 2019, Meseguer-Brocal et al., 2019).
1. Mathematical Formulation of Feature-wise Linear Modulation
Let $x_{b,l,c}$ denote the activation for batch index $b$, sequence position (length index) $l$, and channel $c$ in a 1D U-Net. Let $z \in \mathbb{R}^d$ represent the conditioning vector (e.g., metadata, control signals). FiLM computes scaling and bias per channel by applying a small MLP to $z$:

$$\mathrm{FiLM}(x_{b,l,c}) = \gamma_c(z)\, x_{b,l,c} + \beta_c(z),$$

where $\gamma(z), \beta(z) \in \mathbb{R}^C$. For 1D signals, the $\gamma$ and $\beta$ vectors are reshaped to $(B, C, 1)$ and broadcast along the sequence dimension ($L$).
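This affine modulation can be sketched in NumPy (a minimal illustration; the `film` helper name and its shapes are assumptions, not code from the cited papers):

```python
import numpy as np

def film(x, gamma, beta):
    """Feature-wise Linear Modulation for 1D feature maps (illustrative helper).

    x     : (B, C, L) activations
    gamma : (B, C) per-channel scales from the conditioning MLP
    beta  : (B, C) per-channel shifts from the conditioning MLP
    """
    # Reshape the affine parameters to (B, C, 1) so they broadcast along L.
    return gamma[:, :, None] * x + beta[:, :, None]

# Toy check: gamma=2, beta=1 doubles and shifts every position of every channel.
x = np.ones((1, 3, 5))            # B=1, C=3, L=5
gamma = np.full((1, 3), 2.0)
beta = np.full((1, 3), 1.0)
y = film(x, gamma, beta)          # every entry becomes 2*1 + 1 = 3
```

The key point is that $\gamma$ and $\beta$ vary per sample and per channel, but are constant along the sequence axis.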
2. 1D U-Net Backbone and FiLM Integration Strategy
The 1D U-Net retains the canonical encoder-decoder structure, adapted as follows:
- Encoder path: resolution reduction via Conv1D+stride or pooling per block, typically with channel progression 64, 128, 256, 512, 1024. Each block comprises Conv1D → BatchNorm1D → ReLU.
- Decoder path: resolution increased symmetrically via ConvTranspose1D (or upsample+Conv1D), concatenating skip connections from the corresponding encoder blocks, followed by two Conv1D → BN1D → ReLU layers.
- FiLM insertion: In the configuration experimentally found optimal, FiLM conditioning is applied exclusively in decoder blocks—specifically, after skip fusion and the convolutional stack, just before the next upsampling. No FiLM is applied in the encoder or latent/bottleneck representations. Optionally, an additional FiLM ("late FiLM") may be inserted at the logits stage, with effects on performance dependent on task (Jacenków et al., 2019).
Layerwise overview in 1D:
| Stage | Input Shape | Operation | Notes |
|---|---|---|---|
| Encoder block | $(B, C_{\text{in}}, L)$ | Conv1D → BN1D → ReLU; downsample ×2 | No FiLM applied |
| Decoder block | $(B, C_{\text{in}}, L/2)$ plus skip | Upsample/TransConv; concat skip; 2×(Conv1D → BN1D → ReLU); FiLM | FiLM $\gamma, \beta$ via MLP, broadcast along $L$ |
| Final layer | $(B, C, L)$ | Conv1D, optional late FiLM | Produces segmentation/output logits |
Skip connections from encoder to decoder are not modulated; the FiLM layer acts on the merged features in the decoder, allowing the model to dynamically gate or re-weight multi-scale features conditioned on $z$. Empirically, "decoder fusion" with FiLM significantly outperforms encoder- or bottleneck-side conditioning.
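The decoder-side ordering described above (upsample, fuse skip, convolve, then modulate) can be sketched as follows, with the Conv1D → BN1D → ReLU stack reduced to a pointwise channel mix for brevity; all names, shapes, and weights here are illustrative, not taken from the cited implementations:

```python
import numpy as np

rng = np.random.default_rng(0)

def pointwise_conv(x, w):
    # Stand-in for a Conv1D -> BN1D -> ReLU stack: a 1x1 channel mix + ReLU.
    # x: (B, C_in, L), w: (C_out, C_in) -> (B, C_out, L)
    return np.maximum(np.einsum('oc,bcl->bol', w, x), 0.0)

def decoder_block(x, skip, w1, w2, gamma, beta):
    """Order of operations in a FiLM-conditioned decoder block:
    upsample -> concatenate skip -> conv stack -> FiLM (post-fusion)."""
    x = np.repeat(x, 2, axis=-1)                  # nearest-neighbour upsample x2
    x = np.concatenate([x, skip], axis=1)         # skip fusion (skip unmodulated)
    x = pointwise_conv(pointwise_conv(x, w1), w2)
    return gamma[:, :, None] * x + beta[:, :, None]   # FiLM applied last

x = rng.standard_normal((2, 8, 16))               # coarse decoder features
skip = rng.standard_normal((2, 4, 32))            # encoder skip at 2x resolution
w1 = rng.standard_normal((8, 12)) * 0.1           # 12 = 8 upsampled + 4 skip ch.
w2 = rng.standard_normal((4, 8)) * 0.1
gamma = np.ones((2, 4)); beta = np.zeros((2, 4))
out = decoder_block(x, skip, w1, w2, gamma, beta) # (2, 4, 32)
```

Placing FiLM after the convolutional stack lets the conditioning re-weight features that already combine both encoder and decoder information.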
3. Conditioning Vector Handling and FiLM Parameter Generation
The conditioning vector $z$ can be any set of non-imaging features (e.g., metadata, control signals, task specifiers, one-hot labels). Practical implementations suggest:
- For low-dimensional $z$ (e.g., a handful of scalars), $z$ is embedded via a small MLP to a modest dimensionality such as $64$.
- Each FiLM MLP per decoder block receives the shared representation and outputs $2C$ values ($C$ scales and $C$ shifts, one pair per channel).
Each FiLM MLP comprises two hidden layers with widths close to the block's channel count $C$, ReLU activations, and produces per-channel $\gamma_c$ and $\beta_c$ parameters per sample. For source separation scenarios where the condition is categorical (e.g., instrument selection), the control input is one-hot and embedded via either a fully-connected net or 1D CNN, with parallel heads generating all block-level or channel-level FiLM parameters (Meseguer-Brocal et al., 2019).
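The shared-embedding-plus-per-block-heads scheme might look like the following in NumPy; the layer widths and the `mlp`/`heads` names are hypothetical choices consistent with the description above:

```python
import numpy as np

rng = np.random.default_rng(1)

def mlp(z, weights):
    # Plain MLP: ReLU on all but the final (linear) layer.
    for i, (w, b) in enumerate(weights):
        z = z @ w + b
        if i < len(weights) - 1:
            z = np.maximum(z, 0.0)
    return z

d, emb_dim = 3, 64                    # raw condition dim, shared embedding dim
block_channels = [512, 256, 128, 64]  # decoder channel counts (illustrative)

# Shared embedding of the raw condition vector.
embed_w = [(rng.standard_normal((d, emb_dim)) * 0.1, np.zeros(emb_dim))]

# One small head per decoder block, each emitting 2*C values (scales + shifts),
# with hidden width close to that block's channel count C.
heads = {
    c: [(rng.standard_normal((emb_dim, c)) * 0.1, np.zeros(c)),
        (rng.standard_normal((c, 2 * c)) * 0.1, np.zeros(2 * c))]
    for c in block_channels
}

z = rng.standard_normal((5, d))       # batch of 5 conditioning vectors
e = mlp(z, embed_w)                   # (5, 64) shared representation
film_params = {c: np.split(mlp(e, heads[c]), 2, axis=-1)
               for c in block_channels}   # per block: gamma (5, C), beta (5, C)
```

Sharing the embedding while keeping per-block heads keeps the conditioning pathway cheap relative to the convolutional backbone.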
4. Practical Adaptation from 2D to 1D for Signal Processing
To adapt the architecture from 2D to 1D for tasks such as biomedical signal segmentation or audio processing, the required adjustments are:
- Conv2D → Conv1D
- BatchNorm2D → BatchNorm1D
- TransposedConv2D → ConvTranspose1D (stride=2) or Upsample+Conv1D
- Tensor shape: $(B, C, H, W) \to (B, C, L)$
- FiLM parameter broadcast: reshape $\gamma, \beta$ to $(B, C, 1)$
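The shape changes above amount to dropping one spatial axis; a quick NumPy check of the FiLM broadcast in both layouts (shapes here are arbitrary examples):

```python
import numpy as np

# 2D feature map (B, C, H, W) versus its 1D counterpart (B, C, L).
x2d = np.zeros((8, 16, 32, 32))
x1d = np.zeros((8, 16, 256))

# The FiLM broadcast differs only in the number of trailing singleton axes:
gamma = np.ones((8, 16))
y2d = gamma[:, :, None, None] * x2d   # (B, C, 1, 1) broadcasts over H and W
y1d = gamma[:, :, None] * x1d         # (B, C, 1)    broadcasts over L
```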
Hyperparameter choices:
- Channels: 64 → 128 → 256 → 512 → 1024 across the five scales
- Kernel size: 3 or 5
- Depth: 4–5 downsampling/upsampling blocks
- MLP for FiLM: two hidden layers of width close to the block's channel count $C$; output dimensionality $2C$
- For metadata with few scalars: initial embedding to a modest dimensionality (e.g., $64$) via a small MLP
The receptive field of the encoder grows with kernel size, depth, and the cumulative downsampling factor. The use of FiLM is notably effective in small-data regimes, as quantified on the cardiac MRI ACDC dataset, where mean Dice rose from $0.39$ (baseline) to $0.55$ (FiLM-U-Net) at a substantially reduced training-set size (Jacenków et al., 2019). This suggests the benefit is amplified where supervision is scarce.
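The receptive field can be computed from the block count, kernel size, and stride schedule. A small helper sketch, assuming each block applies its convolutions before a stride-2 downsample (exact numbers differ if downsampling uses strided convolutions; the function name and defaults are illustrative):

```python
def receptive_field(num_blocks, kernel=3, convs_per_block=2, stride=2):
    """Receptive field (in input samples) of the encoder path: each block
    applies `convs_per_block` convolutions, then downsamples by `stride`."""
    rf, jump = 1, 1
    for _ in range(num_blocks):
        for _ in range(convs_per_block):
            rf += (kernel - 1) * jump
        jump *= stride   # downsampling multiplies the effective step size
    return rf

# e.g. 5 blocks, kernel 3, two convolutions per block
rf = receptive_field(5)   # 125 input samples under these assumptions
```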
5. Empirical Performance and Application Scenarios
On 2D medical segmentation (ACDC), decoder-side FiLM produced a higher mean Dice than unconditioned U-Nets with full training data, and the relative improvement grew dramatically as training data decreased, implying strong utility in domains where labeled examples are rare. For audio source separation, the Conditioned-U-Net architecture with FiLM equaled or exceeded the performance of specialized, single-task U-Nets across the SDR, SIR, and SAR separation metrics, despite using a single parameter-shared model (Meseguer-Brocal et al., 2019). The results demonstrate that feature-wise modulation enables sample-wise adaptation, increasing model flexibility without increasing base model complexity.
6. Architectural Trade-offs and Limitations
The main findings from direct experiments are:
- Decoder fusion (FiLM on decoder side) is more effective and stable than encoder- or bottleneck-side integration. Applying FiLM before skip concatenation is suboptimal; post-fusion application after convolution is preferred.
- Skip connections are left unmodulated, and FiLM learns to gate the combined encoder-decoder representations.
- Removing skip connections entirely can allow FiLM to outperform baseline U-Nets (in 2D settings with segmentation), but this is task-dependent.
- In high-data regimes, absolute performance benefits of FiLM may be modest or not statistically significant. Conversely, low-data scenarios realize more tangible gains (Jacenków et al., 2019).
- In source separation, no significant difference was observed between the conditioned and single-task U-Nets in global performance metrics (Tukey's test and Pearson correlation analysis).
7. Implementation Guidelines and Typical Hyperparameters
For practical deployment of a 1D U-Net with FiLM for conditional signal segmentation or source separation, key implementation guidelines are as follows:
- Use five encoder/decoder blocks with channels 64→1024, Conv1D kernel size 3 or 5.
- Downsample (stride=2) per block; upsample symmetrically with ConvTranspose1D or upsampling followed by Conv1D.
- After decoder skip fusion and after the convolution stack, apply FiLM:
  - Inputs: the shared conditioning embedding (the raw $z$ passed through a small MLP)
  - Per-block FiLM MLP: 2 layers, width close to the block's channel count $C$, output $2C$.
- Broadcast scales/shifts along the sequence dimension.
- Optionally, a late FiLM may be applied at the output head.
- For training: Adam optimizer, batch size of 16, and a task-tuned learning rate; L1 loss on appropriate outputs.
- For categorical conditions (e.g., instrument selection), one-hot encode and embed through a small net paralleling the main architecture (Meseguer-Brocal et al., 2019).
- Progressive conditioning noise may aid robustness: regularly inject small perturbations into $z$ during training.
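Conditioning-noise injection can be as simple as the following sketch (the `sigma` value and helper name are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)

def perturb_condition(z, sigma=0.05, training=True):
    """Add small Gaussian noise to the conditioning vector during training
    so the FiLM generator does not overfit to exact condition values."""
    if not training:
        return z
    return z + sigma * rng.standard_normal(z.shape)

z = np.array([[0.7, 0.2, 1.0]])
z_train = perturb_condition(z)                  # noisy copy for this step
z_eval = perturb_condition(z, training=False)   # unchanged at inference
```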
In sum, 1D U-Nets with feature-wise linear modulation provide a powerful architectural mechanism for test-time adaptive behavior via auxiliary signals, with demonstrated benefits in both medical and audio signal processing domains, particularly under limited-data and multi-task settings.