Adaptive Layer-Instance Normalization (AdaLIN)
- Adaptive Layer-Instance Normalization (AdaLIN) is a technique that combines instance and layer normalization using a learned per-layer gate to balance local details and global style.
- It adaptively interpolates between instance and layer statistics via the scalar gate ρ, enabling each decoder layer to tailor normalization strategies based on attention-driven signals.
- Empirical results on tasks like selfie-to-anime translation show that AdaLIN improves image quality by preserving identity and enhancing stylization, as indicated by lower KID scores.
Adaptive Layer-Instance Normalization (AdaLIN) is a normalization technique introduced in U-GAT-IT for unsupervised image-to-image translation, designed to interpolate adaptively between Instance Normalization (IN) and Layer Normalization (LN) using a learned, per-layer scalar gate. This approach equips generative models with the flexibility to control the trade-off between local texture (IN) and global shape/style (LN) transformations based on the learning dynamics and specific requirements of image translation tasks (Kim et al., 2019).
1. Mathematical Definition
Given an activation tensor $a \in \mathbb{R}^{C \times H \times W}$ for a single sample, AdaLIN first computes both IN-style and LN-style statistics:
- Instance Normalization (IN):
  - Channel-wise mean: $\mu_I^c = \frac{1}{HW} \sum_{h,w} a_{c,h,w}$
  - Channel-wise std: $\sigma_I^c = \sqrt{\frac{1}{HW} \sum_{h,w} \left(a_{c,h,w} - \mu_I^c\right)^2}$
- Layer Normalization (LN):
  - Layer mean: $\mu_L = \frac{1}{CHW} \sum_{c,h,w} a_{c,h,w}$
  - Layer std: $\sigma_L = \sqrt{\frac{1}{CHW} \sum_{c,h,w} \left(a_{c,h,w} - \mu_L\right)^2}$

Normalized activations:
- IN-style: $\hat{a}_I = \dfrac{a - \mu_I}{\sqrt{\sigma_I^2 + \epsilon}}$
- LN-style: $\hat{a}_L = \dfrac{a - \mu_L}{\sqrt{\sigma_L^2 + \epsilon}}$

The AdaLIN output is given by

$$\mathrm{AdaLIN}(a;\, \gamma, \beta) = \gamma \odot \left( \rho\, \hat{a}_I + (1 - \rho)\, \hat{a}_L \right) + \beta,$$

where $\gamma, \beta \in \mathbb{R}^C$ are per-channel scale and shift, and $\rho \in [0, 1]$ is a learned layerwise gate. After each gradient step, $\rho$ is updated via backpropagation and clipped to the interval $[0, 1]$.
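As a sanity check, the gate reduces the formula to pure IN at $\rho = 1$ and pure LN at $\rho = 0$. A minimal PyTorch sketch of this behavior (illustrative only, using an identity affine transform $\gamma = 1$, $\beta = 0$ to isolate the gate):

```python
import torch

torch.manual_seed(0)
eps = 1e-5
a = torch.randn(8, 16, 16)   # [C, H, W] activation for one sample
C = a.shape[0]

# IN statistics: per channel, over spatial positions
mu_I = a.view(C, -1).mean(dim=1).view(C, 1, 1)
var_I = a.view(C, -1).var(dim=1, unbiased=False).view(C, 1, 1)
a_I = (a - mu_I) / torch.sqrt(var_I + eps)

# LN statistics: over all channels and spatial positions
mu_L, var_L = a.mean(), a.var(unbiased=False)
a_L = (a - mu_L) / torch.sqrt(var_L + eps)

def blend(rho):
    # AdaLIN mixing term with gamma = 1, beta = 0
    return rho * a_I + (1 - rho) * a_L

assert torch.allclose(blend(1.0), a_I)   # rho = 1 recovers pure IN
assert torch.allclose(blend(0.0), a_L)   # rho = 0 recovers pure LN
```

Intermediate values of $\rho$ interpolate linearly between the two normalized views of the same activation.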
2. Learnable Parameters and Their Roles
AdaLIN introduces three principal sets of learnable parameters per layer:
- Gate parameter ($\rho$): One scalar per AdaLIN layer, initialized to $1$ (favoring IN) in decoder residual blocks and $0$ (favoring LN) in up-sampling blocks. It is updated via gradient descent and clamped to $[0, 1]$ post-update.
- Affine transform ($\gamma$, $\beta$): One scaling ($\gamma$) and one shifting ($\beta$) parameter per channel. In U-GAT-IT, these are dynamically computed from attention embeddings rather than being static learnable parameters.
- Attention-driven MLP: Three fully connected layers with hidden size 256 and ReLU activations map global-pooled attention features to $(\gamma, \beta)$. These MLPs are trained end-to-end with the rest of the generator.
This arrangement allows every decoder layer to adaptively select its own normalization mode and affine transformation based on learned attention signals.
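A hedged sketch of how these three parameter sets might be wired together in PyTorch; the pooled-feature dimension `feat_dim` and all variable names are illustrative assumptions, not the official implementation:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
C = 32          # channels in the AdaLIN layer
feat_dim = 64   # dimension of the pooled attention feature (an assumption)

# (1) Gate: one scalar per AdaLIN layer, initialized to 1.0 in residual
#     blocks (favoring IN) or 0.0 in up-sampling blocks (favoring LN)
rho = nn.Parameter(torch.ones(1))

# (2, 3) gamma/beta are produced by a small MLP over pooled attention
#        features: three FC layers, hidden size 256, ReLU activations
mlp = nn.Sequential(
    nn.Linear(feat_dim, 256), nn.ReLU(True),
    nn.Linear(256, 256), nn.ReLU(True),
    nn.Linear(256, 2 * C),
)

pooled = torch.randn(4, feat_dim)           # stand-in for pooled attention features
gamma, beta = mlp(pooled).chunk(2, dim=1)   # each has shape [4, C]

# after each optimizer step, rho is clipped back into [0, 1]
with torch.no_grad():
    rho.clamp_(0.0, 1.0)

assert gamma.shape == (4, C) and beta.shape == (4, C)
assert 0.0 <= rho.item() <= 1.0
```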
3. Relation to Other Normalization Techniques
AdaLIN generalizes and interpolates between different normalization approaches:
| Normalization | Statistic Scope | Typical Effects | Limitation | AdaLIN Behavior |
|---|---|---|---|---|
| Instance Norm (IN) | Channel over spatial | Style removal, local consistency | May lose global (shape) structure | Recovered at $\rho = 1$ |
| Layer Norm (LN) | All channels & spatial | Preserves global structure | May oversmooth, weaken local details | Recovered at $\rho = 0$ |
| AdaIN | External style statistics as affine parameters | Style transfer | Interpolates only style, not normalization | $\rho = 1$ with style-derived $\gamma, \beta$ |
| BIN | Batch/instance interpolation | Batch + instance, fixed blend | Blends BN & IN, not adaptive per sample/layer | - |
In contrast to AdaIN, which modulates style via external instance statistics, and BIN, which blends BN and IN rather than IN and LN, AdaLIN learns a per-layer gate $\rho$ that selects the appropriate balance point between content and style preservation for each decoder layer (Kim et al., 2019).
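To make the relation to AdaIN concrete: fixing $\rho = 1$ and supplying the style image's per-channel statistics as $(\gamma, \beta)$ recovers AdaIN exactly. A small illustrative check (tensor shapes are arbitrary):

```python
import torch

torch.manual_seed(0)
eps = 1e-5
content = torch.randn(16, 8, 8)   # [C, H, W]
style = torch.randn(16, 8, 8)
C = content.shape[0]

def in_stats(t):
    # per-channel mean and std over spatial positions
    mu = t.view(C, -1).mean(dim=1).view(C, 1, 1)
    sigma = torch.sqrt(t.view(C, -1).var(dim=1, unbiased=False).view(C, 1, 1) + eps)
    return mu, sigma

mu_c, sigma_c = in_stats(content)
mu_s, sigma_s = in_stats(style)

# AdaIN: align the content's channel statistics with the style's
adain = sigma_s * (content - mu_c) / sigma_c + mu_s

# AdaLIN with rho = 1 (pure IN branch), gamma = sigma_s, beta = mu_s
x_I = (content - mu_c) / sigma_c
mu_L, var_L = content.mean(), content.var(unbiased=False)
x_L = (content - mu_L) / torch.sqrt(var_L + eps)
rho = 1.0
adalin = sigma_s * (rho * x_I + (1 - rho) * x_L) + mu_s

assert torch.allclose(adain, adalin, atol=1e-6)
```

BIN admits no such reduction, since AdaLIN never computes batch statistics.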
4. Integration and Workflow within U-GAT-IT
AdaLIN is integrated exclusively in the decoder of each generator in U-GAT-IT. Its placement and usage are as follows:
- Each decoder residual block contains an AdaLIN layer immediately after its convolution ("AdaResBlock").
- In decoder up-sampling convolutions, AdaLIN is applied before the activation function.
- Encoders use standard Instance Normalization to assist the auxiliary classifier.
- No AdaLIN is employed in discriminators, which instead utilize spectral normalization and a class activation map (CAM)-based attention mechanism.
Key hyperparameters include batch size $1$, learning rate $10^{-4}$ (constant for 500k iterations, linearly decayed to zero by 1M iterations), MLP hidden dimension $256$, Adam optimizer settings $(\beta_1, \beta_2) = (0.5, 0.999)$, and weight decay $10^{-4}$.
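Under the assumptions above, a decoder residual block with AdaLIN ("AdaResBlock") might look as follows; the exact block structure is inferred from the description, not copied from the official code, and $\gamma, \beta$ are passed in from the attention-driven MLP:

```python
import torch
import torch.nn as nn

def adalin(x, rho, gamma, beta, eps=1e-5):
    # x: [N, C, H, W]; rho: scalar tensor in [0, 1]; gamma, beta: [N, C]
    mu_I = x.mean(dim=(2, 3), keepdim=True)
    var_I = x.var(dim=(2, 3), keepdim=True, unbiased=False)
    mu_L = x.mean(dim=(1, 2, 3), keepdim=True)
    var_L = x.var(dim=(1, 2, 3), keepdim=True, unbiased=False)
    x_I = (x - mu_I) / torch.sqrt(var_I + eps)
    x_L = (x - mu_L) / torch.sqrt(var_L + eps)
    mixed = rho * x_I + (1 - rho) * x_L
    return gamma[..., None, None] * mixed + beta[..., None, None]

class AdaResBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU(True)
        # rho initialized to 1 in residual blocks (favoring IN)
        self.rho1 = nn.Parameter(torch.ones(1))
        self.rho2 = nn.Parameter(torch.ones(1))

    def forward(self, x, gamma, beta):
        # AdaLIN immediately after each convolution, before the activation
        h = self.relu(adalin(self.conv1(x), self.rho1, gamma, beta))
        h = adalin(self.conv2(h), self.rho2, gamma, beta)
        return x + h

C = 16
block = AdaResBlock(C)
x = torch.randn(2, C, 8, 8)
gamma, beta = torch.ones(2, C), torch.zeros(2, C)
y = block(x, gamma, beta)
assert y.shape == x.shape
```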
5. Empirical Results and Ablation
Ablation studies on the selfie2anime dataset demonstrate the efficacy of AdaLIN. Measured by Kernel Inception Distance (KID; lower is better):
| Model | KID (selfie2anime) |
|---|---|
| U-GAT-IT (AdaLIN) | |
| Only IN | |
| Only LN | |
| AdaIN | |
| GN | |
AdaLIN achieves the lowest KID, indicating superior preservation of identity while enabling appropriate stylization and shape transformation. Qualitative observations report that "only-IN" configurations retain facial accessories but provide insufficient stylization, while "only-LN" results give strong stylization at the cost of identity preservation. In practice, AdaLIN learns $\rho$ close to $1$ in residual blocks (favoring IN, preserving content) and close to $0$ in up-sampling blocks (favoring LN, imposing style/shape).
6. Algorithmic Implementation
A PyTorch-style implementation sketch for AdaLIN, operating on a single-sample input $x \in \mathbb{R}^{C \times H \times W}$, is as follows:

```python
import torch

def AdaLIN(x, rho, gamma, beta, eps=1e-5):
    # x: [C, H, W] activation for a single sample
    C, H, W = x.shape
    # Instance Norm statistics: per channel, over spatial positions
    mu_I = x.view(C, -1).mean(dim=1).view(C, 1, 1)
    var_I = x.view(C, -1).var(dim=1, unbiased=False).view(C, 1, 1)
    sigma_I = torch.sqrt(var_I + eps)
    # Layer Norm statistics: over all channels and spatial positions
    mu_L = x.mean()
    var_L = x.var(unbiased=False)
    sigma_L = torch.sqrt(var_L + eps)
    x_I = (x - mu_I) / sigma_I
    x_L = (x - mu_L) / sigma_L
    # gate rho blends the two; gamma/beta apply a per-channel affine transform
    out = gamma.view(C, 1, 1) * (rho * x_I + (1 - rho) * x_L) + beta.view(C, 1, 1)
    return out
```
In deployed models, $\rho$ is updated by backpropagation and clamped to $[0, 1]$ after each training step, while $(\gamma, \beta)$ are produced by the attention-driven MLP.
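This update-then-clip schedule for $\rho$ can be sketched as an ordinary training loop; the quadratic loss below is an arbitrary stand-in, and the pre-normalized tensors are random placeholders:

```python
import torch

# rho participates in backpropagation like any other parameter ...
torch.manual_seed(0)
rho = torch.nn.Parameter(torch.tensor(0.9))
x_I = torch.randn(4, 8, 8)   # stand-ins for IN- and LN-normalized activations
x_L = torch.randn(4, 8, 8)
opt = torch.optim.Adam([rho], lr=0.1)

for _ in range(5):
    out = rho * x_I + (1 - rho) * x_L
    loss = out.pow(2).mean()   # placeholder objective
    opt.zero_grad()
    loss.backward()
    opt.step()
    # ... and is then projected back onto [0, 1] after each step
    with torch.no_grad():
        rho.clamp_(0.0, 1.0)

assert 0.0 <= rho.item() <= 1.0
```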
7. Significance for Image-to-Image Translation
AdaLIN provides a principled, data-driven method for controlling the balance between content preservation and stylization in image translation. By dynamically blending IN and LN through , AdaLIN enables a fixed-architecture generator to handle diverse translation regimes—from fine texture transfer (e.g., photo to painting) to substantial geometric changes (e.g., selfie to anime)—without manual tuning or architectural modifications. Its effectiveness is empirically demonstrated via improved quantitative metrics and qualitative results on tasks requiring both local and global style adaptation (Kim et al., 2019).