
Adaptive Layer-Instance Normalization (AdaLIN)

Updated 11 February 2026
  • Adaptive Layer-Instance Normalization (AdaLIN) is a technique that combines instance and layer normalization using a learned per-layer gate to balance local details and global style.
  • It adaptively interpolates between instance and layer statistics via the scalar gate ρ, enabling each decoder layer to tailor normalization strategies based on attention-driven signals.
  • Empirical results on tasks like selfie-to-anime translation show that AdaLIN improves image quality by preserving identity and enhancing stylization, as indicated by lower KID scores.

Adaptive Layer-Instance Normalization (AdaLIN) is a normalization technique introduced in U-GAT-IT for unsupervised image-to-image translation, designed to interpolate adaptively between Instance Normalization (IN) and Layer Normalization (LN) using a learned, per-layer scalar gate. This approach equips generative models with the flexibility to control the trade-off between local texture (IN) and global shape/style (LN) transformations based on the learning dynamics and specific requirements of image translation tasks (Kim et al., 2019).

1. Mathematical Definition

Given an activation tensor $a \in \mathbb{R}^{C \times H \times W}$ for a single sample, AdaLIN first computes both IN-style and LN-style statistics:

  • Instance Normalization (IN):
    • Channel-wise mean: $\mu_I^c = \frac{1}{HW} \sum_{i=1}^H \sum_{j=1}^W a_{cij}$
    • Channel-wise std: $\sigma_I^c = \sqrt{\frac{1}{HW} \sum_{i,j} (a_{cij} - \mu_I^c)^2 + \epsilon}$
  • Layer Normalization (LN):
    • Layer mean: $\mu_L = \frac{1}{CHW} \sum_{c,i,j} a_{cij}$
    • Layer std: $\sigma_L = \sqrt{\frac{1}{CHW} \sum_{c,i,j} (a_{cij} - \mu_L)^2 + \epsilon}$

Normalized activations:

  • IN-style: $\hat a_I = (a - \mu_I) / \sigma_I$
  • LN-style: $\hat a_L = (a - \mu_L) / \sigma_L$

The AdaLIN output is given by

$$\mathrm{AdaLIN}(a;\gamma,\beta,\rho) = \gamma \cdot \left[\rho \cdot \hat{a}_I + (1-\rho) \cdot \hat{a}_L\right] + \beta$$

where $\gamma, \beta \in \mathbb{R}^C$ are per-channel scale and shift parameters, and $\rho \in [0,1]$ is a learned layer-wise gate. $\rho$ is updated by backpropagation like any other parameter and clamped to $[0,1]$ after each gradient step.
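To make the definition concrete, the sketch below evaluates these equations directly. It uses NumPy for clarity; the function name, shapes, and test values are illustrative and not taken from the U-GAT-IT codebase. With $\gamma = 1$, $\beta = 0$, setting $\rho = 1$ recovers plain IN and $\rho = 0$ recovers plain LN.

```python
import numpy as np

def adalin(a, rho, gamma, beta, eps=1e-5):
    """AdaLIN for a single sample a of shape (C, H, W), per the equations above."""
    # Instance statistics: per-channel mean/std over spatial dimensions
    mu_I = a.mean(axis=(1, 2), keepdims=True)
    sigma_I = np.sqrt(a.var(axis=(1, 2), keepdims=True) + eps)
    # Layer statistics: mean/std over all channels and spatial dimensions
    mu_L = a.mean()
    sigma_L = np.sqrt(a.var() + eps)
    a_I = (a - mu_I) / sigma_I
    a_L = (a - mu_L) / sigma_L
    # Gate rho blends the two normalized tensors; gamma/beta are per-channel
    return gamma[:, None, None] * (rho * a_I + (1 - rho) * a_L) + beta[:, None, None]

rng = np.random.default_rng(0)
a = rng.normal(size=(8, 4, 4))
gamma, beta = np.ones(8), np.zeros(8)
out_in = adalin(a, rho=1.0, gamma=gamma, beta=beta)  # pure IN behaviour
out_ln = adalin(a, rho=0.0, gamma=gamma, beta=beta)  # pure LN behaviour
```

With identity affine parameters, the IN endpoint yields zero mean per channel and the LN endpoint yields zero mean over the whole tensor, matching the definitions above.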

2. Learnable Parameters and Their Roles

AdaLIN introduces three principal sets of learnable parameters per layer:

  • Gate parameter ($\rho$): one scalar per AdaLIN layer, initialized to 1 (favoring IN) in decoder residual blocks and 0 (favoring LN) in up-sampling blocks. It is updated via gradient descent and clamped to $[0,1]$ after each update.
  • Affine transform ($\gamma, \beta$): one scaling ($\gamma_c$) and one shifting ($\beta_c$) parameter per channel. In U-GAT-IT, these are computed dynamically from attention embeddings rather than stored as static learnable parameters.
  • Attention-driven MLP: three fully connected layers with hidden size 256 and ReLU activations map globally pooled attention features to ($\gamma, \beta$). These MLPs are trained end-to-end and are initialized from $\mathcal{N}(0, 0.02)$.

This arrangement allows every decoder layer to adaptively select its own normalization mode and affine transformation based on learned attention signals.
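A rough sketch of the attention-to-affine mapping is given below. The layer sizes and $\mathcal{N}(0, 0.02)$ initialization follow the text, but the weight shapes, function names, and the use of a single MLP producing both $\gamma$ and $\beta$ from one pooled feature are simplifying assumptions, not the exact U-GAT-IT wiring.

```python
import numpy as np

rng = np.random.default_rng(0)
C, hidden = 64, 256  # illustrative channel count; hidden size 256 as in the text

def init(shape):
    # U-GAT-IT-style initialization: weights drawn from N(0, 0.02)
    return rng.normal(0.0, 0.02, size=shape)

# Three fully connected layers mapping a globally pooled attention
# feature to per-channel (gamma, beta); all names are illustrative.
W1, b1 = init((C, hidden)), np.zeros(hidden)
W2, b2 = init((hidden, hidden)), np.zeros(hidden)
W3, b3 = init((hidden, 2 * C)), np.zeros(2 * C)

def gamma_beta(pooled):
    """pooled: (C,) globally pooled attention feature -> (gamma, beta)."""
    h = np.maximum(pooled @ W1 + b1, 0.0)  # ReLU
    h = np.maximum(h @ W2 + b2, 0.0)       # ReLU
    out = h @ W3 + b3
    return out[:C], out[C:]                # split into gamma and beta
```

Because $\gamma$ and $\beta$ come out of this MLP at every forward pass, they vary with the input image rather than being fixed per-layer constants.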

3. Relation to Other Normalization Techniques

AdaLIN generalizes and interpolates between different normalization approaches:

| Normalization | Statistic Scope | Typical Effects | Limitation | AdaLIN Behavior |
|---|---|---|---|---|
| Instance Norm (IN) | Per channel, over spatial dims | Style removal, local consistency | May lose global (shape) structure | $\rho = 1.0$ |
| Layer Norm (LN) | All channels & spatial dims | Preserves global structure | May oversmooth, weaken local details | $\rho = 0.0$ |
| AdaIN | External style statistics modulate activations | Style transfer | Interpolates only style, not normalization scope | - |
| BIN | Batch/instance interpolation | Batch + instance blend | Fixed blend of BN & IN, not adaptive per sample/layer | - |

In contrast to AdaIN, which interpolates style via external statistics, and BIN, which uses a fixed parameter between BN and IN, AdaLIN adaptively learns $\rho$ per layer to select the appropriate balance point between content and style preservation for each decoder layer (Kim et al., 2019).

4. Integration and Workflow within U-GAT-IT

AdaLIN is integrated exclusively in the decoder of each generator in U-GAT-IT. Its placement and usage are as follows:

  • Each decoder residual block contains an AdaLIN layer immediately after its $3 \times 3$ convolution ("AdaResBlock").
  • In decoder up-sampling convolutions, AdaLIN is applied before the activation function.
  • Encoders use standard Instance Normalization to assist the auxiliary classifier.
  • No AdaLIN is employed in discriminators, which instead utilize spectral normalization and a class activation map (CAM)-based attention mechanism.

Key hyperparameters include batch size $= 1$, learning rate $= 10^{-4}$ (constant for 500k iterations, then linearly decayed to zero by 1M iterations), $\gamma, \beta$ MLP hidden dimension $= 256$, Adam optimizer settings $(\beta_1 = 0.5, \beta_2 = 0.999)$, and weight decay $= 10^{-4}$.
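The learning-rate schedule described above can be sketched in a few lines of plain Python; the function name and signature are illustrative, not from the U-GAT-IT codebase.

```python
def lr_schedule(step, base_lr=1e-4, hold=500_000, total=1_000_000):
    """Constant learning rate for the first 500k steps,
    then linear decay to zero by 1M steps."""
    if step <= hold:
        return base_lr
    return base_lr * max(0.0, (total - step) / (total - hold))
```

For example, halfway through the decay phase (step 750k) this returns half the base rate, reaching exactly zero at step 1M.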

5. Empirical Results and Ablation

Ablation studies on the selfie2anime dataset demonstrate the efficacy of AdaLIN, measured by Kernel Inception Distance (KID $\times 100$; lower is better):

| Model | KID (selfie2anime) ↓ |
|---|---|
| U-GAT-IT (AdaLIN) | $11.61 \pm 0.57$ |
| Only IN | $13.64 \pm 0.76$ |
| Only LN | $12.39 \pm 0.61$ |
| AdaIN | $12.29 \pm 0.78$ |
| GN | $12.76 \pm 0.64$ |

AdaLIN achieves the lowest KID, indicating superior preservation of identity while enabling appropriate stylization and shape transformation. Qualitative observations report that "only-IN" configurations retain facial accessories but provide insufficient stylization, while "only-LN" results give strong stylization at the cost of identity preservation. In practice, AdaLIN learns $\rho \to 1$ in residual blocks (favoring IN, preserving content) and $\rho \to 0$ in up-sampling blocks (favoring LN, imposing style/shape).

6. Algorithmic Implementation

A PyTorch-style implementation sketch for AdaLIN, operating on a single-sample input $x \in \mathbb{R}^{C \times H \times W}$, is as follows:

import torch

def AdaLIN(x, rho, gamma, beta, eps=1e-5):
    # x: [C, H, W] -- a single sample
    C, H, W = x.shape
    # Instance statistics: per-channel mean/std over spatial dimensions
    mu_I = x.view(C, -1).mean(dim=1).view(C, 1, 1)
    var_I = x.view(C, -1).var(dim=1, unbiased=False).view(C, 1, 1)
    sigma_I = torch.sqrt(var_I + eps)
    # Layer statistics: mean/std over all channels and spatial dimensions
    mu_L = x.mean()
    var_L = x.var(unbiased=False)
    sigma_L = torch.sqrt(var_L + eps)
    x_I = (x - mu_I) / sigma_I
    x_L = (x - mu_L) / sigma_L
    # Gate-blend the two normalized tensors, then apply the per-channel affine
    out = gamma.view(C, 1, 1) * (rho * x_I + (1 - rho) * x_L) + beta.view(C, 1, 1)
    return out

In deployed models, $\rho$ is updated by backpropagation and clamped to $[0,1]$ after each training step, while $\gamma, \beta$ are produced by the attention-driven MLP.
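The update-then-clamp pattern for $\rho$ can be sketched in a few lines of plain Python (the learning rate and gradient values here are made up purely for illustration; real training would use an optimizer step followed by something like `torch.clamp`):

```python
def clamp_rho(rho):
    """Project the gate back into [0, 1] after a gradient update."""
    return min(1.0, max(0.0, rho))

# Hypothetical training steps illustrating update-then-clamp:
rho, lr = 0.9, 0.5
for grad in [0.4, 0.4, -2.0]:          # made-up gradients for illustration
    rho = clamp_rho(rho - lr * grad)   # gradient step, then projection onto [0, 1]
```

The projection guarantees the blend in the forward pass always remains a convex combination of the IN-style and LN-style activations.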

7. Significance for Image-to-Image Translation

AdaLIN provides a principled, data-driven method for controlling the balance between content preservation and stylization in image translation. By dynamically blending IN and LN through $\rho$, AdaLIN enables a fixed-architecture generator to handle diverse translation regimes—from fine texture transfer (e.g., photo to painting) to substantial geometric changes (e.g., selfie to anime)—without manual tuning or architectural modifications. Its effectiveness is empirically demonstrated via improved quantitative metrics and qualitative results on tasks requiring both local and global style adaptation (Kim et al., 2019).

References

  • Kim, J., Kim, M., Kang, H., & Lee, K. (2019). U-GAT-IT: Unsupervised Generative Attentional Networks with Adaptive Layer-Instance Normalization for Image-to-Image Translation. arXiv:1907.10830.
