Adaptive Layer-Instance Normalization (AdaLIN)
- Adaptive Layer-Instance Normalization (AdaLIN) is a technique that combines instance and layer normalization using a learned per-layer gate to balance local details and global style.
- It adaptively interpolates between instance and layer statistics via the scalar gate ρ, enabling each decoder layer to tailor normalization strategies based on attention-driven signals.
- Empirical results on tasks like selfie-to-anime translation show that AdaLIN improves image quality by preserving identity and enhancing stylization, as indicated by lower KID scores.
Adaptive Layer-Instance Normalization (AdaLIN) is a normalization technique introduced in U-GAT-IT for unsupervised image-to-image translation, designed to interpolate adaptively between Instance Normalization (IN) and Layer Normalization (LN) using a learned, per-layer scalar gate. This approach equips generative models with the flexibility to control the trade-off between local texture (IN) and global shape/style (LN) transformations based on the learning dynamics and specific requirements of image translation tasks (Kim et al., 2019).
1. Mathematical Definition
Given an activation tensor $a \in \mathbb{R}^{C \times H \times W}$ for a single sample, AdaLIN first computes both IN-style and LN-style statistics:
- Instance Normalization (IN):
  - Channel-wise mean: $\mu_I^c = \frac{1}{HW} \sum_{h,w} a_{c,h,w}$
  - Channel-wise std: $\sigma_I^c = \sqrt{\frac{1}{HW} \sum_{h,w} \left(a_{c,h,w} - \mu_I^c\right)^2}$
- Layer Normalization (LN):
  - Layer mean: $\mu_L = \frac{1}{CHW} \sum_{c,h,w} a_{c,h,w}$
  - Layer std: $\sigma_L = \sqrt{\frac{1}{CHW} \sum_{c,h,w} \left(a_{c,h,w} - \mu_L\right)^2}$

Normalized activations:
- IN-style: $\hat{a}_I = \dfrac{a - \mu_I}{\sqrt{\sigma_I^2 + \epsilon}}$
- LN-style: $\hat{a}_L = \dfrac{a - \mu_L}{\sqrt{\sigma_L^2 + \epsilon}}$

The AdaLIN output is given by

$$\mathrm{AdaLIN}(a;\, \gamma, \beta) = \gamma \odot \left( \rho\, \hat{a}_I + (1 - \rho)\, \hat{a}_L \right) + \beta,$$

where $\gamma, \beta \in \mathbb{R}^C$ are per-channel scale and shift, and $\rho \in [0, 1]$ is a learned layerwise gate. After each gradient step, $\rho$ is updated via backpropagation and clipped to the interval $[0, 1]$.
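As a sanity check, the gate reduces the formula to pure IN at $\rho = 1$ and pure LN at $\rho = 0$. A minimal PyTorch sketch of this behavior (illustrative only, using an identity affine transform $\gamma = 1$, $\beta = 0$ to isolate the gate):

```python
import torch

torch.manual_seed(0)
eps = 1e-5
a = torch.randn(8, 16, 16)   # [C, H, W] activation for one sample
C = a.shape[0]

# IN statistics: per channel, over spatial positions
mu_I = a.view(C, -1).mean(dim=1).view(C, 1, 1)
var_I = a.view(C, -1).var(dim=1, unbiased=False).view(C, 1, 1)
a_I = (a - mu_I) / torch.sqrt(var_I + eps)

# LN statistics: over all channels and spatial positions
mu_L, var_L = a.mean(), a.var(unbiased=False)
a_L = (a - mu_L) / torch.sqrt(var_L + eps)

def blend(rho):
    # AdaLIN mixing term with gamma = 1, beta = 0
    return rho * a_I + (1 - rho) * a_L

assert torch.allclose(blend(1.0), a_I)   # rho = 1 recovers pure IN
assert torch.allclose(blend(0.0), a_L)   # rho = 0 recovers pure LN
```

Intermediate values of $\rho$ interpolate linearly between the two normalized views of the same activation.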
2. Learnable Parameters and Their Roles
AdaLIN introduces three principal sets of learnable parameters per layer:
- Gate parameter ($\rho$): One scalar per AdaLIN layer, initialized to $1$ (favoring IN) in decoder residual blocks and $0$ (favoring LN) in up-sampling blocks. It is updated via gradient descent and clamped to $[0, 1]$ post-update.
- Affine transform ($\gamma$, $\beta$): One scaling ($\gamma$) and one shifting ($\beta$) parameter per channel. In U-GAT-IT, these are dynamically computed from attention embeddings rather than being static learnable parameters.
- Attention-driven MLP: Three fully connected layers with hidden size 256 and ReLU activations map global-pooled attention features to $(\gamma, \beta)$. These MLPs are trained end-to-end with the rest of the generator.
This arrangement allows every decoder layer to adaptively select its own normalization mode and affine transformation based on learned attention signals.
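A hedged sketch of how these three parameter sets might be wired together in PyTorch; the pooled-feature dimension `feat_dim` and all variable names are illustrative assumptions, not the official implementation:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
C = 32          # channels in the AdaLIN layer
feat_dim = 64   # dimension of the pooled attention feature (an assumption)

# (1) Gate: one scalar per AdaLIN layer, initialized to 1.0 in residual
#     blocks (favoring IN) or 0.0 in up-sampling blocks (favoring LN)
rho = nn.Parameter(torch.ones(1))

# (2, 3) gamma/beta are produced by a small MLP over pooled attention
#        features: three FC layers, hidden size 256, ReLU activations
mlp = nn.Sequential(
    nn.Linear(feat_dim, 256), nn.ReLU(True),
    nn.Linear(256, 256), nn.ReLU(True),
    nn.Linear(256, 2 * C),
)

pooled = torch.randn(4, feat_dim)           # stand-in for pooled attention features
gamma, beta = mlp(pooled).chunk(2, dim=1)   # each has shape [4, C]

# after each optimizer step, rho is clipped back into [0, 1]
with torch.no_grad():
    rho.clamp_(0.0, 1.0)

assert gamma.shape == (4, C) and beta.shape == (4, C)
assert 0.0 <= rho.item() <= 1.0
```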
3. Relation to Other Normalization Techniques
AdaLIN generalizes and interpolates between different normalization approaches:
| Normalization | Statistic Scope | Typical Effects | Limitation | AdaLIN Behavior |
|---|---|---|---|---|
| Instance Norm (IN) | Channel over spatial | Style removal, local consistency | May lose global (shape) structure | Recovered at $\rho = 1$ |
| Layer Norm (LN) | All channels & spatial | Preserves global structure | May oversmooth, weaken local details | Recovered at $\rho = 0$ |
| AdaIN | External style statistics as affine parameters | Style transfer | Interpolates only style, not normalization | $\rho = 1$ with style-derived $\gamma, \beta$ |
| BIN | Batch/instance interpolation | Batch + instance, fixed blend | Blends BN & IN, not adaptive per sample/layer | - |
In contrast to AdaIN, which modulates style via external instance statistics, and BIN, which blends BN and IN rather than IN and LN, AdaLIN learns a per-layer gate $\rho$ that selects the appropriate balance point between content and style preservation for each decoder layer (Kim et al., 2019).
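To make the relation to AdaIN concrete: fixing $\rho = 1$ and supplying the style image's per-channel statistics as $(\gamma, \beta)$ recovers AdaIN exactly. A small illustrative check (tensor shapes are arbitrary):

```python
import torch

torch.manual_seed(0)
eps = 1e-5
content = torch.randn(16, 8, 8)   # [C, H, W]
style = torch.randn(16, 8, 8)
C = content.shape[0]

def in_stats(t):
    # per-channel mean and std over spatial positions
    mu = t.view(C, -1).mean(dim=1).view(C, 1, 1)
    sigma = torch.sqrt(t.view(C, -1).var(dim=1, unbiased=False).view(C, 1, 1) + eps)
    return mu, sigma

mu_c, sigma_c = in_stats(content)
mu_s, sigma_s = in_stats(style)

# AdaIN: align the content's channel statistics with the style's
adain = sigma_s * (content - mu_c) / sigma_c + mu_s

# AdaLIN with rho = 1 (pure IN branch), gamma = sigma_s, beta = mu_s
x_I = (content - mu_c) / sigma_c
mu_L, var_L = content.mean(), content.var(unbiased=False)
x_L = (content - mu_L) / torch.sqrt(var_L + eps)
rho = 1.0
adalin = sigma_s * (rho * x_I + (1 - rho) * x_L) + mu_s

assert torch.allclose(adain, adalin, atol=1e-6)
```

BIN admits no such reduction, since AdaLIN never computes batch statistics.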
4. Integration and Workflow within U-GAT-IT
AdaLIN is integrated exclusively in the decoder of each generator in U-GAT-IT. Its placement and usage are as follows:
- Each decoder residual block contains an AdaLIN layer immediately after its convolution ("AdaResBlock").
- In decoder up-sampling convolutions, AdaLIN is applied before the activation function.
- Encoders use standard Instance Normalization to assist the auxiliary classifier.
- No AdaLIN is employed in discriminators, which instead utilize spectral normalization and a class activation map (CAM)-based attention mechanism.
Key hyperparameters include batch size $1$, learning rate $10^{-4}$ (constant for 500k iterations, linearly decayed to zero by 1M iterations), MLP hidden dimension $256$, Adam optimizer settings $(\beta_1, \beta_2) = (0.5, 0.999)$, and weight decay $10^{-4}$.
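Under the assumptions above, a decoder residual block with AdaLIN ("AdaResBlock") might look as follows; the exact block structure is inferred from the description, not copied from the official code, and $\gamma, \beta$ are passed in from the attention-driven MLP:

```python
import torch
import torch.nn as nn

def adalin(x, rho, gamma, beta, eps=1e-5):
    # x: [N, C, H, W]; rho: scalar tensor in [0, 1]; gamma, beta: [N, C]
    mu_I = x.mean(dim=(2, 3), keepdim=True)
    var_I = x.var(dim=(2, 3), keepdim=True, unbiased=False)
    mu_L = x.mean(dim=(1, 2, 3), keepdim=True)
    var_L = x.var(dim=(1, 2, 3), keepdim=True, unbiased=False)
    x_I = (x - mu_I) / torch.sqrt(var_I + eps)
    x_L = (x - mu_L) / torch.sqrt(var_L + eps)
    mixed = rho * x_I + (1 - rho) * x_L
    return gamma[..., None, None] * mixed + beta[..., None, None]

class AdaResBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU(True)
        # rho initialized to 1 in residual blocks (favoring IN)
        self.rho1 = nn.Parameter(torch.ones(1))
        self.rho2 = nn.Parameter(torch.ones(1))

    def forward(self, x, gamma, beta):
        # AdaLIN immediately after each convolution, before the activation
        h = self.relu(adalin(self.conv1(x), self.rho1, gamma, beta))
        h = adalin(self.conv2(h), self.rho2, gamma, beta)
        return x + h

C = 16
block = AdaResBlock(C)
x = torch.randn(2, C, 8, 8)
gamma, beta = torch.ones(2, C), torch.zeros(2, C)
y = block(x, gamma, beta)
assert y.shape == x.shape
```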
5. Empirical Results and Ablation
Ablation studies on the selfie2anime dataset demonstrate the efficacy of AdaLIN. Measured by Kernel Inception Distance (KID; lower is better):
| Model | KID (selfie2anime) |
|---|---|
| U-GAT-IT (AdaLIN) | |
| Only IN | |
| Only LN | |
| AdaIN | |
| GN | |
AdaLIN achieves the lowest KID, indicating superior preservation of identity while enabling appropriate stylization and shape transformation. Qualitative observations report that "only-IN" configurations retain facial accessories but provide insufficient stylization, while "only-LN" results give strong stylization at the cost of identity preservation. In practice, AdaLIN learns $\rho$ close to $1$ in residual blocks (favoring IN, preserving content) and close to $0$ in up-sampling blocks (favoring LN, imposing style/shape).
6. Algorithmic Implementation
A PyTorch-style implementation sketch for AdaLIN, operating on a single-sample input $x \in \mathbb{R}^{C \times H \times W}$, is as follows:

```python
import torch

def AdaLIN(x, rho, gamma, beta, eps=1e-5):
    # x: [C, H, W] activation for a single sample
    C, H, W = x.shape
    # Instance Norm statistics: per channel, over spatial positions
    mu_I = x.view(C, -1).mean(dim=1).view(C, 1, 1)
    var_I = x.view(C, -1).var(dim=1, unbiased=False).view(C, 1, 1)
    sigma_I = torch.sqrt(var_I + eps)
    # Layer Norm statistics: over all channels and spatial positions
    mu_L = x.mean()
    var_L = x.var(unbiased=False)
    sigma_L = torch.sqrt(var_L + eps)
    x_I = (x - mu_I) / sigma_I
    x_L = (x - mu_L) / sigma_L
    # gate rho blends the two; gamma/beta apply a per-channel affine transform
    out = gamma.view(C, 1, 1) * (rho * x_I + (1 - rho) * x_L) + beta.view(C, 1, 1)
    return out
```
In deployed models, $\rho$ is updated by backpropagation and clamped to $[0, 1]$ after each training step, while $(\gamma, \beta)$ are produced by the attention-driven MLP.
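This update-then-clip schedule for $\rho$ can be sketched as an ordinary training loop; the quadratic loss below is an arbitrary stand-in, and the pre-normalized tensors are random placeholders:

```python
import torch

# rho participates in backpropagation like any other parameter ...
torch.manual_seed(0)
rho = torch.nn.Parameter(torch.tensor(0.9))
x_I = torch.randn(4, 8, 8)   # stand-ins for IN- and LN-normalized activations
x_L = torch.randn(4, 8, 8)
opt = torch.optim.Adam([rho], lr=0.1)

for _ in range(5):
    out = rho * x_I + (1 - rho) * x_L
    loss = out.pow(2).mean()   # placeholder objective
    opt.zero_grad()
    loss.backward()
    opt.step()
    # ... and is then projected back onto [0, 1] after each step
    with torch.no_grad():
        rho.clamp_(0.0, 1.0)

assert 0.0 <= rho.item() <= 1.0
```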
7. Significance for Image-to-Image Translation
AdaLIN provides a principled, data-driven method for controlling the balance between content preservation and stylization in image translation. By dynamically blending IN and LN through , AdaLIN enables a fixed-architecture generator to handle diverse translation regimes—from fine texture transfer (e.g., photo to painting) to substantial geometric changes (e.g., selfie to anime)—without manual tuning or architectural modifications. Its effectiveness is empirically demonstrated via improved quantitative metrics and qualitative results on tasks requiring both local and global style adaptation (Kim et al., 2019).