Style-Based Global Appearance Flow

Updated 1 March 2026

The paper introduces a unified framework that leverages a global style vector combined with local refinement to generate dense appearance flows for realistic virtual try-on and tone mapping.
It employs modulated convolutions and conditional invertible neural networks to handle large spatial misalignments, occlusions, and stylistic variations.
Empirical results show significant improvements in SSIM, FID, and PSNR, underlining the method's robustness and superior performance over local-only approaches.

Style-Based Global Appearance Flow constitutes a class of dense image-to-image transformation methods that incorporate global image "style" as an explicit conditioning mechanism to guide appearance flow prediction. These methods are designed to address tasks where local correspondence estimation alone is inadequate due to large spatial misalignments, occlusions, or image-wide stylistic variations. The style-based global appearance flow paradigm has been instantiated in domains such as virtual try-on and global color or tone mapping, employing architectures such as modulated convolution networks and conditional normalizing flows to encode image-wide context and distill interpretable style representations (He et al., 2022, Mustafa et al., 2022).

1. Problem Formulation and Limitations of Local Flow

Style-based global appearance flow methods aim to produce a dense mapping—"appearance flow"—from a source to a target image while accounting for global contextual and stylistic factors. In the context of virtual try-on, this problem can be expressed as taking a person image $p \in \mathbb{R}^{3 \times H \times W}$ and an in-shop garment image $g \in \mathbb{R}^{3 \times H \times W}$ , and generating a new image $t = \mathcal{F}(p, g)$ in which the person now realistically wears $g$ .

The core challenge is to learn a dense flow field

$F: (u, v) \mapsto (u', v')$

indicating, for every pixel location $(u,v)$ in the output, from which coordinates $(u', v')$ in the source to sample. Traditional local appearance flow estimation methods, such as those based on correlating small feature patches [e.g., ClothFlow, PF-AFN], assume spatial locality and pre-alignment between source and target. These assumptions fail in scenarios with large pose variance, occlusion, or significant misalignment (e.g., full-body person images vs. cropped garment images). The same limitations apply to classical color tonemapping approaches that ignore image-wide stylistic variation (He et al., 2022, Mustafa et al., 2022).

2. Global Style Conditioning and Model Architectures

The key innovation in style-based global appearance flow is the introduction of a global style vector that encodes holistic features of both source and target images. This global style is then used to modulate the parameters of the flow prediction module, enabling image-wide consistency and robustness to occlusions and large transformations.

Virtual Try-On Case: StyleGAN-based Modulated Flow

In "Style-Based Global Appearance Flow for Virtual Try-On" (He et al., 2022), the architecture comprises:

Dual encoders for person and garment images, producing multi-scale features $\{p_i\}_{i=1}^N$ and $\{g_i\}_{i=1}^N$ .
Global style vector $s \in \mathbb{R}^{2c}$, obtained by concatenating the outputs of fully connected layers applied to the final-scale person and garment features, with $c=256$ in experiments.
Warping module: a stack of $N$ blocks, each consisting of:
- Style-modulated convolution ("ConvMod") to predict a coarse global flow, conditioned on $s$ .
- Local refinement via convolution over concatenation of warped garment and person features, producing a flow correction.
- Final flow at each resolution is the sum of coarse and refined flows.
Output generator: U-Net combining the warped garment and person image features to produce the synthetic try-on image.

A single style vector conditions all convolutional weights, affording a fully-global receptive field and enabling alignment even under extreme spatial misalignment.
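The modulation step can be illustrated with a minimal NumPy sketch. It assumes the StyleGAN2-style modulate/demodulate recipe and, for brevity, a 1x1 kernel; the function name `modulated_conv1x1`, the affine mapping from $s$ to per-channel scales, and all shapes and random values are hypothetical and are not taken from He et al. (2022).

```python
import numpy as np

def modulated_conv1x1(x, weight, style, affine_w, affine_b, eps=1e-8):
    """Style-modulated 1x1 convolution (illustrative sketch, not the paper's code).

    x:        (C_in, H, W) input features
    weight:   (C_out, C_in) convolution weights shared across all inputs
    style:    (D,) global style vector s
    affine_w: (C_in, D) and affine_b: (C_in,) map s to one scale per input channel
    """
    scale = affine_w @ style + affine_b                 # (C_in,) per-channel scales
    w_mod = weight * scale[None, :]                     # modulate: rescale input channels
    norm = np.sqrt((w_mod ** 2).sum(axis=1, keepdims=True) + eps)
    w_mod = w_mod / norm                                # demodulate: unit norm per output channel
    return np.einsum("oc,chw->ohw", w_mod, x)           # 1x1 convolution

# Toy usage: every shape and value below is a stand-in.
rng = np.random.default_rng(0)
c_in, d, height, width = 8, 16, 4, 4
feats = rng.normal(size=(c_in, height, width))   # stand-in for fused person/garment features
style = rng.normal(size=d)                       # stand-in for the global style vector s
weight = rng.normal(size=(2, c_in))              # 2 output channels: the (u', v') flow
affine_w = rng.normal(size=(c_in, d))
affine_b = np.ones(c_in)

coarse_flow = modulated_conv1x1(feats, weight, style, affine_w, affine_b)
refinement = 0.1 * rng.normal(size=coarse_flow.shape)  # placeholder for the local refinement output
flow = coarse_flow + refinement                        # final flow = coarse + refinement
```

Because the style rescales the weights rather than the activations, the same style vector affects every spatial position identically, which is what gives the coarse flow its image-wide character in this sketch.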

Global Style in Color/Tone Mapping

"Distilling Style from Image Pairs for Global Forward and Inverse Tone Mapping" (Mustafa et al., 2022) generalizes this paradigm to color mapping. The method models a global, spatially-invariant color mapping:

$\mathcal{M}: \mathbb{R}^3 \times \mathbb{R}^d \rightarrow \mathbb{R}^3$

where a $d$-dimensional style code $\mathbf{z}$ controls the per-image tone or color style. The conditional invertible neural network (INN) $g_\theta$ is trained to map source pixels and the style code to target pixel values, with conditioning applied on a polynomial expansion of the pixel RGB values (degree 4, 34 components). The framework enables both forward tone mapping (e.g., RAW→SDR) and inverse mapping, with editability in style space.
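The component count of the polynomial expansion can be checked with a short sketch. It assumes the expansion consists of all monomials in (r, g, b) of total degree 1 to 4 with the constant term excluded, an assumption that reproduces the stated 34 components (3 + 6 + 10 + 15); the function name `poly_expand` and the ordering of terms are hypothetical.

```python
from itertools import combinations_with_replacement

def poly_expand(rgb, degree=4):
    """Return all monomials of total degree 1..degree of an (r, g, b) triple."""
    feats = []
    for d in range(1, degree + 1):
        # Each multiset of d channel indices corresponds to one monomial of degree d.
        for idx in combinations_with_replacement(range(3), d):
            term = 1.0
            for k in idx:
                term *= rgb[k]
            feats.append(term)
    return feats

features = poly_expand((0.2, 0.5, 0.8))
print(len(features))  # 34
```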

3. Mathematical Details and Training Objectives

A unified mathematical formalism characterizes style-based appearance flow methods:

Appearance Flow Field

In virtual try-on:

$F: \Omega \subset \mathbb{R}^2 \to \mathbb{R}^2, \quad F(u, v) = (u', v')$

This field guides bilinear sampling:

$\hat{g}(u, v) = \sum_{i, j} g(i, j) \, \max(0, 1 - |u' - i|) \, \max(0, 1 - |v' - j|)$

where $(u', v') = F(u, v)$.
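The sampling formula above can be written as a minimal NumPy sketch. It assumes the flow stores absolute source coordinates in pixel units, with $u'$ indexing rows and $v'$ indexing columns, and that samples falling outside the image contribute zero; practical implementations often use normalized coordinates or offsets instead. The function name `bilinear_warp` is hypothetical.

```python
import numpy as np

def bilinear_warp(g, flow):
    """Warp image g of shape (C, H, W) with a dense flow of shape (2, H_out, W_out).

    flow[0] holds the source row coordinate u' and flow[1] the source column
    coordinate v' for every output pixel (absolute pixel units, an assumption).
    """
    C, H, W = g.shape
    up, vp = flow[0], flow[1]
    i0 = np.floor(up).astype(int)
    j0 = np.floor(vp).astype(int)
    out = np.zeros((C,) + up.shape, dtype=float)
    # Only the four integer neighbours of (u', v') carry non-zero weight.
    for di in (0, 1):
        for dj in (0, 1):
            i, j = i0 + di, j0 + dj
            w = np.maximum(0.0, 1 - np.abs(up - i)) * np.maximum(0.0, 1 - np.abs(vp - j))
            valid = (i >= 0) & (i < H) & (j >= 0) & (j < W)
            ic, jc = np.clip(i, 0, H - 1), np.clip(j, 0, W - 1)
            out += g[:, ic, jc] * (w * valid)
    return out

# Usage: an identity flow reproduces the input image.
g = np.arange(12.0).reshape(1, 3, 4)
rows, cols = np.meshgrid(np.arange(3), np.arange(4), indexing="ij")
identity = np.stack([rows, cols]).astype(float)
warped = bilinear_warp(g, identity)
```

Since every weight is a piecewise-linear function of $(u', v')$, the warped output is differentiable with respect to the flow almost everywhere, which is what allows the flow predictor to be trained end to end.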

Losses

Garment reconstruction: $\mathcal{L}_g = \lVert \hat{g} - M_g \odot p_{gt} \rVert_1$
Perceptual loss: $\mathcal{L}_p = \sum_i \lVert \phi_i(t) - \phi_i(p_{gt}) \rVert_1$
Smoothness regularizer: $\mathcal{L}_R = \sum_{i=1}^N \lVert \nabla \mathbf{f}_i \rVert_1$
Feature distillation: $\mathcal{L}_D = \sum_{i=1}^N \lVert p_i^{PB} - p_i \rVert_1$

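Here $p_{gt}$ denotes the ground-truth person image, $M_g$ the garment mask on it, $\phi_i$ the $i$-th feature map of a pretrained perceptual network, $\mathbf{f}_i$ the flow predicted at block $i$, and $p_i^{PB}$ the corresponding person feature from a parser-based teacher network.

The overall objective is a weighted sum of these terms: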

$\mathcal{L} = \lambda_p \mathcal{L}_p + \lambda_g \mathcal{L}_g + \lambda_R \mathcal{L}_R + \lambda_D \mathcal{L}_D$
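How the terms combine can be sketched in NumPy under simplifying assumptions: the smoothness term uses first-order finite differences, the perceptual and distillation terms are passed in as precomputed numbers because they need pretrained networks, and the weights are placeholders rather than the values used by He et al. (2022). All function names are hypothetical.

```python
import numpy as np

def l1(a, b):
    """L1 norm of the difference, as in the garment reconstruction term."""
    return np.abs(a - b).sum()

def flow_smoothness(flow):
    """L1 norm of first-order finite differences of a (2, H, W) flow field."""
    d_rows = np.abs(flow[:, 1:, :] - flow[:, :-1, :]).sum()
    d_cols = np.abs(flow[:, :, 1:] - flow[:, :, :-1]).sum()
    return d_rows + d_cols

def total_loss(terms, weights):
    """Weighted sum of named loss terms."""
    return sum(weights[name] * value for name, value in terms.items())

# Toy usage with random stand-ins for the warped garment and masked target.
rng = np.random.default_rng(0)
warped_garment = rng.uniform(size=(3, 8, 8))
masked_target = rng.uniform(size=(3, 8, 8))
flows = [rng.normal(size=(2, 8, 8)), rng.normal(size=(2, 4, 4))]

terms = {
    "g": l1(warped_garment, masked_target),
    "R": sum(flow_smoothness(f) for f in flows),
    "p": 0.0,  # perceptual term, omitted here (needs a pretrained network)
    "D": 0.0,  # distillation term, omitted here (needs a teacher network)
}
weights = {"p": 1.0, "g": 1.0, "R": 1.0, "D": 1.0}  # placeholder weights
loss = total_loss(terms, weights)
```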

Color/Tone Mapping Loss

Conditional INN models are trained with negative log-likelihood per pixel and a style-consistency reconstruction loss, e.g.,

$\mathcal{L} = \sum_{i, p} \left[ \mathcal{L}_{\mathrm{NLL}}(\mathbf{x}_p^i, \mathbf{c}_p^i) + \lambda \mathcal{L}_{\mathrm{rec}}^i(\mathbf{x}^i, \mathbf{y}^i) \right]$

where

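$\mathcal{L}_{\mathrm{rec}}^i = \frac{1}{K_i} \sum_{p=1}^{K_i} \lVert g_\theta(\mathbf{z}^i ; \mathcal{C}(\mathbf{y}_p^i)) - \mathbf{x}_p^i \rVert_2^2$

with $K_i$ the number of pixels taken from image pair $i$ and $\mathcal{C}(\cdot)$ the polynomial color expansion used for conditioning.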
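The interplay of likelihood training and invertibility can be illustrated with a deliberately small stand-in: a single conditional affine transform on one RGB pixel with a standard normal latent prior. This is far simpler than the INN of Mustafa et al. (2022); the linear conditioner, the `tanh` bound on the log-scale, and all names are assumptions made for the sketch.

```python
import numpy as np

def scale_shift(c, A, b):
    """Hypothetical linear conditioner: maps condition c to (log-scale, shift)."""
    h = A @ c + b                 # (6,)
    return np.tanh(h[:3]), h[3:]  # bounded log-scale, unconstrained shift

def to_latent(x, c, A, b):
    """Map a target pixel x to the latent z; also return log|det dz/dx|."""
    s, t = scale_shift(c, A, b)
    z = (x - t) * np.exp(-s)
    return z, -s.sum()

def from_latent(z, c, A, b):
    """Exact inverse of to_latent for the same condition c."""
    s, t = scale_shift(c, A, b)
    return z * np.exp(s) + t

def nll(x, c, A, b):
    """Negative log-likelihood of x under a standard normal latent prior."""
    z, log_det = to_latent(x, c, A, b)
    return 0.5 * (z ** 2).sum() + 0.5 * z.size * np.log(2 * np.pi) - log_det

# Toy usage: a random 5-dimensional condition stands in for C(y).
rng = np.random.default_rng(0)
A, b = 0.1 * rng.normal(size=(6, 5)), np.zeros(6)
c = rng.uniform(size=5)
x = rng.uniform(size=3)
z, _ = to_latent(x, c, A, b)
x_back = from_latent(z, c, A, b)   # recovers x up to floating-point error
```

In this toy setting the round trip through the latent is exact, which mirrors why one trained model can serve both forward and inverse mapping.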

4. Empirical Performance and Comparative Analysis

Empirical evaluation demonstrates state-of-the-art performance and improved robustness:

In virtual try-on, the combined global style modulation plus local refinement architecture achieves SSIM=0.91 and FID=8.89, outperforming local-only (SSIM=0.89, FID=10.73) and style-only (SSIM=0.89, FID=9.84) methods on augmented VITON benchmarks (He et al., 2022).
The model's performance under large input perturbations exhibits zero SSIM drop, whereas baseline methods degrade by 0.2–0.3 in SSIM and 2–4 FID points.
For global tone mapping, the conditional INN with style vector achieves PSNR ≈ 39.2 dB on FiveK Expert C (vs 22–26 dB for HDRNet, PCA, VAE), and up to 41 dB for 4-D style, with >50% reduction in FLIP and CIEDE2000 perceptual color-difference metrics (Mustafa et al., 2022).
Compression experiments show competitive performance for concurrent SDR+HDR coding, with the flow-based model achieving >30 dB PU-PSNR at 0.5 bpp.

Application	Metric	Style-based Flow Result	Comparison Baseline
Virtual Try-On	SSIM (higher is better)	0.91	0.89 (Local-only)
Virtual Try-On	FID (lower is better)	8.89	10.73 (Local-only)
Tone Mapping (FiveK)	PSNR (dB)	39.2–41.0	22–28 (HDRNet, VAE)
Tone Mapping (FiveK)	FLIP / CIEDE2000 reduction	>50%	-
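To put the PSNR figures in perspective, the following sketch computes PSNR and converts a PSNR value back to an RMS error. It assumes pixel values normalized to a peak of 1.0; the function names are hypothetical and the sketch is not the evaluation code of either paper.

```python
import numpy as np

def psnr(ref, test, peak=1.0):
    """Peak signal-to-noise ratio in dB between two arrays of equal shape."""
    mse = np.mean((np.asarray(ref, float) - np.asarray(test, float)) ** 2)
    if mse == 0:
        return float("inf")
    return 10 * np.log10(peak ** 2 / mse)

def rmse_from_psnr(db, peak=1.0):
    """RMS error implied by a PSNR value, in the same units as peak."""
    return peak * 10 ** (-db / 20)

high = rmse_from_psnr(39.2)  # roughly 1.1% of the peak value
low = rmse_from_psnr(26.0)   # roughly 5% of the peak value
```

Under this reading, the gap between about 26 dB and about 39 dB corresponds to an RMS error that is smaller by a factor of roughly 4.6.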

5. Style Representation, Interpretability, and Applications

The explicit modeling of global style confers several advantages:

Robustness: Global style conditioning enables the flow estimator to "see" the entire context, substantially improving handling of spatial misalignments and occlusions.
Interpretability: In color/tone mapping, the 2-D style latent variables correspond to interpretable axes (e.g., brightness/contrast, color temperature); linear traversal yields smooth and semantically meaningful edits in appearance.
Editable and Compressed Representation: The compact style codes facilitate interactive editing and low-bitrate coding for SDR/HDR and color grading.
Downstream Tasks: The paradigm is extensible to any dense mapping problem where global stylistic or spatial consistency must be enforced, subject to availability of image pairs for style distillation.

6. Implementation Considerations and Regularization

Empirical ablations confirm the necessity of both global style and fine-scale refinement:

Use of a single style vector with modulated convolutions is critical for large-receptive-field, globally consistent flow estimation.
Omitting the reconstruction loss in tone mapping leads to scattered, less informative style embeddings and PSNR drop (>5 dB).
For color flow, a 4th-degree polynomial color basis is optimal; lower degrees lead to significant degradation.
Model efficiency: the conditional INN architecture described for tone/style modeling contains only ≈31K parameters, with real-time inference possible at ≈0.025s per 960×540 frame.
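The quoted timing can be turned into throughput figures with simple arithmetic. The sketch below uses only the numbers stated above (≈0.025 s per 960×540 frame) and ignores I/O or batching effects, which are not described here.

```python
seconds_per_frame = 0.025
frame_width, frame_height = 960, 540

frames_per_second = 1 / seconds_per_frame                        # about 40 frames per second
pixels_per_frame = frame_width * frame_height                    # 518400 pixels
megapixels_per_second = pixels_per_frame * frames_per_second / 1e6  # about 20.7
```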

Practical tips for reproducibility include normalization of input RGB, use of ActNorm initialization, multiscale MLP subnets, and per-frame batch sampling to ensure coherent style estimation (Mustafa et al., 2022).

7. Outlook and Broader Implications

Style-based global appearance flow methods offer a principled approach to integrating whole-image context and interpretable style in dense, pixel-level mappings. While initial applications have focused on visual domains such as virtual try-on and global tone/color mapping, the paradigm is expected to generalize to other tasks requiring disentanglement and explicit manipulation of global and local appearance factors. A plausible implication is the facilitation of more controllable and robust image synthesis and editing frameworks, as well as compression and cross-domain translation driven by compact stylistic codes. Ongoing research into the design of global style vectors, conditioning mechanisms, and efficient invertible networks is likely to expand the capability and reach of such models (He et al., 2022, Mustafa et al., 2022).


References

He et al. (2022). Style-Based Global Appearance Flow for Virtual Try-On.

Mustafa et al. (2022). Distilling Style from Image Pairs for Global Forward and Inverse Tone Mapping.