
FlowIID: Efficient Intrinsic Image Decomposition

Updated 25 January 2026
  • FlowIID is an intrinsic image decomposition method that separates RGB images into albedo and shading components using a single deterministic pass under the Lambertian model.
  • It employs a unified VAE encoder-decoder with a UNet backbone and latent flow matcher to ensure consistency, stability, and a reduced parameter count compared to multi-step models.
  • It achieves competitive performance on benchmarks—such as a 0.0040 albedo MSE on the MIT Intrinsic dataset—making it ideal for real-time relighting and embedded vision applications.

FlowIID is an intrinsic image decomposition (IID) architecture that factorizes an input RGB image $I$ into its albedo (reflectance, $A$) and shading (illumination, $S$) components under the standard Lambertian image formation model, $I(x) = A(x) \cdot S(x)$. This decomposition is foundational for applications such as relighting and material editing, and is increasingly deployed as a preprocessing step for higher-level computer vision pipelines. FlowIID introduces a paradigm shift in IID by leveraging latent flow matching in conjunction with a compact Variational Autoencoder (VAE), enabling deterministic, stable, and parameter-efficient decomposition in a single inference pass. FlowIID achieves competitive or superior accuracy relative to state-of-the-art methods with a fraction of the parameter budget, facilitating practical deployment in real-time and resource-constrained environments (Singla et al., 18 Jan 2026).

1. Problem Formulation and Single-Step Decomposition

The primary objective of intrinsic image decomposition is to retrieve the albedo and shading fields such that $I = A \cdot S$. Traditional approaches either utilize separate networks for albedo and shading (risking output inconsistency) or predict only shading and estimate albedo by elementwise division $A = I / S$. Modern deep IID methods often rely on multi-step diffusion networks or large, multi-branch CNNs exceeding hundreds of millions of parameters, limiting their applicability in low-latency or embedded scenarios. FlowIID circumvents these inefficiencies by directly predicting a latent representation of shading in a single forward pass through its encoder and UNet backbone. The decoded shading, together with the input image, yields albedo via $A = I / S$, obviating the need for iterative sampling and bulky architectures while ensuring decomposition consistency.
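The division $A = I / S$ is numerically fragile wherever the predicted shading approaches zero. A minimal sketch of this recovery step follows; the epsilon clamp and tensor shapes are illustrative assumptions, not details taken from the paper:

```python
import torch

def recover_albedo(image: torch.Tensor, shading: torch.Tensor,
                   eps: float = 1e-6) -> torch.Tensor:
    """Recover albedo under the Lambertian model I = A * S.

    image:   input RGB image, shape (B, 3, H, W), values in [0, 1]
    shading: predicted shading map, broadcastable to image's shape

    The epsilon clamp avoids division by near-zero shading; its
    value is an illustrative choice, not specified by the paper.
    """
    return image / shading.clamp_min(eps)
```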

2. Model Architecture and Workflow

The FlowIID architecture comprises four principal modules:

  • VAE Encoder–Decoder ($E$, $D$): The VAE operates on ground-truth shading $s_0 \in \mathbb{R}^{H \times W}$, encoding it as $z_0 \in \mathbb{R}^{8 \times H/8 \times W/8}$ and reconstructing shading via $D(z_0)$.
  • Image Encoder ($\text{Enc}$): Six down-sampling blocks utilizing Modified Residual Blocks (MRB), extracting multi-scale features from the input image.
  • UNet Backbone: Two down-pooling and two up-pooling blocks with MRBs and attention in the middle layers, integrating encoder features and latent noise.
  • Latent Flow Matcher ($u_\theta(x_t, t)$): Responsible for learning the vector field that transports Gaussian noise to shading latents.

During inference, $\text{Enc}$ processes $I$ to yield a feature map of dimension $256 \times H/8 \times W/8$, concatenated with latent noise $x_t$. The result, a $264 \times H/8 \times W/8$ tensor, enters the UNet backbone. Skip connections inject intermediate encoder outputs into corresponding UNet layers. The UNet, guided by the latent flow matcher, produces a latent shading code $\hat{z}_1$, which $D$ decodes to image space as $\hat{S}$; albedo is recovered as $A = I / \hat{S}$.
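A hedged sketch of this single-pass workflow in PyTorch follows. The modules `enc`, `unet`, and `decoder` stand in for the components above, and their exact interfaces are assumptions for illustration; the single Euler step of the flow matcher (see Section 3) is folded into the backbone call here:

```python
import torch

@torch.no_grad()
def flowiid_infer(image, enc, unet, decoder, eps=1e-6):
    """Single-pass FlowIID-style inference (module interfaces assumed).

    enc(image) yields a (B, 256, H/8, W/8) feature map; concatenating
    8 channels of Gaussian latent noise gives the 264-channel UNet
    input described above.
    """
    b, _, h, w = image.shape
    feats = enc(image)                              # (B, 256, H/8, W/8)
    x0 = torch.randn(b, 8, h // 8, w // 8,
                     device=image.device)           # latent Gaussian noise
    z1_hat = unet(torch.cat([feats, x0], dim=1))    # predicted shading latent
    shading = decoder(z1_hat)                       # decode to image space
    albedo = image / shading.clamp_min(eps)         # A = I / S_hat
    return albedo, shading
```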

3. Latent Flow Matching: Mathematical and Training Foundations

Flow matching is formulated as learning a time-continuous vector field $v_t$ that advects samples from a simple Gaussian distribution $p_0$ to complex latent targets $p_1$ (shading codes). Specifically, for $t \in [0,1]$:

  • ODE: $\mathrm{d}x_t = v_t(x_t)\,\mathrm{d}t$
  • Training Loss:

$$\mathcal{L}_{\mathrm{flow}} = \mathbb{E}_{t \sim U[0,1],\, x_t}\, \|u_\theta(x_t, t) - v_t\|_2^2$$

where

$$x_t = (1 - (1-\sigma_{\min})t)\,x_0 + t\,x_1, \qquad v_t = x_1 - (1 - \sigma_{\min})\,x_0$$

At inference, $x_0 \sim \mathcal{N}(0, I)$ is numerically integrated using a single Euler step, generating $\hat{z}_1$ for decoding.
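A minimal PyTorch sketch of this objective and the one-step sampler, assuming a network `u_theta(x_t, t)` with that call signature and an illustrative $\sigma_{\min}$ (the paper's value is not quoted here):

```python
import torch

SIGMA_MIN = 1e-4  # illustrative value; the paper's sigma_min is not quoted here

def flow_matching_loss(u_theta, x1):
    """Conditional flow-matching loss for shading latents x1 = E(s_0)."""
    x0 = torch.randn_like(x1)                            # x0 ~ N(0, I)
    t = torch.rand(x1.shape[0], device=x1.device).view(-1, 1, 1, 1)
    xt = (1 - (1 - SIGMA_MIN) * t) * x0 + t * x1         # interpolant x_t
    vt = x1 - (1 - SIGMA_MIN) * x0                       # target velocity v_t
    return ((u_theta(xt, t) - vt) ** 2).mean()

def sample_one_euler_step(u_theta, shape, device):
    """Integrate dx_t = u_theta(x_t, t) dt from t=0 to t=1 in one step."""
    x0 = torch.randn(shape, device=device)
    t0 = torch.zeros(shape[0], device=device).view(-1, 1, 1, 1)
    return x0 + u_theta(x0, t0)                          # z1_hat
```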

4. VAE Latent Encoding, Decoding, and Loss Functions

VAE training on shading $s_0$ involves encoding to $z_0 = E(s_0)$ and decoding via $D(z_0)$. The objective comprises:

  • Reconstruction Loss:

$$\mathcal{L}_{\mathrm{rec}} = \|\hat{s}_0 - s_0\|_2^2$$

  • Perceptual Loss: $\mathcal{L}_{\mathrm{perc}}$ (VGG-based feature loss)
  • KL Divergence: $\mathcal{L}_{\mathrm{KL}}$
  • Adversarial Loss: $\mathcal{L}_{\mathrm{adv}}$ (lightweight discriminator)

Total loss for the first 90 epochs (no adversary):

$$\mathcal{L}_{\mathrm{VAE}} = \mathcal{L}_{\mathrm{rec}} + 0.005\,\mathcal{L}_{\mathrm{KL}} + \mathcal{L}_{\mathrm{perc}}$$

For the subsequent 200 epochs (with adversarial tuning):

$$\mathcal{L}_{\mathrm{VAE}} = \mathcal{L}_{\mathrm{rec}} + 0.005\,\mathcal{L}_{\mathrm{KL}} + \mathcal{L}_{\mathrm{perc}} + 0.1\,\mathcal{L}_{\mathrm{adv}}$$
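The two-phase schedule can be expressed as a small helper. The weights follow the equations above; the function signature and the epoch-based switch are assumptions that mirror the described training schedule:

```python
def vae_total_loss(rec, kl, perc, adv, epoch):
    """Combine VAE loss terms with the weights given above.

    For the first 90 epochs the adversarial term is disabled; the
    0.005 (KL) and 0.1 (adversarial) weights follow the equations.
    """
    total = rec + 0.005 * kl + perc
    if epoch >= 90:                 # adversarial fine-tuning phase
        total = total + 0.1 * adv
    return total
```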

5. Parameter Efficiency and Comparative Analysis

FlowIID achieves substantial gains in parameter efficiency:

Model                          Parameters at inference (millions)
FlowIID                        51.7 (58.4 incl. VAEGAN training)
Niid-Net                       273.1
Careaga & Aksoy (Intrinsic)    252
Careaga & Aksoy (Colorful)     548
RGB⇆X diffusion                1,280

Despite a size reduction of up to an order of magnitude, FlowIID matches or surpasses the performance of far heavier models (Singla et al., 18 Jan 2026).

6. Quantitative and Qualitative Performance

On the MIT Intrinsic dataset, FlowIID sets benchmark records for both albedo and shading:

Component    MSE      LMSE     DSSIM
Albedo       0.0040   0.0043   0.0435
Shading      0.0109   0.0119   0.0823

On the ARAP dataset (no ARAP-specific finetuning):

Component    LMSE     RMSE     SSIM
Albedo       0.021    0.108    0.760
Shading      0.022    0.132    0.744

Qualitative side-by-side comparisons with Lettry et al., Niid-Net, and Careaga & Aksoy indicate albedo outputs with preserved color fidelity and low texture bleeding, and shading maps displaying smooth, spatially consistent illumination. This suggests robust separation of reflectance and illumination cues even under compact architectural constraints.

7. Ablation Studies and Design Tradeoffs

Ablation analysis on ARAP demonstrates:

  • Removing the concatenation of encoder features to the UNet input increases albedo LMSE to 0.0242 and decreases SSIM to 0.744 (from 0.021 and 0.760 for the full model).
  • Increasing UNet depth from four to five MRBs (adding 7.6 million parameters) yields no consistent improvement.

The full model—four MRBs with encoder–UNet concatenation—offers optimal parameter efficiency and best empirical results.

8. Deployment Scenarios and Applications

FlowIID’s single-step, low-parameter inference is well-matched to:

  • Real-time relighting in mobile AR and game engines
  • Material editing on embedded and resource-constrained systems
  • Preprocessing for vision in robotics and autonomous platforms

A plausible implication is increased practical adoption of IID as a standard preprocessing step in low-latency and embedded vision pipelines, given FlowIID’s balance of decomposition fidelity, consistency, and computational footprint.
