LumiX: Structured Diffusion for Intrinsic Maps

Updated 9 December 2025
  • LumiX is a structured diffusion framework that generates coordinated intrinsic scene maps (albedo, irradiance, normal, depth, color) from text or images.
  • It employs Query-Broadcast Attention for consistent spatial alignment and Tensor LoRA for efficient, cross-map parameter adaptation.
  • Experimental results show LumiX outperforms prior methods with a +23% improvement in alignment and enhanced perceptual quality metrics.

LumiX refers to a structured diffusion framework designed for coherent text-to-intrinsic generation, in which a single model, conditioned on a text prompt, generates a coordinated set of intrinsic scene maps: albedo, irradiance (shading), normal, depth, and final color. LumiX achieves this via two primary architectural components—Query-Broadcast Attention and Tensor LoRA—which, respectively, ensure structural consistency and parameter-efficient, cross-map adaptation. Empirical assessments demonstrate substantial improvements in both structural and perceptual metrics compared to prior state-of-the-art, alongside natural support for image-conditioned intrinsic decomposition (Han et al., 2 Dec 2025).

1. Problem Statement and Motivation

The core goal is to generate a set of intrinsic image properties—color, albedo, irradiance, depth, normal—from either textual descriptions or direct image conditioning, with all outputs remaining physically consistent and spatially aligned. Prior approaches using independent or loosely coupled generation pipelines (e.g., text-to-image models followed by per-task postprocessing) tend to exhibit drift or inconsistency across maps, compromising physical plausibility. LumiX addresses this limitation by enforcing shared structural cues and explicitly modeling joint cross-map correlations during the diffusion process, using a unified architecture that aligns both appearance and scene geometry.

2. System Design and Architecture

LumiX adopts a frozen text-to-image latent diffusion model (FLUX.1-dev) in which all original weights, including the Q/K/V projections, are fixed. Lightweight, trainable adapters (Tensor LoRA) are inserted into the K/V branches of each self-attention layer. The key mechanism, Query-Broadcast Attention, modifies standard multi-map self-attention by using the color map’s queries for all maps while leaving keys and values map-specific. This design shares structural cues across maps while permitting modality-specific modeling. Each intrinsic property is encoded as a latent via a pretrained autoencoder, and the $M = 5$ map latents are batched jointly during both training and inference.

3. Diffusion Training Formulation

LumiX employs the standard latent diffusion and flow-matching paradigm, operating in the latent space of a VAE encoder. For each intrinsic map latent $z^{(m)}$, the forward (noising) SDE is

$$dz = \sqrt{2}\,dW, \qquad z(t) = z_0 + \sqrt{t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

The denoising network $v_\theta(z_t, t, \mathcal{C})$ predicts the velocity (score) field, parameterized as

$$v_\theta(z_t, t, \mathcal{C}) \approx \mathbb{E}[\epsilon \mid z_t, t, \mathcal{C}] - z_0$$

The training objective is an MSE (flow matching) loss:

$$\mathcal{L}(\theta) = \mathbb{E}_{\substack{t \sim \mathrm{Unif}(0,1) \\ z_0 \sim \mathcal{D} \\ \epsilon \sim \mathcal{N}(0,I)}} \left\| v_\theta(z_t, t, \mathcal{C}) - (\epsilon - z_0) \right\|_2^2$$

Key design: each intrinsic map uses an independently sampled timestep $t^{(m)} \sim \mathrm{Uniform}[0,1]$, enforcing disentanglement while incentivizing latent consistency, as illustrated in the sketch below.
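A short PyTorch-style sketch makes the objective and the independent per-map timesteps concrete. Here `v_theta` is the adapted denoiser and `latents` holds the VAE encodings of all five maps; the tensor layout and function signature are illustrative assumptions, not the released code.

```python
import torch

def flow_matching_loss(v_theta, latents, cond):
    """Flow-matching loss with an independent timestep per intrinsic map.

    latents: (B, M, C, H, W) clean VAE latents for the M = 5 intrinsic maps.
    cond:    text (or image) conditioning passed through to the denoiser.
    The tensor layout and the v_theta signature are illustrative assumptions.
    """
    B, M = latents.shape[:2]
    # Independent timestep per (sample, map): t^(m) ~ Uniform[0, 1].
    t = torch.rand(B, M, device=latents.device)
    eps = torch.randn_like(latents)

    # Noised latents z_t = z_0 + sqrt(t) * eps, matching the stated forward process.
    z_t = latents + t.view(B, M, 1, 1, 1).sqrt() * eps

    # The network predicts the velocity target (eps - z_0) for all maps jointly.
    v_pred = v_theta(z_t, t, cond)
    return ((v_pred - (eps - latents)) ** 2).mean()
```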

4. Query-Broadcast Attention and Tensor LoRA

Query-Broadcast Attention

  • Standard per-map self-attention is defined as $H^{(m)} = \mathrm{softmax}\!\left(Q^{(m)} K^{(m)\top}/\sqrt{d}\right) V^{(m)}$.
  • In LumiX, all maps use the color map’s queries $Q^{(c)}$:

$$H^{(m)} = \mathrm{softmax}\!\left(Q^{(c)} K^{(m)\top}/\sqrt{d}\right) V^{(m)}$$

  • This configuration ensures all maps localize objects identically, enforcing alignment, while the map-specific keys and values encode modality-specific features (e.g., geometry, reflectance); a minimal sketch follows below.
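A minimal single-head sketch of the mechanism defined above; the tensor layout `(B, M, N, d)`, the `color_idx` argument, and folding any LoRA update into the projection weights are illustrative assumptions.

```python
import torch

def query_broadcast_attention(x, w_q, w_k, w_v, color_idx=0):
    """Self-attention in which every map reuses the color map's queries.

    x: (B, M, N, d) token features for the M intrinsic maps.
    w_q, w_k, w_v: (d, d) projection weights (frozen base with any LoRA update
    already folded in).  Single-head version for clarity; the real model is multi-head.
    """
    B, M, N, d = x.shape
    q_color = x[:, color_idx] @ w_q          # (B, N, d): queries come from the color map only
    k = x @ w_k                              # (B, M, N, d): map-specific keys
    v = x @ w_v                              # (B, M, N, d): map-specific values

    # Broadcast the color queries to all M maps so attention weights align spatially.
    attn = torch.einsum('bnd,bmkd->bmnk', q_color, k) / d ** 0.5
    attn = attn.softmax(dim=-1)
    return torch.einsum('bmnk,bmkd->bmnd', attn, v)   # (B, M, N, d)
```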

Tensor LoRA

  • Traditional LoRA adds a rank-$R$ update to frozen weights, $W \leftarrow W + AB^\top$, with $A, B \in \mathbb{R}^{d \times R}$.
  • For $M$ maps and K/V projections per map, parameter cost scales as $2Md^2$ for separate adaptation.
  • Tensor LoRA introduces a 4th-order tensor update with the decomposition

$$\Delta[i,o,j,i'] = \sum_{\alpha_1=1}^{R_1} \sum_{\alpha_2=1}^{R_2} A[i,o,\alpha_1]\, B[i,j,\alpha_2]\, C[i,i',\alpha_1,\alpha_2]$$

where $A$ (output-core), $B$ (map-core), and $C$ (coupling-core) compress the fused map-adapter space. With $R_1 = R_2 = 8$, substantial parameter efficiency is achieved without loss of expressiveness; a sketch of this update follows below.
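The decomposition can be assembled as follows, under an assumed reading of the indices ($i$ as the adapted projection, $o$ as output dimension, $j$ as map index, $i'$ as input dimension); this is a sketch of the construction, not the authors' implementation.

```python
import torch

def tensor_lora_delta(A, B, C):
    """Assemble the 4th-order Tensor LoRA update from its three cores.

    A: (P, d_out, R1)     output-core   (P = adapted projections per layer, e.g. K and V)
    B: (P, M, R2)         map-core      (M = number of intrinsic maps)
    C: (P, d_in, R1, R2)  coupling-core
    Returns Delta of shape (P, d_out, M, d_in) with
    Delta[p, o, j, i] = sum over a1, a2 of A[p, o, a1] * B[p, j, a2] * C[p, i, a1, a2].
    """
    return torch.einsum('poa,pjb,piab->poji', A, B, C)

# Toy example: d = 64, M = 5 maps, ranks R1 = R2 = 8 (shapes are illustrative).
P, d, M, R1, R2 = 2, 64, 5, 8, 8
A = torch.randn(P, d, R1) * 0.01
B = torch.randn(P, M, R2) * 0.01
C = torch.randn(P, d, R1, R2) * 0.01
delta = tensor_lora_delta(A, B, C)        # (2, 64, 5, 64)

W_k = torch.randn(d, d)                   # frozen base K projection
W_k_for_map1 = W_k + delta[0, :, 1, :]    # effective K weights for map index 1
```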

Ablation comparisons show Tensor LoRA outperforms both separate and fused block updates, achieving the best trade-off between alignment and parameter cost at $2.34$M parameters per block.

5. Training and Inference Procedures

Data and Conditioning

  • Training uses a Hypersim subset with $\sim$3,000 photorealistic scenes, each providing HDR color, albedo, shading, normal, and depth. Images are tone-mapped and gamma-corrected.
  • For text conditioning, BLIP-2 generates single-line captions.

Optimization

  • The base diffusion model and VAE weights remain frozen.
  • Only the Tensor LoRA adapters inserted in the K/V branches are trained, via the mean flow-matching loss across all maps.
  • Training uses the Prodigy optimizer at learning rate $1.0$ with batch size $16$, on 4$\times$A100 80GB GPUs for roughly 40 hours (10K steps); a schematic of this setup follows below.
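A schematic of this optimization setup, assuming the `prodigyopt` package for the Prodigy optimizer, hypothetical handles for the frozen models and adapters, and the `flow_matching_loss` sketch from Section 3; it is not the released training code.

```python
from prodigyopt import Prodigy  # assumed: `pip install prodigyopt`

def train_adapters(base_model, vae, adapters, v_theta, data_iter, steps=10_000):
    """Optimization loop matching the description above: frozen base + VAE,
    trainable Tensor LoRA adapters, Prodigy at learning rate 1.0.
    All handles (base_model, vae, adapters, v_theta, data_iter) are hypothetical."""
    for p in base_model.parameters():
        p.requires_grad_(False)
    for p in vae.parameters():
        p.requires_grad_(False)

    optimizer = Prodigy(adapters.parameters(), lr=1.0)

    for _ in range(steps):
        batch = next(data_iter)                     # VAE latents for all 5 maps + BLIP-2 captions
        # Reuses the flow_matching_loss sketch from Section 3.
        loss = flow_matching_loss(v_theta, batch["latents"], batch["captions"])
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```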

Inference Modes

  • Text-to-intrinsic: sample $z^{(m)}_T \sim \mathcal{N}(0, I)$ for all $m$, then jointly denoise to generate all maps from the prompt.
  • Image-to-intrinsic: fix one latent (e.g., color) at $t = 0$ and denoise the others conditionally to obtain a consistent decomposition; both modes are sketched below.
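Both modes can be expressed with one joint sampler that optionally pins a subset of maps to their clean encodings. The Euler-style integration, the `fixed` mask, and all names below are illustrative assumptions rather than the paper's sampler.

```python
import torch

@torch.no_grad()
def sample_maps(v_theta, cond, shape, fixed=None, clean=None, steps=50):
    """Jointly sample the M intrinsic map latents.

    shape: (B, M, C, H, W) latent shape.
    fixed: optional (M,) boolean mask; True maps are pinned to `clean` (t = 0)
           and never noised, which yields image-conditioned decomposition.
    Illustrative Euler-style integration of the learned velocity field.
    """
    z = torch.randn(shape)                          # text-to-intrinsic: pure noise for every map
    if fixed is not None:
        z[:, fixed] = clean[:, fixed]               # image-to-intrinsic: clean latents for fixed maps

    ts = torch.linspace(1.0, 0.0, steps + 1).tolist()
    for t, t_next in zip(ts[:-1], ts[1:]):
        t_map = torch.full(shape[:2], t)            # per-map timesteps
        if fixed is not None:
            t_map[:, fixed] = 0.0                   # fixed maps are treated as already clean
        v = v_theta(z, t_map, cond)
        z = z + (t_next - t) * v                    # Euler step toward t = 0
        if fixed is not None:
            z[:, fixed] = clean[:, fixed]           # re-pin fixed maps after every step
    return z
```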

6. Experimental Results and Ablations

Quantitative results, summarized:

Method                        Alignment ↑   ImageReward ↑   PickScore ↑   Avg ↑
IntrinsiX (vanilla LoRA)      6.73          -0.37           19.83         -0.41
LumiX (Q-Broadcast+Tensor)    8.30          0.45            21.02         0.19

LumiX outperforms prior art by $+23\%$ on alignment and shifts the human-preference average (ImageReward/PickScore) from $-0.41$ to $0.19$, with lower per-block parameter and compute costs (Han et al., 2 Dec 2025).

For zero-shot albedo decomposition (ARAP [10]):

Method                    T2I RMSE ↓   SSIM ↑
Colorful Shading [12]     0.149        0.796
IID [30]                  0.160        0.738
IntrinsicAnything [39]    0.171        0.692
LumiX                     0.165        0.753

In-the-wild decomposition over 50 real photos:

Method                  ImageReward ↑   PickScore ↑
RGB↔X [40]              -0.20           20.01
Colorful Shading [12]   0.06            20.03
LumiX                   0.14            20.16

Ablation analyses reveal:

  • Query-Broadcast yields higher alignment than per-map (tuned) queries: replacing it drops alignment from $8.30$ to $7.14$.
  • A lower Tensor LoRA rank ($R=4$) marginally drops performance ($7.86$ alignment) but remains superior to competing layouts.
  • Separate LoRA yields high per-map image quality but poor cross-map alignment ($4.40$).
  • Hybrid and fused LoRA offer intermediate trade-offs.

7. Image-Conditioned Intrinsic Decomposition and Practical Capabilities

By setting the latent code for a single property (e.g., color) to its clean VAE encoding ($t = 0$) during inference, and diffusing the remaining properties, LumiX enables robust intrinsic decomposition from real images. Query-Broadcast ensures geometric and textural properties remain tightly coupled to image structure. Comparative results show cleaner albedo, more plausible normals, and physically consistent decomposition in both synthetic (Hypersim) and natural (wild-captured) scenarios when compared to RGB↔X [40] and Colorful Shading [12].

Flexible editing becomes possible: arbitrary combinations of fixed and generated intrinsic maps, or text-augmented editing of material and lighting attributes, can be realized within the same pipeline, as in the usage example below.
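As a hypothetical usage of the `sample_maps` sketch from Section 5, one could pin albedo and depth while regenerating the remaining maps under an edited prompt; the map ordering and the `encode_prompt` and `clean_latents` names are assumptions for illustration.

```python
import torch

# Hypothetical usage of the sample_maps sketch from Section 5: keep measured albedo
# and depth, regenerate color, irradiance, and normal under an edited text prompt.
# The map order (color, albedo, irradiance, normal, depth) and encode_prompt are assumed.
fixed = torch.tensor([False, True, False, False, True])
edited = sample_maps(
    v_theta,
    encode_prompt("warm sunset lighting"),   # hypothetical text encoder for the new prompt
    shape=clean_latents.shape,               # clean_latents: VAE encodings of the source image's maps
    fixed=fixed,
    clean=clean_latents,
)
```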


By integrating Query-Broadcast Attention and Tensor LoRA, LumiX establishes a parameter-efficient, unified structured diffusion backbone for joint text-conditioned and image-conditioned intrinsic scene generation, delivering state-of-the-art consistency and preference scores across multiple benchmarks while supporting flexible scene manipulation (Han et al., 2 Dec 2025).
