LumiX: Structured Diffusion for Intrinsic Maps

Updated 9 December 2025
  • LumiX is a structured diffusion framework that generates coordinated intrinsic scene maps (albedo, irradiance, normal, depth, color) from text or images.
  • It employs Query-Broadcast Attention for consistent spatial alignment and Tensor LoRA for efficient, cross-map parameter adaptation.
  • Experimental results show LumiX outperforms prior methods with a +23% improvement in alignment and enhanced perceptual quality metrics.

LumiX refers to a structured diffusion framework designed for coherent text-to-intrinsic generation, in which a single model, conditioned on a text prompt, generates a coordinated set of intrinsic scene maps: albedo, irradiance (shading), normal, depth, and final color. LumiX achieves this via two primary architectural components—Query-Broadcast Attention and Tensor LoRA—which, respectively, ensure structural consistency and parameter-efficient, cross-map adaptation. Empirical assessments demonstrate substantial improvements in both structural and perceptual metrics compared to prior state-of-the-art, alongside natural support for image-conditioned intrinsic decomposition (Han et al., 2 Dec 2025).

1. Problem Statement and Motivation

The core goal is to generate a set of intrinsic image properties—color, albedo, irradiance, depth, normal—from either textual descriptions or direct image conditioning, with all outputs remaining physically consistent and spatially aligned. Prior approaches using independent or loosely coupled generation pipelines (e.g., text-to-image models followed by per-task postprocessing) tend to exhibit drift or inconsistency across maps, compromising physical plausibility. LumiX addresses this limitation by enforcing shared structural cues and explicitly modeling joint cross-map correlations during the diffusion process, using a unified architecture that aligns both appearance and scene geometry.

2. System Design and Architecture

LumiX adopts a frozen text-to-image latent diffusion model (FLUX.1-dev) in which all original weights, including the Q/K/V projections, are fixed. Lightweight, trainable adapters (Tensor LoRA) are inserted into the K/V branches of each self-attention layer. The key mechanism, Query-Broadcast Attention, modifies standard multi-map self-attention by using the color map’s queries for all maps while leaving keys and values map-specific. This design shares structural cues across maps while permitting modality-specific modeling. Each intrinsic property is encoded as a latent via a pretrained autoencoder, and the $M = 5$ map latents are batched jointly during both training and inference.

3. Diffusion Training Formulation

LumiX employs the standard latent diffusion and flow-matching paradigm, operating in the latent space of a VAE encoder. For each intrinsic map latent $z^{(m)}$, the forward (noising) SDE is

$$dz = \sqrt{2}\,dW, \qquad z(t) = z_0 + \sqrt{t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

The denoising network $v_\theta(z_t, t, \mathcal{C})$ predicts the velocity (score) field, parameterized as

$$v_\theta(z_t, t, \mathcal{C}) \approx \mathbb{E}[\epsilon \mid z_t, t, \mathcal{C}] - z_0$$

The training objective is an MSE (flow matching) loss:

$$\mathcal{L}(\theta) = \mathbb{E}_{\substack{t \sim \mathrm{Unif}(0,1) \\ z_0 \sim \mathcal{D} \\ \epsilon \sim \mathcal{N}(0,I)}} \left\| v_\theta(z_t, t, \mathcal{C}) - (\epsilon - z_0) \right\|_2^2$$

Key design: each intrinsic map uses an independently sampled timestep $t^{(m)} \sim \mathrm{Uniform}[0,1]$, enforcing disentanglement while incentivizing latent consistency, as illustrated in the sketch below.
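A short PyTorch-style sketch makes the objective and the independent per-map timesteps concrete. Here `v_theta` is the adapted denoiser and `latents` holds the VAE encodings of all five maps; the tensor layout and function signature are illustrative assumptions, not the released code.

```python
import torch

def flow_matching_loss(v_theta, latents, cond):
    """Flow-matching loss with an independent timestep per intrinsic map.

    latents: (B, M, C, H, W) clean VAE latents for the M = 5 intrinsic maps.
    cond:    text (or image) conditioning passed through to the denoiser.
    The tensor layout and the v_theta signature are illustrative assumptions.
    """
    B, M = latents.shape[:2]
    # Independent timestep per (sample, map): t^(m) ~ Uniform[0, 1].
    t = torch.rand(B, M, device=latents.device)
    eps = torch.randn_like(latents)

    # Noised latents z_t = z_0 + sqrt(t) * eps, matching the stated forward process.
    z_t = latents + t.view(B, M, 1, 1, 1).sqrt() * eps

    # The network predicts the velocity target (eps - z_0) for all maps jointly.
    v_pred = v_theta(z_t, t, cond)
    return ((v_pred - (eps - latents)) ** 2).mean()
```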

4. Query-Broadcast Attention and Tensor LoRA

Query-Broadcast Attention

  • Standard per-map self-attention is defined as $H^{(m)} = \mathrm{softmax}\!\left(Q^{(m)} K^{(m)\top}/\sqrt{d}\right) V^{(m)}$.
  • In LumiX, all maps use the color map’s queries $Q^{(c)}$:

$$H^{(m)} = \mathrm{softmax}\!\left(Q^{(c)} K^{(m)\top}/\sqrt{d}\right) V^{(m)}$$

  • This configuration ensures all maps localize objects identically, enforcing alignment, while the map-specific keys and values encode modality-specific features (e.g., geometry, reflectance); a minimal sketch follows below.
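A minimal single-head sketch of the mechanism defined above; the tensor layout `(B, M, N, d)`, the `color_idx` argument, and folding any LoRA update into the projection weights are illustrative assumptions.

```python
import torch

def query_broadcast_attention(x, w_q, w_k, w_v, color_idx=0):
    """Self-attention in which every map reuses the color map's queries.

    x: (B, M, N, d) token features for the M intrinsic maps.
    w_q, w_k, w_v: (d, d) projection weights (frozen base with any LoRA update
    already folded in).  Single-head version for clarity; the real model is multi-head.
    """
    B, M, N, d = x.shape
    q_color = x[:, color_idx] @ w_q          # (B, N, d): queries come from the color map only
    k = x @ w_k                              # (B, M, N, d): map-specific keys
    v = x @ w_v                              # (B, M, N, d): map-specific values

    # Broadcast the color queries to all M maps so attention weights align spatially.
    attn = torch.einsum('bnd,bmkd->bmnk', q_color, k) / d ** 0.5
    attn = attn.softmax(dim=-1)
    return torch.einsum('bmnk,bmkd->bmnd', attn, v)   # (B, M, N, d)
```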

Tensor LoRA

  • Traditional LoRA adds a rank-$R$ update to frozen weights, $W \leftarrow W + AB^\top$, with $A, B \in \mathbb{R}^{d \times R}$.
  • For $M$ maps and K/V projections per map, parameter cost scales as $2Md^2$ for separate adaptation.
  • Tensor LoRA introduces a 4th-order tensor update with the decomposition

$$\Delta[i,o,j,i'] = \sum_{\alpha_1=1}^{R_1} \sum_{\alpha_2=1}^{R_2} A[i,o,\alpha_1]\, B[i,j,\alpha_2]\, C[i,i',\alpha_1,\alpha_2]$$

where $A$ (output-core), $B$ (map-core), and $C$ (coupling-core) compress the fused map-adapter space. With $R_1 = R_2 = 8$, substantial parameter efficiency is achieved without loss of expressiveness; a sketch of this update follows below.
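The decomposition can be assembled as follows, under an assumed reading of the indices ($i$ as the adapted projection, $o$ as output dimension, $j$ as map index, $i'$ as input dimension); this is a sketch of the construction, not the authors' implementation.

```python
import torch

def tensor_lora_delta(A, B, C):
    """Assemble the 4th-order Tensor LoRA update from its three cores.

    A: (P, d_out, R1)     output-core   (P = adapted projections per layer, e.g. K and V)
    B: (P, M, R2)         map-core      (M = number of intrinsic maps)
    C: (P, d_in, R1, R2)  coupling-core
    Returns Delta of shape (P, d_out, M, d_in) with
    Delta[p, o, j, i] = sum over a1, a2 of A[p, o, a1] * B[p, j, a2] * C[p, i, a1, a2].
    """
    return torch.einsum('poa,pjb,piab->poji', A, B, C)

# Toy example: d = 64, M = 5 maps, ranks R1 = R2 = 8 (shapes are illustrative).
P, d, M, R1, R2 = 2, 64, 5, 8, 8
A = torch.randn(P, d, R1) * 0.01
B = torch.randn(P, M, R2) * 0.01
C = torch.randn(P, d, R1, R2) * 0.01
delta = tensor_lora_delta(A, B, C)        # (2, 64, 5, 64)

W_k = torch.randn(d, d)                   # frozen base K projection
W_k_for_map1 = W_k + delta[0, :, 1, :]    # effective K weights for map index 1
```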

Ablation comparisons show Tensor LoRA outperforms both separate and fused block updates, achieving the best trade-off between alignment and parameter cost at $2.34$M parameters per block.

5. Training and Inference Procedures

Data and Conditioning

  • Training uses a Hypersim subset with $\sim$3,000 photorealistic scenes, each providing HDR color, albedo, shading, normal, and depth. Images are tone-mapped and gamma-corrected.
  • For text conditioning, BLIP-2 generates single-line captions.

Optimization

  • The base diffusion model and VAE weights remain frozen.
  • Only the Tensor LoRA adapters inserted in the K/V branches are trained, via the mean flow-matching loss across all maps.
  • Training uses the Prodigy optimizer at learning rate $1.0$ with batch size $16$, on 4$\times$A100 80GB GPUs for roughly 40 hours (10K steps); a schematic of this setup follows below.
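A schematic of this optimization setup, assuming the `prodigyopt` package for the Prodigy optimizer, hypothetical handles for the frozen models and adapters, and the `flow_matching_loss` sketch from Section 3; it is not the released training code.

```python
from prodigyopt import Prodigy  # assumed: `pip install prodigyopt`

def train_adapters(base_model, vae, adapters, v_theta, data_iter, steps=10_000):
    """Optimization loop matching the description above: frozen base + VAE,
    trainable Tensor LoRA adapters, Prodigy at learning rate 1.0.
    All handles (base_model, vae, adapters, v_theta, data_iter) are hypothetical."""
    for p in base_model.parameters():
        p.requires_grad_(False)
    for p in vae.parameters():
        p.requires_grad_(False)

    optimizer = Prodigy(adapters.parameters(), lr=1.0)

    for _ in range(steps):
        batch = next(data_iter)                     # VAE latents for all 5 maps + BLIP-2 captions
        # Reuses the flow_matching_loss sketch from Section 3.
        loss = flow_matching_loss(v_theta, batch["latents"], batch["captions"])
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```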

Inference Modes

  • Text-to-intrinsic: sample $z^{(m)}_T \sim \mathcal{N}(0, I)$ for all $m$, then jointly denoise to generate all maps from the prompt.
  • Image-to-intrinsic: fix one latent (e.g., color) at $t = 0$ and denoise the others conditionally to obtain a consistent decomposition; both modes are sketched below.
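Both modes can be expressed with one joint sampler that optionally pins a subset of maps to their clean encodings. The Euler-style integration, the `fixed` mask, and all names below are illustrative assumptions rather than the paper's sampler.

```python
import torch

@torch.no_grad()
def sample_maps(v_theta, cond, shape, fixed=None, clean=None, steps=50):
    """Jointly sample the M intrinsic map latents.

    shape: (B, M, C, H, W) latent shape.
    fixed: optional (M,) boolean mask; True maps are pinned to `clean` (t = 0)
           and never noised, which yields image-conditioned decomposition.
    Illustrative Euler-style integration of the learned velocity field.
    """
    z = torch.randn(shape)                          # text-to-intrinsic: pure noise for every map
    if fixed is not None:
        z[:, fixed] = clean[:, fixed]               # image-to-intrinsic: clean latents for fixed maps

    ts = torch.linspace(1.0, 0.0, steps + 1).tolist()
    for t, t_next in zip(ts[:-1], ts[1:]):
        t_map = torch.full(shape[:2], t)            # per-map timesteps
        if fixed is not None:
            t_map[:, fixed] = 0.0                   # fixed maps are treated as already clean
        v = v_theta(z, t_map, cond)
        z = z + (t_next - t) * v                    # Euler step toward t = 0
        if fixed is not None:
            z[:, fixed] = clean[:, fixed]           # re-pin fixed maps after every step
    return z
```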

6. Experimental Results and Ablations

Quantitative results, summarized:

Method                        Alignment ↑   ImageReward ↑   PickScore ↑   Avg ↑
IntrinsiX (vanilla LoRA)      6.73          -0.37           19.83         -0.41
LumiX (Q-Broadcast+Tensor)    8.30          0.45            21.02         0.19

LumiX outperforms prior art by $+23\%$ on alignment and shifts the human-preference average (ImageReward/PickScore) from $-0.41$ to $0.19$, with lower per-block parameter and compute costs (Han et al., 2 Dec 2025).

For zero-shot albedo decomposition (ARAP [10]):

Method                    T2I RMSE ↓   SSIM ↑
Colorful Shading [12]     0.149        0.796
IID [30]                  0.160        0.738
IntrinsicAnything [39]    0.171        0.692
LumiX                     0.165        0.753

In-the-wild decomposition over 50 real photos:

Method                  ImageReward ↑   PickScore ↑
RGB↔X [40]              -0.20           20.01
Colorful Shading [12]   0.06            20.03
LumiX                   0.14            20.16

Ablation analyses reveal:

  • Query-Broadcast yields higher alignment than per-map (tuned) queries: replacing it drops alignment from $8.30$ to $7.14$.
  • A lower Tensor LoRA rank ($R=4$) marginally drops performance ($7.86$ alignment) but remains superior to competing layouts.
  • Separate LoRA yields high per-map image quality but poor cross-map alignment ($4.40$).
  • Hybrid and fused LoRA offer intermediate trade-offs.

7. Image-Conditioned Intrinsic Decomposition and Practical Capabilities

By setting the latent code for a single property (e.g., color) to its clean VAE encoding ($t = 0$) during inference, and diffusing the remaining properties, LumiX enables robust intrinsic decomposition from real images. Query-Broadcast ensures geometric and textural properties remain tightly coupled to image structure. Comparative results show cleaner albedo, more plausible normals, and physically consistent decomposition in both synthetic (Hypersim) and natural (wild-captured) scenarios when compared to RGB↔X [40] and Colorful Shading [12].

Flexible editing becomes possible: arbitrary combinations of fixed and generated intrinsic maps, or text-augmented editing of material and lighting attributes, can be realized within the same pipeline, as in the usage example below.
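As a hypothetical usage of the `sample_maps` sketch from Section 5, one could pin albedo and depth while regenerating the remaining maps under an edited prompt; the map ordering and the `encode_prompt` and `clean_latents` names are assumptions for illustration.

```python
import torch

# Hypothetical usage of the sample_maps sketch from Section 5: keep measured albedo
# and depth, regenerate color, irradiance, and normal under an edited text prompt.
# The map order (color, albedo, irradiance, normal, depth) and encode_prompt are assumed.
fixed = torch.tensor([False, True, False, False, True])
edited = sample_maps(
    v_theta,
    encode_prompt("warm sunset lighting"),   # hypothetical text encoder for the new prompt
    shape=clean_latents.shape,               # clean_latents: VAE encodings of the source image's maps
    fixed=fixed,
    clean=clean_latents,
)
```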


By integrating Query-Broadcast Attention and Tensor LoRA, LumiX establishes a parameter-efficient, unified structured diffusion backbone for joint text-conditioned and image-conditioned intrinsic scene generation, delivering state-of-the-art consistency and preference scores across multiple benchmarks while supporting flexible scene manipulation (Han et al., 2 Dec 2025).
