LumiX: Structured Diffusion for Intrinsic Maps
- LumiX is a structured diffusion framework that generates coordinated intrinsic scene maps (albedo, irradiance, normal, depth, color) from text or images.
- It employs Query-Broadcast Attention for consistent spatial alignment and Tensor LoRA for efficient, cross-map parameter adaptation.
- Experimental results show LumiX outperforms prior methods with a +23% improvement in alignment and enhanced perceptual quality metrics.
LumiX refers to a structured diffusion framework designed for coherent text-to-intrinsic generation, in which a single model, conditioned on a text prompt, generates a coordinated set of intrinsic scene maps: albedo, irradiance (shading), normal, depth, and final color. LumiX achieves this via two primary architectural components—Query-Broadcast Attention and Tensor LoRA—which, respectively, ensure structural consistency and parameter-efficient, cross-map adaptation. Empirical assessments demonstrate substantial improvements in both structural and perceptual metrics compared to prior state-of-the-art, alongside natural support for image-conditioned intrinsic decomposition (Han et al., 2 Dec 2025).
1. Problem Statement and Motivation
The core goal is to generate a set of intrinsic image properties—color, albedo, irradiance, depth, normal—from either textual descriptions or direct image conditioning, with all outputs remaining physically consistent and spatially aligned. Prior approaches using independent or loosely coupled generation pipelines (e.g., text-to-image models followed by per-task postprocessing) tend to exhibit drift or inconsistency across maps, compromising physical plausibility. LumiX addresses this limitation by enforcing shared structural cues and explicitly modeling joint cross-map correlations during the diffusion process, using a unified architecture that aligns both appearance and scene geometry.
2. System Design and Architecture
LumiX adopts a frozen text-to-image latent diffusion model (FLUX.1-dev backbone) in which all original weights and Q/K/V projections are fixed. Lightweight, trainable adapters (Tensor LoRA) are inserted into the K/V branches of each self-attention layer. The key mechanism, Query-Broadcast Attention, modifies standard multi-map self-attention by using the color map's queries for all maps while leaving keys and values map-specific. This design localizes shared structure while permitting modality-specific modeling. Each intrinsic property is encoded as a latent via a pretrained autoencoder, and the latents are batched jointly (M = 5) during both training and inference.
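The following is a minimal PyTorch sketch, not the released implementation, of how a frozen self-attention block can be paired with trainable adapters on only the K/V branches; the module names (`KVAdapter`, `FrozenAttentionWithAdapters`), the rank, and the dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KVAdapter(nn.Module):
    """Low-rank residual adapter added on top of a frozen K or V projection."""
    def __init__(self, dim: int, rank: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # zero-init: training starts from the frozen model

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(self.down(h))

class FrozenAttentionWithAdapters(nn.Module):
    """Self-attention with frozen Q/K/V/output projections; only the K/V adapters train."""
    def __init__(self, dim: int, rank: int = 16):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)
        self.o = nn.Linear(dim, dim, bias=False)
        for p in self.parameters():
            p.requires_grad_(False)            # base projections stay frozen
        self.k_adapter = KVAdapter(dim, rank)  # trainable
        self.v_adapter = KVAdapter(dim, rank)  # trainable

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        q = self.q(x)
        k = self.k_adapter(self.k(x))
        v = self.v_adapter(self.v(x))
        attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return self.o(attn @ v)
```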
3. Diffusion Training Formulation
LumiX employs the standard latent diffusion and flow-matching paradigm, operating in the latent space of a VAE encoder. For each intrinsic map latent $z_0^{(m)}$, the forward (noising) process is the linear interpolation
$z_{t_m}^{(m)} = (1 - t_m)\, z_0^{(m)} + t_m\, \epsilon^{(m)}$, with $\epsilon^{(m)} \sim \mathcal{N}(0, I)$.
The denoising network predicts the velocity field, parameterized as $v_\theta\big(\{z_{t_m}^{(m)}\}_m, \{t_m\}_m, c\big)$ for text conditioning $c$.
The training objective is an MSE (flow-matching) loss:
$\mathcal{L} = \mathbb{E}\big[\tfrac{1}{M} \sum_{m=1}^{M} \| v_\theta^{(m)} - (\epsilon^{(m)} - z_0^{(m)}) \|_2^2 \big]$.
Key design: each intrinsic map uses an independent timestep schedule $t_m$, enforcing disentanglement while incentivizing latent consistency.
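Below is a minimal PyTorch sketch of this objective under the rectified-flow parameterization written above; `flow_matching_loss`, `velocity_model`, the tensor shapes, and the uniform timestep sampling are illustrative assumptions, not the paper's exact recipe.

```python
import torch

def flow_matching_loss(velocity_model, z0, text_emb):
    """Sketch of the per-map flow-matching loss.

    z0: clean latents for the M intrinsic maps, shape (M, B, C, H, W).
    Each map m draws its own timestep t_m, following the independent
    per-map schedule described above.
    """
    M, B = z0.shape[0], z0.shape[1]
    noise = torch.randn_like(z0)
    # independent timestep per map (and per sample), broadcast over C, H, W
    t = torch.rand(M, B, 1, 1, 1, device=z0.device)
    zt = (1.0 - t) * z0 + t * noise           # linear interpolation path
    target_v = noise - z0                     # rectified-flow velocity target
    pred_v = velocity_model(zt, t, text_emb)  # joint prediction over all maps
    return ((pred_v - target_v) ** 2).mean()
```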
4. Query-Broadcast Attention and Tensor LoRA
Query-Broadcast Attention
- Standard per-map self-attention is defined as $\mathrm{Attn}(Q_m, K_m, V_m) = \mathrm{softmax}\big(Q_m K_m^{\top} / \sqrt{d}\big)\, V_m$ for each map $m$.
- In LumiX, all maps utilize the color map's queries: $\mathrm{Attn}(Q_{\mathrm{color}}, K_m, V_m) = \mathrm{softmax}\big(Q_{\mathrm{color}} K_m^{\top} / \sqrt{d}\big)\, V_m$.
- This configuration ensures all maps localize objects identically, enforcing alignment, while K/V encode the modality-specific features (e.g., geometry, reflectance); a minimal sketch follows this list.
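A minimal PyTorch sketch of Query-Broadcast Attention under the definitions above; the tensor shapes and the `color_idx` convention are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def query_broadcast_attention(q_maps, k_maps, v_maps, color_idx: int = 0):
    """Sketch of Query-Broadcast Attention.

    q_maps, k_maps, v_maps: per-map projections, shape (M, B, N, d).
    All maps attend with the color map's queries, while keys and values
    stay map-specific, so spatial structure is shared across maps.
    """
    d = q_maps.shape[-1]
    q_shared = q_maps[color_idx]                      # (B, N, d) color-map queries
    out = []
    for m in range(q_maps.shape[0]):
        scores = q_shared @ k_maps[m].transpose(-2, -1) / d ** 0.5
        out.append(F.softmax(scores, dim=-1) @ v_maps[m])
    return torch.stack(out)                           # (M, B, N, d)
```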
Tensor LoRA
- Traditional LoRA adds a rank-$R$ update to a frozen weight $W \in \mathbb{R}^{d_{\mathrm{out}} \times d_{\mathrm{in}}}$, with $\Delta W = BA$, $B \in \mathbb{R}^{d_{\mathrm{out}} \times R}$, $A \in \mathbb{R}^{R \times d_{\mathrm{in}}}$.
- For $M$ maps and $P$ adapted K/V projections per map, parameter cost scales as $O(MPR(d_{\mathrm{out}} + d_{\mathrm{in}}))$ under separate adaptation.
- Tensor LoRA instead treats all updates as a single 4th-order tensor over output dimension, input dimension, maps, and projections, factorized into three cores: an output-core, a map-core, and a coupling-core that jointly compress the fused map-adapter space. With a small shared rank, substantial parameter efficiency is achieved without loss of expressiveness.
Reported ablations show Tensor LoRA outperforms both separate and fused block updates, achieving the best alignment at a parameter cost of 2.34M per block; a hedged sketch of such a factorization follows.
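Because the exact decomposition is not reproduced in this summary, the following PyTorch sketch shows only one plausible tensor-factorized adapter, sharing an output-core and a coupling-core across maps and K/V projections with a per-(map, projection) map-core; the actual LumiX factorization may differ, and all names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class TensorLoRA(nn.Module):
    """Hedged sketch of a tensor-factorized LoRA update shared across M maps
    and P projections (K and V). Instead of M*P independent rank-R adapters,
    three small cores parameterize every (map, projection) update."""
    def __init__(self, d_out: int, d_in: int, n_maps: int = 5, n_proj: int = 2, rank: int = 16):
        super().__init__()
        self.out_core = nn.Parameter(torch.zeros(d_out, rank))                   # output-core
        self.map_core = nn.Parameter(torch.randn(n_maps, n_proj, rank) * 0.02)   # map-core
        self.coupling = nn.Parameter(torch.randn(rank, d_in) * 0.02)             # coupling-core

    def delta(self, m: int, p: int) -> torch.Tensor:
        """Rank-R weight update for map m, projection p, shape (d_out, d_in)."""
        # out_core is zero-initialized, so every update starts at zero (frozen model).
        return (self.out_core * self.map_core[m, p]) @ self.coupling
```

Compared with keeping a separate rank-R pair per (map, projection), this layout stores roughly $R(d_{\mathrm{out}} + d_{\mathrm{in}}) + MPR$ parameters instead of $MPR(d_{\mathrm{out}} + d_{\mathrm{in}})$, which is the kind of compression the map-core/coupling-core description implies.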
5. Training and Inference Procedures
Data and Conditioning
- Training uses a Hypersim subset with photorealistic scenes, each providing HDR color, albedo, shading, normal, and depth. Images are tone-mapped and gamma-corrected.
- For text conditioning, BLIP-2 generates single-line captions.
Optimization
- The base diffusion backbone and VAE weights remain frozen.
- Only the Tensor LoRA adapters inserted in the K/V branches are trained, via the mean flow-matching loss across all maps.
- Training uses the Prodigy optimizer at learning rate 1.0 with batch size 16, on 4× A100 80GB GPUs for roughly 40 hours (10K steps); a minimal configuration sketch follows this list.
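A minimal configuration sketch, assuming the `prodigyopt` package provides the Prodigy optimizer; the placeholder parameters and the stand-in loss merely illustrate that only the adapter weights are optimized.

```python
import torch
from prodigyopt import Prodigy  # assumed: pip install prodigyopt

# Placeholder parameters standing in for the Tensor LoRA adapter weights; the
# frozen backbone and VAE parameters are never handed to the optimizer.
adapters = torch.nn.ParameterList([torch.nn.Parameter(torch.randn(64, 16) * 0.02)])

# Prodigy adapts its effective step size internally, so lr is left at 1.0 as in the paper.
optimizer = Prodigy(adapters.parameters(), lr=1.0)

for step in range(10):  # the paper trains ~10K steps at batch size 16 on 4x A100 80GB
    loss = sum((p ** 2).mean() for p in adapters)  # stand-in for the flow-matching loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```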
Inference Modes
- Text-to-intrinsic: all map latents are initialized from pure noise ($t_m = 1$ for every map), followed by joint denoising to sample all maps from the prompt.
- Image-to-intrinsic: fix one latent (e.g., color) at its clean VAE encoding ($t_m = 0$) and denoise the others conditionally to obtain a consistent decomposition; see the sampling sketch below.
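A minimal sketch of both inference modes as a single Euler sampler over the learned velocity field; `velocity_model`, the latent shapes, and the step count are illustrative assumptions rather than the released sampler.

```python
import torch

@torch.no_grad()
def sample_intrinsics(velocity_model, text_emb, fixed=None, steps=28,
                      shape=(5, 1, 16, 64, 64)):
    """Joint sampling sketch for M=5 intrinsic map latents.

    fixed: optional dict {map_index: clean_latent}. Text-to-intrinsic leaves it
    empty; image-to-intrinsic pins e.g. the color latent to its VAE encoding.
    """
    z = torch.randn(shape)                         # all maps start from noise (t = 1)
    fixed = fixed or {}
    for i in range(steps):
        t = 1.0 - i / steps
        for m, clean in fixed.items():
            z[m] = clean                           # conditioned map is kept clean
        t_maps = torch.full((shape[0], shape[1], 1, 1, 1), t)
        for m in fixed:
            t_maps[m] = 0.0                        # its per-map timestep is pinned to 0
        v = velocity_model(z, t_maps, text_emb)    # predicted velocity for all maps
        z = z - (1.0 / steps) * v                  # Euler step from t = 1 toward t = 0
    for m, clean in fixed.items():
        z[m] = clean
    return z                                       # decode each latent with the VAE
```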
6. Experimental Results and Ablations
Quantitative results are summarized below:
| Method | Alignment ↑ | ImageReward ↑ | PickScore ↑ | Avg ↑ |
|---|---|---|---|---|
| IntrinsiX (vanilla LoRA) | 6.73 | -0.37 | 19.83 | -0.41 |
| LumiX (Q-Broadcast+Tensor) | 8.30 | 0.45 | 21.02 | 0.19 |
LumiX outperforms prior art by +23% on alignment (6.73 → 8.30) and shifts the human-preference average (ImageReward/PickScore) from −0.41 to 0.19, with lower per-block parameter and compute costs (Han et al., 2 Dec 2025).
For zero-shot albedo decomposition (ARAP [10]):
| Method | T2I | RMSE ↓ | SSIM ↑ |
|---|---|---|---|
| Colorful Shading [12] | ✗ | 0.149 | 0.796 |
| IID [30] | ✗ | 0.160 | 0.738 |
| IntrinsicAnything [39] | ✗ | 0.171 | 0.692 |
| LumiX | ✔ | 0.165 | 0.753 |
In-the-wild decomposition over 50 real photos:
| Method | ImageReward ↑ | PickScore ↑ |
|---|---|---|
| RGB↔X [40] | –0.20 | 20.01 |
| Colorful Shading [12] | 0.06 | 20.03 |
| LumiX | 0.14 | 20.16 |
Ablation analyses reveal:
- Query-Broadcast produces higher alignment than using per-map or separately tuned queries.
- A lower Tensor LoRA rank marginally drops performance (7.86 alignment) but remains superior to competing layouts.
- Separate LoRA yields high per-map image quality but poor cross-map alignment (4.40).
- Hybrid and fused LoRA offer intermediate trade-offs.
7. Image-Conditioned Intrinsic Decomposition and Practical Capabilities
By setting the latent code for a single property (e.g., color) to its clean VAE encoding (holding its timestep at 0) during inference, and diffusing the remaining properties, LumiX enables robust intrinsic decomposition from real images. Query-Broadcast ensures geometric and textural properties remain tightly coupled to the image structure. Comparative results show cleaner albedo, more plausible normals, and physically consistent decomposition in both synthetic (Hypersim) and natural (wild-captured) scenarios when compared to RGB↔X [40] and Colorful Shading [12].
Flexible editing becomes possible: arbitrary combinations of fixed and generated intrinsic maps, or text-augmented editing of material/lighting attributes, can be realized within the same pipeline framework.
LumiX, through the integration of Query-Broadcast Attention and Tensor LoRA, establishes a parameter-efficient, unified structured diffusion backbone that advances joint text-conditioned and image-conditioned intrinsic scene generation, delivering state-of-the-art consistency and preference scores across multiple benchmarks while supporting rigorous and flexible scene manipulation (Han et al., 2 Dec 2025).