TokenLight: Precise Lighting Control in Images using Attribute Tokens

Published 16 Apr 2026 in cs.CV and cs.GR | (2604.15310v1)

Abstract: This paper presents a method for image relighting that enables precise and continuous control over multiple illumination attributes in a photograph. We formulate relighting as a conditional image generation task and introduce attribute tokens to encode distinct lighting factors such as intensity, color, ambient illumination, diffuse level, and 3D light positions. The model is trained on a large-scale synthetic dataset with ground-truth lighting annotations, supplemented by a small set of real captures to enhance realism and generalization. We validate our approach across a variety of relighting tasks, including controlling in-scene lighting fixtures and editing environment illumination using virtual light sources, on synthetic and real images. Our method achieves state-of-the-art quantitative and qualitative performance compared to prior work. Remarkably, without explicit inverse rendering supervision, the model exhibits an inherent understanding of how light interacts with scene geometry, occlusion, and materials, yielding convincing lighting effects even in traditionally challenging scenarios such as placing lights within objects or relighting transparent materials plausibly. Project page: vrroom.github.io/tokenlight/

Abstract PDF Upgrade to Chat

Authors (8)

Summary

The paper presents a diffusion-based framework that tokenizes lighting attributes for precise image relighting without requiring explicit 3D inverse rendering.
It leverages a large-scale synthetic dataset to enforce physically consistent control over ambient, diffuse, and point light effects in varied scenes.
Experiments show significant improvements in relighting metrics (e.g., PSNR, SSIM) and user preference compared to environment map-based methods.

TokenLight: Diffusion-Based Precise Lighting Control via Attribute Tokens

Introduction

TokenLight is presented as a framework for highly controlled image relighting, introducing a compact, tokenized factorization of lighting parameters for 2D images. Unlike prior paradigms relying on environment maps, global text prompts, or explicit 3D inverse-rendering decomposition, TokenLight directly encodes physically meaningful illumination attributes as tokens—specifically, per-light intensity, color, diffuseness, and 3D position—enabling precise and intuitive controllability. Relighting is learned as conditional generation via a large-scale, synthetically annotated dataset, with the objective of directly mapping scene appearance and desired lighting deltas to plausible, physically consistent relit outputs. This approach circumvents conventional inverse rendering, instead relying on the generative capacity of diffusion transformers pre-trained on large-scale text-conditioned image/video corpora and fine-tuned for lighting manipulation tasks.

Data Generation, Lighting Controls, and Scene Parameterization

TokenLight’s effectiveness is predicated on a deeply supervised, large-scale synthetic dataset where controllable per-light edits are available as ground truth. 3D scenes are rendered with varied combinations of:

Ambient scale (global environment intensity modulation),
Diffuse spread (controlling shadow softness),
Point light insertion (location, color, intensity, radius/diffuseness), and
In-scene fixture masking (segmentation and continuous per-light edits).

This approach leverages path-traced renders and procedurally generated assets to ensure coverage of challenging geometric and photometric effects needed for robust generalization. Lighting parameters are encoded in a canonical, scene-agnostic coordinate system; the parameterization remains invariant under similarity transformations, guaranteeing that user controls in image space have predictable geometric interpretation regardless of 3D scene composition. Figure 1 illustrates diverse editing controls and data supervision strategy.

Figure 1: Data construction showing granular, supervised control over per-light renders, ambient intensity, diffuse spread, and in-scene light fixture masking.

The viewpoint-agnostic camera/light parameter mapping is formalized with explicit scene transformations (scale, translation, rotation). Figure 2 conveys the unification of camera and light attributes in canonical space and the equivalence of visual effects under arbitrary placement.

Figure 2: Scene-agnostic camera and lighting parameterization via canonical reference transformation, ensuring consistent visual effects irrespective of scene geometry or layout.

Methodology and Model Architecture

Image relighting in TokenLight is structured as a conditional generation problem: $I_r \sim p_\theta(\cdot | I, \Delta L)$ where $I$ is the input image and $\Delta L$ is a set of attribute tokens encoding the desired illumination adjustment. Scalar attributes are Fourier embedded, spatial attributes tokenized per component, and light masks VA-encoded to yield a unified conditioning sequence. This sequence is processed in tandem with latent image tokens using a transformer-based diffusion backbone (see Figure 3).

Figure 3: Architecture overview; image tokens and lighting tokens jointly inform the denoising transformer for edit-aware generation.

The conditional diffusion model is trained with a flow-matching objective, optimizing straight-line velocity prediction between Gaussian noise and target image latent, guided by paired lighting-attribute deltas. Importantly, no explicit geometry/material estimation is performed; generalization arises purely from structured data coverage and the semantic content of the tokenized attributes.

Spatial Lighting Control and Quantitative Evaluation

TokenLight establishes a new standard on synthetic and real benchmarks targeting precise spatially localized lighting edits. Key experimental findings include:

On synthetic object relighting benchmarks, TokenLight achieves significantly higher PSNR and SSIM, and lower LPIPS compared to environment map–based alternatives such as Neural Gaffer and DiffusionRenderer (e.g., 21.98 PSNR vs. 16.76).
Sensitivity/accuracy analysis (via confusion matrix) demonstrates the model’s high spatial disentanglement: $B/A$ substantially outperforms Neural Gaffer (1.88 vs. 1.11), indicating robust positional awareness (see Figure 4).
Figure 5: Quantitative comparison on synthetic data, confirming superior spatial relighting accuracy relative to environment map–based methods.

Figure 4: Light-position trajectory and confusion analysis, showing TokenLight's superior localization precision and minimal positional confusion.
User preference studies indicate strong perceptual superiority, with 77.5%–89.2% favoring TokenLight over state-of-the-art baselines across a diverse set of edits.

Real-World Fixture Control and Visible Fixture-60

TokenLight’s framework extends to in-scene relighting of real images, leveraging ground-truth paired captures with controlled fixture on/off states and explicit mask annotations.

On the VisibleFixture-60 test set, the method outperforms spatial scribble-based baselines (e.g., achieving 0.85 SSIM and 0.28 LPIPS vs. ScribbleLight's 0.52/0.61).
TokenLight accurately disentangles fixture emission from global ambient and correctly modulates both shadowing and secondary effects, as seen in the plausible cast shadows and specular highlights (see Figure 6).
Figure 6: High-fidelity fixture relighting—TokenLight enables toggling scene light sources while preserving reflections, shadows, and indirect illumination.

Qualitative Comparisons and In-the-Wild Control

Systematic visual comparisons against GenLit and Careaga et al. highlight TokenLight's robustness and semantic coherence:

Unlike GenLit, which exhibits spatial drift and scene-camera entanglement, TokenLight's scene-agnostic parameterization provides stable lighting control across perspectives (see Figure 7).
The method handles challenging materials (transparency, hair, fur, specularity) with plausible physicality unattainable by two-stage or direct-intrinsic baselines (see Figures 7, 12).

Figure 8: Continuous spatial relighting and complex in-the-wild control: TokenLight synthesizes expected occlusion, shadowing, and volumetric effects via simple 3D token edits.

Figure 7: TokenLight maintains stable top-light placement and consistent shadow orientation across different scene poses—unlike GenLit.

Advanced Lighting Controls: Diffuse, Ambient, and Multi-Light

TokenLight supports global ambient scaling and fine-grained diffuse level manipulation:

Progressive reduction of ambient illumination cleanly separates masked source effects from global illumination (Figure 9).
Diffuse control enables both softening and sharpening of shadows (Figures 17 and 18).
Multi-light control is structured via repeated attribute tokens allowing independent sweeps and color mixing, supporting complex photographic lighting setups without explicit 3D scene recovery (Figure 10).
Figure 9: Continuous ambient scaling demonstrating separation between fixture reflections and global illumination in image space.

Figure 11: Increasing diffuse levels create softer shadows via global ambient-light diffusion control.

Figure 12: Negative diffuse control yields progressively sharper shadows, reversing diffuse softening as needed.

Figure 10: Multi-light, multi-trajectory control: Each light's movement produces predictable, disentangled lighting in a single inference pass.

Limitations and Future Directions

TokenLight inherits the limitations of computational scaling, requiring substantial resources for real-time or long-horizon video relighting; future work may incorporate model distillation and video-aware conditioning [(2604.15310), dmd2, consistency_models]. Stochastic seed variability results in subtle, but non-deterministic, shadow placement at extreme diffuse levels. Generalization to outdoor scenes is reduced due to an indoor-biased training corpus. Extension to spatiotemporal (video) relighting is non-trivial, opening questions regarding canonical coordinate persistence, camera/object tracking, and frame-consistent editing.

Conclusion

TokenLight demonstrates that attribute tokenization—when grounded in physically interpretable lighting parameters and supported by comprehensive synthetic supervision—enables precise, continuous, and semantically meaningful image relighting not previously achievable without explicit 3D scene analysis. The disentangled design offers a path to more accessible, granular lighting control in 2D image editing workflows, bridging the functional gap between interactive 3D DCCs and photorealistic single-image manipulation. Future research avenues include model acceleration, robust outdoor generalization, and scalable video relighting via temporally persistent canonical parameterizations.

Reference: "TokenLight: Precise Lighting Control in Images using Attribute Tokens" (2604.15310).

Markdown Report Issue