LAMS-Edit: Diffusion & Attention Mixing
- LAMS-Edit is a diffusion-based image and style editing framework that integrates latent and attention representations to balance content fidelity with edit controllability.
- It employs scheduler-driven mixing at each denoising step to interpolate inversion and edit prompts, ensuring structural integrity while applying stylistic changes.
- Additional features include region masking and LoRA-based style transfer, offering precise control for localized image modifications and nuanced transformations.
LAMS-Edit is a diffusion-based image and style editing framework that addresses the persistent challenge of balancing content preservation and edit fidelity in real-image editing tasks. By leveraging latent and attention representations derived from the inversion process and interpolating them with those generated under an edit prompt, LAMS-Edit provides a principled mechanism for controlled transformation. The core methodology—Latent and Attention Mixing with Schedulers (LAMS)—enables the framework to maintain structural integrity while applying nuanced or stylistic alterations, integrating region-selective editing and style transfer extensions.
1. Real-Image Inversion and the Editability–Fidelity Problem
Diffusion model inversion is a crucial precursor to real-image editing. Given a pre-trained text-conditional diffusion model such as Stable Diffusion, typical synthesis operates by sampling $z_T \sim \mathcal{N}(0, I)$ and performing denoising $z_T \to z_{T-1} \to \cdots \to z_0$ with a target prompt. In contrast, DDIM inversion deterministically approximates the forward noise process: a real image $x_0$ is encoded into $z_0^* = E(x_0)$, followed by recursive DDIM steps yielding $z_1^*, \dots, z_T^*$, with $z_T^*$ in particular capable of near-exact reconstruction when reversed.
For editing, inversion maps the image into the model's latent space, establishing a basis for prompt-guided manipulation. This mapping is imperfect: starting from $z_T^*$ and applying Prompt-to-Prompt (P2P) or similar methods often fails to preserve content or produces inadequate edits. The trade-off arises because early denoising stages set coarse structure; later stages inject fine details. Starting purely from noise and a new prompt forfeits alignment, while relying exclusively on the inverted latent resists edit absorption. Achieving optimal balance of content fidelity and semantic editability remains a central modeling challenge (Fu et al., 6 Jan 2026).
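The deterministic DDIM update and its inversion are the same formula run in opposite directions through the noise schedule. A minimal numpy sketch, assuming the standard DDIM parameterization in terms of the cumulative signal rate $\bar\alpha_t$ (in practice `eps` comes from the U-Net noise predictor; here it is just an array):

```python
import numpy as np

def ddim_step(z_t, eps, abar_t, abar_prev):
    """One deterministic DDIM denoising step t -> t-1."""
    # Predicted clean latent from the current noisy latent and noise estimate.
    x0_pred = (z_t - np.sqrt(1.0 - abar_t) * eps) / np.sqrt(abar_t)
    return np.sqrt(abar_prev) * x0_pred + np.sqrt(1.0 - abar_prev) * eps

def ddim_invert_step(z_t, eps, abar_t, abar_next):
    """One DDIM inversion step t -> t+1: the same update, toward higher noise."""
    x0_pred = (z_t - np.sqrt(1.0 - abar_t) * eps) / np.sqrt(abar_t)
    return np.sqrt(abar_next) * x0_pred + np.sqrt(1.0 - abar_next) * eps
```

With an identical noise estimate, inverting a latent and then denoising it reproduces the original exactly; the approximation error in real inversion comes entirely from the noise prediction drifting between the two passes.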
2. Latent and Attention Mixing with Schedulers (LAMS) Formulation
The LAMS-Edit methodology exploits the entire inversion trajectory—including both latent codes and cross-attention maps—by weighted mixing at each denoising step, governed by interpretable schedulers.
Mathematical Structure
At step $t$:
- Attention Mixing: Let $A_t^*$ denote the cross-attention map from inversion, and $\hat{A}_t$ the map from denoising with the edit prompt. The attention is interpolated as
$$A_t^{\mathrm{mix}} = w_t^A \, A_t^* + (1 - w_t^A)\, \hat{A}_t,$$
which is injected via P2P during denoising.
- Latent Mixing: After performing the denoising step under the edit prompt with mixed attention, let $z_{t-1}^*$ be the inversion latent and $\hat{z}_{t-1}$ the updated latent. Mixing yields
$$\hat{z}_{t-1} \leftarrow w_{t-1}^z \, z_{t-1}^* + (1 - w_{t-1}^z)\, \hat{z}_{t-1},$$
with $\hat{z}_{t-1}$ serving as input for the next step.
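Both operations are plain convex combinations of the stored inversion quantities and their edit-prompt counterparts. A minimal sketch (function names illustrative, not the authors' API):

```python
import numpy as np

def mix_attention(A_inv, A_edit, w_a):
    """Blend inversion and edit-prompt cross-attention maps with weight w_a."""
    return w_a * A_inv + (1.0 - w_a) * A_edit

def mix_latent(z_inv, z_edit, w_z):
    """Blend the stored inversion latent with the freshly denoised latent."""
    return w_z * z_inv + (1.0 - w_z) * z_edit
```

A weight of 1 reproduces the inversion quantity exactly (full content preservation), a weight of 0 keeps the edit-prompt result untouched (full editability); the schedulers below move between these extremes over the course of denoising.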
- Scheduler Parameterization: Each weight sequence $\{w_t^A\}$ and $\{w_t^z\}$ is specified by:
- start scale $w_{\mathrm{start}}$
- end scale $w_{\mathrm{end}}$
- decay-until step $t_{\mathrm{until}}$ (the number of denoising iterations over which the weight decays)
- decay type
For example, the logistic decay over denoising iterations $i = 1, \dots, T$ can be written as
$$w_i = w_{\mathrm{end}} + \frac{w_{\mathrm{start}} - w_{\mathrm{end}}}{1 + \exp\!\left(k\,(i - t_{\mathrm{until}}/2)\right)},$$
with steepness $k > 0$, so that $w_i$ falls smoothly from $w_{\mathrm{start}}$ toward $w_{\mathrm{end}}$ by iteration $t_{\mathrm{until}}$.
Algorithmic Overview
function LAMS_Edit(x0, p_orig, p_target, schedA, schedZ):
    z0* = E(x0)
    {z_t*, A_t*} = DDIM_Inversion(z0*, p_orig)
    wA[1..T] = BuildScheduler(schedA)
    wZ[0..T-1] = BuildScheduler(schedZ)
    ẑ_T = z_T*
    for t = T down to 1:
        ṽ_A = ExtractAttention(DM(ẑ_t, p_orig))        # attention under the original prompt
        Â_t = ExtractAttention(DM(ẑ_t, p_target))      # attention under the edit prompt
        A_mixed = wA[t]*A_t* + (1 - wA[t])*Â_t         # attention mixing
        ẑ_{t-1} = DM(ẑ_t, p_target) with P2P(ṽ_A, A_mixed)
        ẑ_{t-1} = wZ[t-1]*z_{t-1}* + (1 - wZ[t-1])*ẑ_{t-1}   # latent mixing
    end for
    return x̂0 = D(ẑ_0)
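The reverse loop has a useful sanity check: if the latent-mixing weight is pinned at 1 for every step, the output collapses onto the stored inversion trajectory and the edit reduces to pure reconstruction. A toy numpy sketch of the loop's skeleton (attention mixing omitted, with a stand-in denoiser; names are illustrative, not the authors' implementation):

```python
import numpy as np

def lams_reverse(z_inv, denoise, w_z):
    """Toy LAMS reverse pass over T steps, latent mixing only.

    z_inv:   list [z_0*, ..., z_T*] of stored inversion latents
    denoise: stand-in for one edit-prompt denoising step
    w_z:     latent-mixing weights, indexed w_z[t-1] as in the pseudocode
    """
    T = len(z_inv) - 1
    z_hat = z_inv[T]                      # start from the inverted noise latent
    for t in range(T, 0, -1):
        z_hat = denoise(z_hat)            # denoising step under the edit prompt
        # Latent mixing: pull the result back toward the inversion trajectory.
        z_hat = w_z[t - 1] * z_inv[t - 1] + (1 - w_z[t - 1]) * z_hat
    return z_hat
```

With `w_z` all ones the denoiser's output is discarded at every step and `z_inv[0]` is returned unchanged; with `w_z` all zeros the loop is an ordinary edit-prompt reverse pass.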
3. Integration, Masked Editing, and Style Transfer
LAMS-Edit incorporates several extensions to enhance versatility and control.
- Prompt-to-Prompt (P2P): P2P is applied at each denoising step, leveraging the "mixed" cross-attention activations tied to specific prompt tokens, replacing the single-prompt attention map to enable more precise edits.
- Region Masking (SAM-guided): The Segment-Anything Model (SAM) plus text-based selectors define a binary region-of-interest (ROI) mask $M$. After latent mixing:
$$\hat{z}_{t-1} \leftarrow M \odot \hat{z}_{t-1} + (1 - M) \odot z_{t-1}^*.$$
This restricts edits to desired regions, with untouched areas reverting to the original latent.
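The ROI blend is an element-wise select between the edited and original latents. A minimal sketch (names illustrative; in practice the SAM mask would be downsampled to the latent resolution):

```python
import numpy as np

def masked_blend(z_edit, z_orig, mask):
    """Keep edits inside the ROI (mask == 1); revert the rest to the original latent."""
    return mask * z_edit + (1.0 - mask) * z_orig
```

Because the blend is applied every step rather than once at the end, the denoiser sees latents whose background already matches the original, which helps avoid seams at the mask boundary.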
- LoRA-based Style Transfer: LoRA checkpoints are loaded into U-Net weights post-inversion, prior to the reverse pass with mixing. Given LoRA's confinement to denoiser convolutions, it composes without interfering with mixing schedules, supporting one-pass content and style transfer.
4. Scheduler Types, Hyperparameters, and Tuning
Empirical evaluation of scheduler designs demonstrates that schedule shape and parameters critically influence edit balance.
- Decay Types:
- Stepped: abrupt value changes
- Linear: $w_i = w_{\mathrm{start}} + (w_{\mathrm{end}} - w_{\mathrm{start}})\, i / t_{\mathrm{until}}$ for $i \le t_{\mathrm{until}}$
- Negative exponential: $w_i = w_{\mathrm{end}} + (w_{\mathrm{start}} - w_{\mathrm{end}})\, e^{-k i}$, decaying toward the end scale
- Logistic: smooth S-shaped curve
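The four decay types can be implemented as a single scheduler builder. A hedged sketch under assumed parameterizations (the paper's exact curve shapes and steepness constants may differ; the rate constants 5.0 and 10.0 below are arbitrary choices):

```python
import numpy as np

def build_scheduler(start, end, until, decay, T):
    """Per-iteration mixing weights w[0..T-1]; after `until` iterations the weight is `end`."""
    i = np.arange(T, dtype=float)
    if decay == "stepped":
        w = np.where(i < until, start, end)
    elif decay == "linear":
        w = start + (end - start) * np.clip(i / until, 0.0, 1.0)
    elif decay == "negexp":
        w = end + (start - end) * np.exp(-5.0 * i / until)           # rate 5.0 assumed
    elif decay == "logistic":
        w = end + (start - end) / (1.0 + np.exp(10.0 * (i / until - 0.5)))  # steepness 10.0 assumed
    else:
        raise ValueError(f"unknown decay type: {decay}")
    w[i >= until] = end   # clamp after the decay window
    return w
```

For example, `build_scheduler(0.6, 0.0, 10, "stepped", 50)` reproduces the default latent-mixing schedule from the table below: weight 0.6 for the first 10 iterations, then 0.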
- Default Settings:
| Mixing Type | Start | End | Until | Decay |
|---|---|---|---|---|
| Attention mixing | 0.7 | 0.1 | 50 | logistic |
| Latent mixing | 0.6 | 0.0 | 10 | stepped |
- Tuning Guidelines:
- Edits too weak: decrease the start scale or decay-until step for latent mixing.
- Content warping/identity loss: increase the start scale or decay-until step for attention mixing.
- Decay type is noncritical; linear and logistic both yield effective results.
Analysis indicates effective scheduler ranges with decay-until steps up to $20$ for latent mixing and up to $50$ for attention mixing ((Fu et al., 6 Jan 2026), supplement).
5. Empirical Evaluation and Comparative Analysis
Benchmarking employs 100 COCO 2017 images, original and edited text prompts, and qualitative assessment on DALL·E 3 and anime datasets (Anything-V4). Baselines include DiffEdit, Pix2Pix-Zero, SDEdit, Plug-and-Play, NTI + P2P, LEDITS++, and PnPInversion.
- Metrics:
- LPIPS: content fidelity (lower is better)
- CLIP Score: prompt alignment (higher is better)
- FID: additional supplement-only metric
LAMS-Edit exhibits the best trade-off curves for LPIPS vs. CLIP across varying inversion steps, with region masking providing the strongest balance (Fig. "lpips_vs_clipscore"). A user study (15 image–style pairs, 41 raters) found LAMS-Edit with mask to be preferred by approximately 50% of participants for overall quality, compared to less than 30% for DiffStyler and InST (Table "user_study_results").
- Highlights:
- Preserves geometric and facial structure better than prior approaches.
- Enables localized edits via P2P without extraneous artifacts.
- Failure Modes:
- Dramatic edits (e.g., "teapot to flying dragon") can disrupt structure if scheduler parameters are misconfigured.
- Excessive latent mixing can fully reconstruct the original, negating intended edits.
- Ablation Results:
- Attention mixing alone enhances global layout but risks identity loss.
- Latent mixing alone maintains appearance but impedes semantic transformation.
- Combined mixing (LAM) is superior; full LAMS including schedulers yields smooth fidelity-editability trade-off (Figs. "ablation_lams", "ablation_lams_2").
6. Architectural and Implementational Aspects
LAMS-Edit operates primarily on Stable Diffusion v1.5 (photo-realistic) and Anything-V4 (anime), utilizing U-Net backbones with CLIP ViT cross-attention.
- Inference:
- 50 DDIM inversion + 50 LAMS reverse steps
- Guidance scale: 7.5
- Memory: GB-scale intermediate cache held in CPU RAM, plus GPU memory for the diffusion backbone
- Inference time: benchmarked per image on a TITAN RTX (24 GB)
- Recommended Practices:
- Employ original prompt (captioner-generated or provided) for inversion.
- Restrict latent mixing to initial steps, then anneal for effective new detail injection.
- Use tightly calibrated ROI masks (SAM with precise "point" prompts recommended).
- LoRA checkpoints are additive to LAMS and require no re-tuning of schedules.
In summary, LAMS-Edit generalizes the inversion-plus-P2P paradigm by systematically mining all intermediate latent and attention states, blending them under control of explicit schedules, and supporting region masks and style transfer. The framework is reproducible with public model checkpoints and provided algorithmic procedures, and achieves state-of-the-art edit-fidelity balance in both global and localized image editing scenarios (Fu et al., 6 Jan 2026).