
LAMS-Edit: Diffusion & Attention Mixing

Updated 13 January 2026
  • LAMS-Edit is a diffusion-based image and style editing framework that integrates latent and attention representations to balance content fidelity with edit controllability.
  • It employs scheduler-driven mixing at each denoising step to interpolate inversion and edit prompts, ensuring structural integrity while applying stylistic changes.
  • Additional features include region masking and LoRA-based style transfer, offering precise control for localized image modifications and nuanced transformations.

LAMS-Edit is a diffusion-based image and style editing framework that addresses the persistent challenge of balancing content preservation and edit fidelity in real-image editing tasks. By leveraging latent and attention representations derived from the inversion process and interpolating them with those generated under an edit prompt, LAMS-Edit provides a principled mechanism for controlled transformation. The core methodology—Latent and Attention Mixing with Schedulers (LAMS)—enables the framework to maintain structural integrity while applying nuanced or stylistic alterations, integrating region-selective editing and style transfer extensions.

1. Real-Image Inversion and the Editability–Fidelity Problem

Diffusion model inversion is a crucial precursor to real-image editing. Given a pre-trained text-conditional diffusion model such as Stable Diffusion, typical synthesis operates by sampling $z_T \sim \mathcal{N}(0, I)$ and denoising via $z_{t-1} \leftarrow DM(z_t, p)$ with a target prompt $p$ for $t = T, \ldots, 1$. In contrast, DDIM inversion deterministically approximates the forward noise process: a real image $x_0$ is encoded into $z_0 = E(x_0)$, followed by recursive DDIM steps yielding $\{z^*_t\}_{t=0}^{T}$ and, in particular, $z^*_T$, which enables near-exact reconstruction when the trajectory is reversed.

For editing, inversion maps the image into the model's latent space, establishing a basis for prompt-guided manipulation. This mapping is imperfect: starting from $z^*_T$ and applying Prompt-to-Prompt (P2P) or similar methods often fails to preserve content or produces inadequate edits. The trade-off arises because early denoising stages set coarse structure, while later stages inject fine details. Starting purely from noise and a new prompt forfeits alignment with the original image, while relying exclusively on the inverted latent resists absorbing the edit. Achieving an optimal balance of content fidelity and semantic editability remains a central modeling challenge (Fu et al., 6 Jan 2026).
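
For illustration, one deterministic DDIM inversion step can be sketched in PyTorch as follows. This is a minimal sketch, assuming an epsilon-prediction U-Net with a diffusers-style call signature; `unet`, `alphas_cumprod`, and `prompt_emb` are stand-ins, not the framework's actual API.

```python
import torch

@torch.no_grad()
def ddim_inversion_step(z_t, t, t_next, unet, alphas_cumprod, prompt_emb):
    """One DDIM inversion step z_t -> z_{t_next}, with t_next noisier than t.
    Sketch only: eta = 0 and epsilon-prediction parameterization assumed."""
    eps = unet(z_t, t, encoder_hidden_states=prompt_emb).sample
    a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
    # Clean-latent estimate implied by the current noise prediction...
    z0_pred = (z_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()
    # ...re-noised to the next, noisier timestep (deterministic, so the
    # trajectory can be replayed in reverse for near-exact reconstruction).
    return a_next.sqrt() * z0_pred + (1 - a_next).sqrt() * eps
```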

2. Latent and Attention Mixing with Schedulers (LAMS) Formulation

The LAMS-Edit methodology exploits the entire inversion trajectory—including both latent codes and cross-attention maps—by weighted mixing at each denoising step, governed by interpretable schedulers.

Mathematical Structure

At step $t$:

  • Attention Mixing: Let $A^*_t$ denote the cross-attention map from inversion, and $\hat{A}_t$ the map obtained when denoising with the edit prompt. The attention is interpolated as

$$\tilde{A}^{\text{mixed}}_t = w^A_t \cdot A^*_t + (1 - w^A_t) \cdot \hat{A}_t,$$

which is injected via P2P during denoising.

  • Latent Mixing: After performing the denoising step under prompt $p$ with mixed attention, let $z^*_{t-1}$ be the inversion latent and $\hat{z}_{t-1}$ the updated latent. Mixing yields

$$\bar{z}_{t-1} = w^z_{t-1} \cdot z^*_{t-1} + (1 - w^z_{t-1}) \cdot \hat{z}_{t-1},$$

with $\hat{z}_{t-1} \leftarrow \bar{z}_{t-1}$ carried into the next step.

  • Scheduler Parameterization: Each weight sequence $\{w^A_t\}$ and $\{w^z_t\}$ is specified by:
    • start scale $s_{\text{start}} \in [0, 1]$
    • end scale $s_{\text{end}} \in [0, 1]$
    • decay-until step $s_{\text{until}} \in [1, T]$
    • decay type $\in \{\text{stepped}, \text{linear}, \text{negative-exp}, \text{logistic}\}$

For example, the logistic decay for $w_t$ is

$$w_t = s_{\text{end}} + \frac{s_{\text{start}} - s_{\text{end}}}{1 + \exp\left(\frac{t - s_{\text{mid}}}{k}\right)},$$

where $s_{\text{mid}}$ sets the midpoint of the transition and $k$ its steepness.
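
A minimal NumPy sketch of such a scheduler builder follows, assuming the step index counts completed denoising steps (so weights start near $s_{\text{start}}$ and decay toward $s_{\text{end}}$ by step $s_{\text{until}}$); the function name, the steepness parameter `k`, and the exact indexing convention are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def build_scheduler(T, s_start, s_end, s_until, decay="logistic", k=3.0):
    """Sketch of a LAMS mixing-weight scheduler: returns w[t] for t = 1..T
    (index 0 unused). Weights decay from s_start to s_end over the first
    s_until denoising steps. Hypothetical, not the authors' exact code."""
    s_until = max(int(s_until), 1)
    w = np.full(T + 1, s_end, dtype=np.float64)
    for t in range(1, T + 1):
        i = T - t  # number of completed denoising steps (0 at t = T)
        if i >= s_until:                 # past the decay window: hold s_end
            continue
        if decay == "stepped":
            w[t] = s_start               # constant, then abrupt drop to s_end
        elif decay == "linear":
            w[t] = s_start - (i / s_until) * (s_start - s_end)
        elif decay == "negative-exp":
            w[t] = s_end + (s_start - s_end) * np.exp(-k * i / s_until)
        elif decay == "logistic":
            s_mid = s_until / 2.0
            w[t] = s_end + (s_start - s_end) / (1.0 + np.exp((i - s_mid) / k))
    return w
```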

Algorithmic Overview

function LAMS_Edit(x0, p_orig, p_target, schedA, schedZ):
    z0* = E(x0)
    {z_t*, A_t*} = DDIM_Inversion(z0*, p_orig)
    wA[1..T]   = BuildScheduler(schedA)
    wZ[0..T-1] = BuildScheduler(schedZ)
    ẑ_T = z_T*
    for t = T down to 1:
        ṽ_z, ṽ_A = DM(ẑ_t, p_orig)                    # reconstruction branch
        Â_t = ExtractAttention(DM(ẑ_t, p_target))      # edit-branch attention
        A_mixed = wA[t]*A_t* + (1 - wA[t])*Â_t         # attention mixing
        ẑ_{t-1} = DM(ẑ_t, p_target) with P2P(ṽ_A, A_mixed)
        ẑ_{t-1} = wZ[t-1]*z_{t-1}* + (1 - wZ[t-1])*ẑ_{t-1}   # latent mixing
    end for
    return x̂0 = D(ẑ_0)
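
At the tensor level, both mixing operations reduce to elementwise interpolation. A hedged PyTorch fragment is given below (names hypothetical; in the actual framework the attention mix happens inside the U-Net via P2P hooks rather than as a standalone call):

```python
import torch

def lams_mix(attn_inv, attn_edit, z_inv, z_edit, w_a, w_z):
    """One LAMS step's two interpolations, as in the equations above."""
    # w_a * attn_inv + (1 - w_a) * attn_edit
    attn_mixed = torch.lerp(attn_edit, attn_inv, w_a)
    # w_z * z_inv + (1 - w_z) * z_edit
    z_mixed = torch.lerp(z_edit, z_inv, w_z)
    return attn_mixed, z_mixed
```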

3. Integration, Masked Editing, and Style Transfer

LAMS-Edit incorporates several extensions to enhance versatility and control.

  • Prompt-to-Prompt (P2P): P2P is applied at each denoising step, leveraging the "mixed" cross-attention activations tied to specific prompt tokens, replacing the single-prompt attention map to enable more precise edits.
  • Region Masking (SAM-guided): The Segment Anything Model (SAM) plus text-based selectors define a binary region-of-interest (ROI) mask $M$. After latent mixing:

$$\hat{z}_{t-1} = M \odot \bar{z}_{t-1} + (1 - M) \odot z^*_{t-1}$$

This restricts edits to the desired regions, with untouched areas reverting to the original latent (a sketch follows this list).

  • LoRA-based Style Transfer: LoRA checkpoints are loaded into the U-Net weights after inversion, prior to the reverse pass with mixing. Because LoRA is confined to the denoiser convolutions, it composes with the mixing schedules without interference, supporting one-pass content editing and style transfer (a loading sketch also follows below).
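
To make the masked update concrete, here is a minimal PyTorch sketch, assuming a binary SAM mask in pixel space that must first be resized to the latent resolution (function and tensor names are hypothetical):

```python
import torch
import torch.nn.functional as F

def apply_roi_mask(z_mixed, z_inv, mask_pixel):
    """Keep the mixed latent inside the ROI; revert to the inversion
    latent elsewhere, as in the masking equation above."""
    # SD v1.5 latents are 8x downsampled; resize the mask to match.
    mask = F.interpolate(mask_pixel.float(), size=z_mixed.shape[-2:], mode="nearest")
    return mask * z_mixed + (1.0 - mask) * z_inv
```

For the LoRA extension, loading a style checkpoint into a diffusers pipeline before the reverse pass might look like this (the checkpoint path is illustrative):

```python
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
# Inject the style LoRA into the denoiser before running the LAMS reverse pass.
pipe.load_lora_weights("path/to/style_lora.safetensors")
```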

4. Scheduler Types, Hyperparameters, and Tuning

Empirical evaluation of scheduler designs demonstrates that schedule shape and parameters critically influence edit balance.

  • Decay Types:
    • Stepped: abrupt value changes
    • Linear: $w_t = s_{\text{start}} - \frac{t}{T}(s_{\text{start}} - s_{\text{end}})$
    • Negative exponential: $w_t = s_{\text{end}} + (s_{\text{start}} - s_{\text{end}})\exp(-t/\tau)$
    • Logistic: smooth S-shaped curve
  • Default Settings:

| Mixing Type | Start | End | Until | Decay |
|---|---|---|---|---|
| Attention mixing | 0.7 | 0.1 | 50 | logistic |
| Latent mixing | 0.6 | 0.0 | 10 | stepped |
  • Tuning Guidelines:
    • Edits too weak: decrease $s_{\text{end}}$ or $s_{\text{until}}$ for latent mixing.
    • Content warping/identity loss: increase $s_{\text{start}}$ or $s_{\text{until}}$ for attention mixing.
    • Decay type is noncritical; linear and logistic both yield effective results.

Analysis indicates optimal scheduler ranges of $z_{\text{until}} \approx 10$–$20$, $A_{\text{until}} \approx 20$–$50$, $s_{z,\text{end}} \approx 0$, and $s_{A,\text{start}} \geq 0.4$ (Fu et al., 6 Jan 2026, supplement).
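
In terms of the hypothetical `build_scheduler` sketch from Section 2, the default settings above would translate to:

```python
T = 50  # DDIM steps
w_attn   = build_scheduler(T, s_start=0.7, s_end=0.1, s_until=50, decay="logistic")
w_latent = build_scheduler(T, s_start=0.6, s_end=0.0, s_until=10, decay="stepped")
```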

5. Empirical Evaluation and Comparative Analysis

Benchmarking employs 100 COCO 2017 images, original and edited text prompts, and qualitative assessment on DALL·E 3 and anime datasets (Anything-V4). Baselines include DiffEdit, Pix2Pix-Zero, SDEdit, Plug-and-Play, NTI + P2P, LEDITS++, and PnPInversion.

  • Metrics:
    • LPIPS: content fidelity (lower is better)
    • CLIP Score: prompt alignment (higher is better)
    • FID: additional metric, reported only in the supplement

LAMS-Edit exhibits the best trade-off curves for LPIPS vs. CLIP across varying inversion steps, with region masking providing the strongest balance (Fig. "lpips_vs_clipscore"). A user study (15 image–style pairs, 41 raters) found LAMS-Edit with mask to be preferred by approximately 50% of participants for overall quality, compared to less than 30% for DiffStyler and InST (Table "user_study_results").

  • Highlights:
    • Preserves geometric and facial structure better than prior approaches.
    • Enables localized edits via P2P without extraneous artifacts.
  • Failure Modes:
    • Dramatic edits (e.g., "teapot to flying dragon") can disrupt structure if scheduler parameters are misconfigured.
    • Excessive latent mixing can fully reconstruct the original, negating intended edits.
  • Ablation Results:
    • Attention mixing alone enhances global layout but risks identity loss.
    • Latent mixing alone maintains appearance but impedes semantic transformation.
    • Combined mixing (LAM) is superior; full LAMS including schedulers yields smooth fidelity-editability trade-off (Figs. "ablation_lams", "ablation_lams_2").

6. Architectural and Implementational Aspects

LAMS-Edit operates primarily on Stable Diffusion v1.5 (photo-realistic) and Anything-V4 (anime), utilizing U-Net backbones with CLIP ViT cross-attention.

  • Inference:
    • 50 DDIM inversion + 50 LAMS reverse steps
    • Guidance scale: 7.5
    • Memory: $\sim$12 GB (intermediate cache, CPU), $\sim$24 GB (GPU)
    • Inference time: $\sim$4.5 s per image (TITAN RTX, 24 GB)
  • Recommended Practices:
    • Employ the original prompt $p_{\text{orig}}$ (captioner-generated or provided) for inversion.
    • Restrict latent mixing to the initial $\sim$10 steps, then anneal, so that new details can be injected effectively.
    • Use tightly calibrated ROI masks (SAM with precise "point" prompts recommended).
    • LoRA checkpoints are additive to LAMS and require no re-tuning of schedules.
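
Putting these settings together, an end-to-end invocation might look as follows; `lams_edit` and its argument names are hypothetical stand-ins for the framework's actual entry point:

```python
# Hypothetical driver, assuming the components sketched in earlier sections.
edited = lams_edit(
    image="cat.png",
    p_orig="a photo of a cat on a sofa",    # inversion prompt (captioner or user)
    p_target="a photo of a dog on a sofa",  # edit prompt
    num_steps=50,                           # 50 DDIM inversion + 50 reverse steps
    guidance_scale=7.5,
    sched_attn=dict(s_start=0.7, s_end=0.1, s_until=50, decay="logistic"),
    sched_latent=dict(s_start=0.6, s_end=0.0, s_until=10, decay="stepped"),
    roi_mask=None,                          # optionally a SAM-derived ROI mask
)
```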

In summary, LAMS-Edit generalizes the inversion-plus-P2P paradigm by systematically mining all intermediate latent and attention states, blending them under control of explicit schedules, and supporting region masks and style transfer. The framework is reproducible with public model checkpoints and provided algorithmic procedures, and achieves state-of-the-art edit-fidelity balance in both global and localized image editing scenarios (Fu et al., 6 Jan 2026).
