
LAMS-Edit: Diffusion & Attention Mixing

Updated 13 January 2026
  • LAMS-Edit is a diffusion-based image and style editing framework that integrates latent and attention representations to balance content fidelity with edit controllability.
  • It employs scheduler-driven mixing at each denoising step to interpolate inversion and edit prompts, ensuring structural integrity while applying stylistic changes.
  • Additional features include region masking and LoRA-based style transfer, offering precise control for localized image modifications and nuanced transformations.

LAMS-Edit is a diffusion-based image and style editing framework that addresses the persistent challenge of balancing content preservation and edit fidelity in real-image editing tasks. By leveraging latent and attention representations derived from the inversion process and interpolating them with those generated under an edit prompt, LAMS-Edit provides a principled mechanism for controlled transformation. The core methodology—Latent and Attention Mixing with Schedulers (LAMS)—enables the framework to maintain structural integrity while applying nuanced or stylistic alterations, integrating region-selective editing and style transfer extensions.

1. Real-Image Inversion and the Editability–Fidelity Problem

Diffusion model inversion is a crucial precursor to real-image editing. Given a pre-trained text-conditional diffusion model such as Stable Diffusion, typical synthesis operates by sampling $z_T \sim \mathcal{N}(0, I)$ and denoising via $z_{t-1} \leftarrow DM(z_t, p)$ with a target prompt $p$ for $t = T, \ldots, 1$. In contrast, DDIM inversion deterministically approximates the forward noise process: a real image $x_0$ is encoded into $z_0 = E(x_0)$, followed by recursive DDIM steps yielding $\{z^*_t\}_{t=0}^{T}$ and, in particular, $z^*_T$, which enables near-exact reconstruction when the trajectory is reversed.

For editing, inversion maps the image into the model's latent space, establishing a basis for prompt-guided manipulation. This mapping is imperfect: starting from $z^*_T$ and applying Prompt-to-Prompt (P2P) or similar methods often fails to preserve content or produces inadequate edits. The trade-off arises because early denoising stages set coarse structure, while later stages inject fine details. Starting purely from noise and a new prompt forfeits alignment with the original image, while relying exclusively on the inverted latent resists absorbing the edit. Achieving an optimal balance of content fidelity and semantic editability remains a central modeling challenge (Fu et al., 6 Jan 2026).
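
For illustration, one deterministic DDIM inversion step can be sketched in PyTorch as follows. This is a minimal sketch, assuming an epsilon-prediction U-Net with a diffusers-style call signature; `unet`, `alphas_cumprod`, and `prompt_emb` are stand-ins, not the framework's actual API.

```python
import torch

@torch.no_grad()
def ddim_inversion_step(z_t, t, t_next, unet, alphas_cumprod, prompt_emb):
    """One DDIM inversion step z_t -> z_{t_next}, with t_next noisier than t.
    Sketch only: eta = 0 and epsilon-prediction parameterization assumed."""
    eps = unet(z_t, t, encoder_hidden_states=prompt_emb).sample
    a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
    # Clean-latent estimate implied by the current noise prediction...
    z0_pred = (z_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()
    # ...re-noised to the next, noisier timestep (deterministic, so the
    # trajectory can be replayed in reverse for near-exact reconstruction).
    return a_next.sqrt() * z0_pred + (1 - a_next).sqrt() * eps
```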

2. Latent and Attention Mixing with Schedulers (LAMS) Formulation

The LAMS-Edit methodology exploits the entire inversion trajectory—including both latent codes and cross-attention maps—by weighted mixing at each denoising step, governed by interpretable schedulers.

Mathematical Structure

At step $t$:

  • Attention Mixing: Let $A^*_t$ denote the cross-attention map from inversion, and $\hat{A}_t$ the map obtained when denoising with the edit prompt. The attention is interpolated as

$$\tilde{A}^{\text{mixed}}_t = w^A_t \cdot A^*_t + (1 - w^A_t) \cdot \hat{A}_t,$$

which is injected via P2P during denoising.

  • Latent Mixing: After performing the denoising step under prompt $p$ with mixed attention, let $z^*_{t-1}$ be the inversion latent and $\hat{z}_{t-1}$ the updated latent. Mixing yields

$$\bar{z}_{t-1} = w^z_{t-1} \cdot z^*_{t-1} + (1 - w^z_{t-1}) \cdot \hat{z}_{t-1},$$

with $\hat{z}_{t-1} \leftarrow \bar{z}_{t-1}$ carried into the next step.

  • Scheduler Parameterization: Each weight sequence $\{w^A_t\}$ and $\{w^z_t\}$ is specified by:
    • start scale $s_{\text{start}} \in [0, 1]$
    • end scale $s_{\text{end}} \in [0, 1]$
    • decay-until step $s_{\text{until}} \in [1, T]$
    • decay type $\in \{\text{stepped}, \text{linear}, \text{negative-exp}, \text{logistic}\}$

For example, the logistic decay for $w_t$ is

$$w_t = s_{\text{end}} + \frac{s_{\text{start}} - s_{\text{end}}}{1 + \exp\left(\frac{t - s_{\text{mid}}}{k}\right)},$$

where $s_{\text{mid}}$ sets the midpoint of the transition and $k$ its steepness.
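
A minimal NumPy sketch of such a scheduler builder follows, assuming the step index counts completed denoising steps (so weights start near $s_{\text{start}}$ and decay toward $s_{\text{end}}$ by step $s_{\text{until}}$); the function name, the steepness parameter `k`, and the exact indexing convention are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def build_scheduler(T, s_start, s_end, s_until, decay="logistic", k=3.0):
    """Sketch of a LAMS mixing-weight scheduler: returns w[t] for t = 1..T
    (index 0 unused). Weights decay from s_start to s_end over the first
    s_until denoising steps. Hypothetical, not the authors' exact code."""
    s_until = max(int(s_until), 1)
    w = np.full(T + 1, s_end, dtype=np.float64)
    for t in range(1, T + 1):
        i = T - t  # number of completed denoising steps (0 at t = T)
        if i >= s_until:                 # past the decay window: hold s_end
            continue
        if decay == "stepped":
            w[t] = s_start               # constant, then abrupt drop to s_end
        elif decay == "linear":
            w[t] = s_start - (i / s_until) * (s_start - s_end)
        elif decay == "negative-exp":
            w[t] = s_end + (s_start - s_end) * np.exp(-k * i / s_until)
        elif decay == "logistic":
            s_mid = s_until / 2.0
            w[t] = s_end + (s_start - s_end) / (1.0 + np.exp((i - s_mid) / k))
    return w
```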

Algorithmic Overview

function LAMS_Edit(x0, p_orig, p_target, schedA, schedZ):
    z0* = E(x0)
    {z_t*, A_t*} = DDIM_Inversion(z0*, p_orig)
    wA[1..T]   = BuildScheduler(schedA)
    wZ[0..T-1] = BuildScheduler(schedZ)
    ẑ_T = z_T*
    for t = T down to 1:
        ṽ_z, ṽ_A = DM(ẑ_t, p_orig)                    # reconstruction branch
        Â_t = ExtractAttention(DM(ẑ_t, p_target))      # edit-branch attention
        A_mixed = wA[t]*A_t* + (1 - wA[t])*Â_t         # attention mixing
        ẑ_{t-1} = DM(ẑ_t, p_target) with P2P(ṽ_A, A_mixed)
        ẑ_{t-1} = wZ[t-1]*z_{t-1}* + (1 - wZ[t-1])*ẑ_{t-1}   # latent mixing
    end for
    return x̂0 = D(ẑ_0)
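
At the tensor level, both mixing operations reduce to elementwise interpolation. A hedged PyTorch fragment is given below (names hypothetical; in the actual framework the attention mix happens inside the U-Net via P2P hooks rather than as a standalone call):

```python
import torch

def lams_mix(attn_inv, attn_edit, z_inv, z_edit, w_a, w_z):
    """One LAMS step's two interpolations, as in the equations above."""
    # w_a * attn_inv + (1 - w_a) * attn_edit
    attn_mixed = torch.lerp(attn_edit, attn_inv, w_a)
    # w_z * z_inv + (1 - w_z) * z_edit
    z_mixed = torch.lerp(z_edit, z_inv, w_z)
    return attn_mixed, z_mixed
```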

3. Integration, Masked Editing, and Style Transfer

LAMS-Edit incorporates several extensions to enhance versatility and control.

  • Prompt-to-Prompt (P2P): P2P is applied at each denoising step, leveraging the "mixed" cross-attention activations tied to specific prompt tokens, replacing the single-prompt attention map to enable more precise edits.
  • Region Masking (SAM-guided): The Segment Anything Model (SAM) plus text-based selectors define a binary region-of-interest (ROI) mask $M$. After latent mixing:

$$\hat{z}_{t-1} = M \odot \bar{z}_{t-1} + (1 - M) \odot z^*_{t-1}$$

This restricts edits to the desired regions, with untouched areas reverting to the original latent (a sketch follows this list).

  • LoRA-based Style Transfer: LoRA checkpoints are loaded into the U-Net weights after inversion, prior to the reverse pass with mixing. Because LoRA is confined to the denoiser convolutions, it composes with the mixing schedules without interference, supporting one-pass content editing and style transfer (a loading sketch also follows below).
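
To make the masked update concrete, here is a minimal PyTorch sketch, assuming a binary SAM mask in pixel space that must first be resized to the latent resolution (function and tensor names are hypothetical):

```python
import torch
import torch.nn.functional as F

def apply_roi_mask(z_mixed, z_inv, mask_pixel):
    """Keep the mixed latent inside the ROI; revert to the inversion
    latent elsewhere, as in the masking equation above."""
    # SD v1.5 latents are 8x downsampled; resize the mask to match.
    mask = F.interpolate(mask_pixel.float(), size=z_mixed.shape[-2:], mode="nearest")
    return mask * z_mixed + (1.0 - mask) * z_inv
```

For the LoRA extension, loading a style checkpoint into a diffusers pipeline before the reverse pass might look like this (the checkpoint path is illustrative):

```python
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
# Inject the style LoRA into the denoiser before running the LAMS reverse pass.
pipe.load_lora_weights("path/to/style_lora.safetensors")
```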

4. Scheduler Types, Hyperparameters, and Tuning

Empirical evaluation of scheduler designs demonstrates that schedule shape and parameters critically influence edit balance.

  • Decay Types:
    • Stepped: abrupt value changes
    • Linear: $w_t = s_{\text{start}} - \frac{t}{T}(s_{\text{start}} - s_{\text{end}})$
    • Negative exponential: $w_t = s_{\text{end}} + (s_{\text{start}} - s_{\text{end}})\exp(-t/\tau)$
    • Logistic: smooth S-shaped curve
  • Default Settings:

| Mixing Type | Start | End | Until | Decay |
|---|---|---|---|---|
| Attention mixing | 0.7 | 0.1 | 50 | logistic |
| Latent mixing | 0.6 | 0.0 | 10 | stepped |
  • Tuning Guidelines:
    • Edits too weak: decrease $s_{\text{end}}$ or $s_{\text{until}}$ for latent mixing.
    • Content warping/identity loss: increase $s_{\text{start}}$ or $s_{\text{until}}$ for attention mixing.
    • Decay type is noncritical; linear and logistic both yield effective results.

Analysis indicates optimal scheduler ranges of $z_{\text{until}} \approx 10$–$20$, $A_{\text{until}} \approx 20$–$50$, $s_{z,\text{end}} \approx 0$, and $s_{A,\text{start}} \geq 0.4$ (Fu et al., 6 Jan 2026, supplement).
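
In terms of the hypothetical `build_scheduler` sketch from Section 2, the default settings above would translate to:

```python
T = 50  # DDIM steps
w_attn   = build_scheduler(T, s_start=0.7, s_end=0.1, s_until=50, decay="logistic")
w_latent = build_scheduler(T, s_start=0.6, s_end=0.0, s_until=10, decay="stepped")
```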

5. Empirical Evaluation and Comparative Analysis

Benchmarking employs 100 COCO 2017 images, original and edited text prompts, and qualitative assessment on DALL·E 3 and anime datasets (Anything-V4). Baselines include DiffEdit, Pix2Pix-Zero, SDEdit, Plug-and-Play, NTI + P2P, LEDITS++, and PnPInversion.

  • Metrics:
    • LPIPS: content fidelity (lower is better)
    • CLIP Score: prompt alignment (higher is better)
    • FID: additional metric, reported only in the supplement

LAMS-Edit exhibits the best trade-off curves for LPIPS vs. CLIP across varying inversion steps, with region masking providing the strongest balance (Fig. "lpips_vs_clipscore"). A user study (15 image–style pairs, 41 raters) found LAMS-Edit with mask to be preferred by approximately 50% of participants for overall quality, compared to less than 30% for DiffStyler and InST (Table "user_study_results").

  • Highlights:
    • Preserves geometric and facial structure better than prior approaches.
    • Enables localized edits via P2P without extraneous artifacts.
  • Failure Modes:
    • Dramatic edits (e.g., "teapot to flying dragon") can disrupt structure if scheduler parameters are misconfigured.
    • Excessive latent mixing can fully reconstruct the original, negating intended edits.
  • Ablation Results:
    • Attention mixing alone enhances global layout but risks identity loss.
    • Latent mixing alone maintains appearance but impedes semantic transformation.
    • Combined mixing (LAM) is superior; full LAMS including schedulers yields smooth fidelity-editability trade-off (Figs. "ablation_lams", "ablation_lams_2").

6. Architectural and Implementational Aspects

LAMS-Edit operates primarily on Stable Diffusion v1.5 (photo-realistic) and Anything-V4 (anime), utilizing U-Net backbones with CLIP ViT cross-attention.

  • Inference:
    • 50 DDIM inversion + 50 LAMS reverse steps
    • Guidance scale: 7.5
    • Memory: $\sim$12 GB (intermediate cache, CPU), $\sim$24 GB (GPU)
    • Inference time: $\sim$4.5 s per image (TITAN RTX, 24 GB)
  • Recommended Practices:
    • Employ the original prompt $p_{\text{orig}}$ (captioner-generated or provided) for inversion.
    • Restrict latent mixing to the initial $\sim$10 steps, then anneal, so that new details can be injected effectively.
    • Use tightly calibrated ROI masks (SAM with precise "point" prompts recommended).
    • LoRA checkpoints are additive to LAMS and require no re-tuning of schedules.
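
Putting these settings together, an end-to-end invocation might look as follows; `lams_edit` and its argument names are hypothetical stand-ins for the framework's actual entry point:

```python
# Hypothetical driver, assuming the components sketched in earlier sections.
edited = lams_edit(
    image="cat.png",
    p_orig="a photo of a cat on a sofa",    # inversion prompt (captioner or user)
    p_target="a photo of a dog on a sofa",  # edit prompt
    num_steps=50,                           # 50 DDIM inversion + 50 reverse steps
    guidance_scale=7.5,
    sched_attn=dict(s_start=0.7, s_end=0.1, s_until=50, decay="logistic"),
    sched_latent=dict(s_start=0.6, s_end=0.0, s_until=10, decay="stepped"),
    roi_mask=None,                          # optionally a SAM-derived ROI mask
)
```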

In summary, LAMS-Edit generalizes the inversion-plus-P2P paradigm by systematically mining all intermediate latent and attention states, blending them under control of explicit schedules, and supporting region masks and style transfer. The framework is reproducible with public model checkpoints and provided algorithmic procedures, and achieves state-of-the-art edit-fidelity balance in both global and localized image editing scenarios (Fu et al., 6 Jan 2026).
