AttnRouter: Per-Category Attention Routing for Training-Free Image Editing on MMDiT

Published 2 May 2026 in cs.CV | (2605.01480v1)

Abstract: We study training-free image editing on Qwen-Image-Edit-2511, a 60-block multi-modal diffusion transformer (MMDiT) that concatenates noise and source-image tokens within a single attention stream. We make three contributions. (i) We introduce KVInject, a single-forward attention manipulation that alpha-blends source-half key/value projections into the noise-half within a localized layer/step band. KVInject is simpler than the classical two-pass MasaCtrl recipe and avoids the prompt-mismatch failure mode that disables MasaCtrl on MMDiT (composite score drops 31% versus baseline). (ii) We show that no single attention operation dominates across edit types, motivating AttnRouter, a per-category routing table that dispatches edits to the operation that best preserves source structure for that type. With ground-truth categories the router improves the CLIP-T+DINO-I composite by 6.4% over the editing baseline; an automatic CLIP zero-shot classifier closes 98% of this gap despite only 55% category accuracy. (iii) Through layer-, step-, and alpha-band ablations we localize the editing-effective attention sub-circuit: K/V injection in early denoising steps (S0-7) recovers nearly all of the gain of full-step injection, while injection in early (L0-15) or late (L45-60) layer bands fails to drive editing entirely; alpha in [0.3, 0.5] is a stable sweet spot. We also report negative results that highlight what does not transfer from the UNet folklore: simple K/V rescaling never beats baseline and aggressive variants collapse generation entirely (composite 0.084). We release code, pre-computed routing tables, and a 100-sample stratified subset of ImgEdit-Bench used in all ablations.


Authors (2)

Summary

The paper introduces KVInject and AttnRouter, enabling training-free, high-fidelity image editing through per-category attention manipulation in MMDiT.
KVInject uses an α-blend operator to mix source and noise key/value projections within targeted layer and step bands, boosting composite scores by up to 3.8%.
AttnRouter dispatches each edit category to its best-performing attention operation. With ground-truth categories it improves the composite score by 6.4% over the editing baseline, and a CLIP zero-shot auto-classifier recovers 98% of that gain.

Per-Category Attention Routing for Training-Free Image Editing on MMDiT

Introduction

The paper "AttnRouter: Per-Category Attention Routing for Training-Free Image Editing on MMDiT" (2605.01480) characterizes attention-level interventions for training-free image editing in multi-modal diffusion transformers (MMDiTs). The authors focus on Qwen-Image-Edit-2511, a 60-block transformer in which noise and source-image tokens are concatenated and flow through a single joint-attention backbone. This architecture complicates the transfer of earlier training-free editing methods designed for UNets, because the classic separation between self-attention and cross-attention no longer exists. The work proposes a single-forward attention manipulation, KVInject, and introduces AttnRouter, a category-aware dispatcher that routes each edit type to the operation that best preserves source structure.

Figure 1: End-to-end pipeline: tokens from source image and noise are encoded in a shared latent, concatenated for joint attention; KVInject blends keys/values in noise stream toward source within a local layer/step band; AttnRouter selects configuration per edit category.

Methods

KVInject: Attention-Level $\alpha$ -Blend Operator

KVInject replaces classical two-pass attention control by blending source-half key/value projections into the noise-half with an $\alpha$ -weighted scheme, restricted to configurable layer and step bands. Unlike MasaCtrl, which caches attention features from a separate forward pass under a neutral prompt, KVInject blends within the edit-prompt forward pass itself, so the source token representations it uses are already computed in the edit context. This removes the dependency on an extra pass and avoids the prompt-mismatch failure mode that affects MasaCtrl on MMDiT.

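Within a given layer/step band, KVInject applies the same blend to keys and values:

$K_{\text{noise}}' = \alpha K_{\text{src}} + (1-\alpha) K_{\text{noise}}, \qquad V_{\text{noise}}' = \alpha V_{\text{src}} + (1-\alpha) V_{\text{noise}}$

The blend is applied only to the noise stream; text-token projections are left untouched.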
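The blend above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the list-of-lists tensor layout, the requirement that noise and source halves have equal token counts, and the half-open reading of the bands (layers 30 to 44, steps 0 to 7) are all assumptions made for the example.

```python
from typing import List, Tuple

Matrix = List[List[float]]  # rows are tokens, columns are feature dimensions


def in_band(index: int, band: Tuple[int, int]) -> bool:
    """True when index lies in the half-open band [start, end)."""
    start, end = band
    return start <= index < end


def blend(noise: Matrix, src: Matrix, alpha: float) -> Matrix:
    """Row-wise alpha-blend: alpha * src + (1 - alpha) * noise."""
    if len(noise) != len(src):
        # Assumption: noise and source halves are paired token by token.
        raise ValueError("noise and source halves must have the same token count")
    return [
        [alpha * s + (1.0 - alpha) * n for n, s in zip(n_row, s_row)]
        for n_row, s_row in zip(noise, src)
    ]


def kv_inject(
    k_noise: Matrix,
    v_noise: Matrix,
    k_src: Matrix,
    v_src: Matrix,
    *,
    alpha: float,
    layer: int,
    step: int,
    layer_band: Tuple[int, int] = (30, 45),  # assumed half-open: layers 30..44
    step_band: Tuple[int, int] = (0, 8),     # assumed half-open: steps 0..7
) -> Tuple[Matrix, Matrix]:
    """Blend source K/V into the noise half only inside the layer/step band."""
    if not (in_band(layer, layer_band) and in_band(step, step_band)):
        return k_noise, v_noise  # outside the band: projections are unchanged
    return blend(k_noise, k_src, alpha), blend(v_noise, v_src, alpha)
```

In a real model the same gating would sit inside the attention module of each block, after the key/value projections and before the attention product; text-token rows would simply be excluded from the matrices passed in.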

AttnRouter: Per-Category Operation Routing

The empirical evaluation shows that the best attention manipulation depends strongly on the edit category. AttnRouter systematizes this with a discrete lookup table that maps each category label either to its best KVInject configuration or to the unmodified editing baseline. At deployment, categories are predicted by an auto-classifier based on CLIP zero-shot prediction, which closely approximates oracle routing because categories that are easily confused with one another often share the same route.

Routing decisions are made at inference using stratified anchor sentence centroids in CLIP text space, yielding robust and efficient test-time operation selection without retraining or gradient search.
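A minimal sketch of centroid-based routing follows. It assumes the text embeddings have already been produced by a CLIP text encoder; the toy 2-D vectors, the function names, and the contents of `ROUTES` are hypothetical placeholders, since the summary does not list the per-category table. Only the "add maps to baseline" entry and the typical $\alpha=0.3$, layers 30--45 setting are taken from the text.

```python
import math
from typing import Dict, List, Optional, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors: Sequence[Vector]) -> List[float]:
    """Mean of the anchor-sentence embeddings for one category."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def classify(query: Vector, centroids: Dict[str, Vector]) -> str:
    """Pick the category whose centroid is most similar to the query."""
    return max(centroids, key=lambda cat: cosine(query, centroids[cat]))


# Hypothetical routing table: None means "run the unmodified baseline".
ROUTES: Dict[str, Optional[dict]] = {
    "style": {"alpha": 0.3, "layers": (30, 45)},
    "attribute": {"alpha": 0.3, "layers": (30, 45)},
    "background": {"alpha": 0.3, "layers": (30, 45)},
    "add": None,
}


def route(query: Vector, centroids: Dict[str, Vector],
          routes: Dict[str, Optional[dict]] = ROUTES) -> Optional[dict]:
    """Return the KVInject configuration for the predicted category."""
    return routes[classify(query, centroids)]
```

Because the table is a plain lookup, selecting a route at test time costs one embedding of the edit instruction plus a handful of dot products, with no retraining or gradient search.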

Empirical Results

Composite Metric Improvement and Ablations

Main results indicate that:

KVInject with optimal configuration (typically $\alpha=0.3$ , layers 30--45) increases composite score by 3.8%.
AttnRouter (oracle) achieves a 6.4% gain, while auto-routing with CLIP classifier (55% accuracy) attains 98% of oracle performance, owing to route-sharing among categories.
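The route-sharing effect can be illustrated with a small sketch that separates category accuracy from route agreement. The labels, predictions, and route names below are invented toy data, not results from the paper; they only show how a classifier can be wrong about the category and still pick the right operation.

```python
from typing import Dict, Hashable, Sequence


def category_accuracy(true: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of samples whose predicted category equals the true one."""
    return sum(t == p for t, p in zip(true, pred)) / len(true)


def route_agreement(true: Sequence[str], pred: Sequence[str],
                    routes: Dict[str, Hashable]) -> float:
    """Fraction of samples routed to the same operation as the oracle."""
    return sum(routes[t] == routes[p] for t, p in zip(true, pred)) / len(true)


# Hypothetical table in which three categories share one route.
toy_routes = {
    "style": "kvinject",
    "attribute": "kvinject",
    "background": "kvinject",
    "add": "baseline",
}

# Toy labels: two of four categories are misclassified,
# but only among categories that share a route.
toy_true = ["style", "attribute", "background", "add"]
toy_pred = ["attribute", "attribute", "style", "add"]

acc = category_accuracy(toy_true, toy_pred)            # half the labels are wrong
agree = route_agreement(toy_true, toy_pred, toy_routes)  # every route still matches
```

In this toy case half of the category labels are wrong while every sample still receives the oracle route, which is the mechanism the summary credits for the auto-router recovering most of the oracle gain at 55% category accuracy.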

Classical MasaCtrl, transplanted from UNet, degrades composite score by 31% due to the neutral prompt's failure in yielding useful K/V for injection in MMDiT.

Figure 2: CLIP-T vs. DINO-I scatter: AttnRouter (blue stars) reaches highest iso-composite contours; MasaCtrl (brown plus) collapses along source preservation axis.

Ablation studies further resolve:

Editing effectiveness is sharply localized to layers 30--45 and denoising steps 0--7; injection outside these bands produces near-identity images or fails to enact edits.
Composite scores peak for $\alpha$ in $[0.3, 0.5]$ , with over-injection ( $\alpha=0.7$ ) causing prompt-aligned degeneration.

Figure 3: $\alpha$ -sweep curves indicate optimal editing at $\alpha=0.3$ in layers 30--45; excessive blending collapses source preservation.

Figure 4: Step-band ablation reveals that gains are confined to early denoising steps S0--7; later steps return to baseline.

Comparative Attention Visualization

KVInject produces structured noise-to-source attention maps, with strong diagonals and regionally salient off-diagonal peaks. Cosine similarity between noise and source keys/values dips most in layers 30--45, aligning with the quantitative ablation.
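One way to compute such a per-layer similarity profile is sketched below. It assumes keys (or values) have been captured per layer as lists of token vectors and that noise and source tokens are compared position by position; both the capture format and the function names are assumptions for illustration, and the paper may aggregate differently.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
Matrix = Sequence[Vector]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean_token_cosine(noise: Matrix, src: Matrix) -> float:
    """Average cosine similarity over position-paired noise/source tokens."""
    sims: List[float] = [cosine(n, s) for n, s in zip(noise, src)]
    return sum(sims) / len(sims)


def similarity_profile(
    per_layer: Dict[int, Tuple[Matrix, Matrix]],
) -> Dict[int, float]:
    """Map each layer index to its mean noise/source similarity."""
    return {layer: mean_token_cosine(n, s) for layer, (n, s) in per_layer.items()}


def most_divergent_layer(per_layer: Dict[int, Tuple[Matrix, Matrix]]) -> int:
    """Layer whose noise and source projections are least similar."""
    profile = similarity_profile(per_layer)
    return min(profile, key=profile.get)
```

Plotting the profile against layer index would show where the two streams diverge most, which can then be compared with the bands identified by the ablations.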

Figure 5: Schematic visualization: baseline maps diffuse; KVInject induces noise-to-source correspondence; maximal divergence in layers 30--45.

Per-Category Analysis

Router-based configuration yields the largest gains for style edits (+24%), modest gains for attribute and background edits, and negligible gains for add edits. Injecting source K/V is counterproductive for add edits because it suppresses content insertion, so the baseline is preferred for that category.

Figure 6: Per-category composite improvement: style edits benefit most from AttnRouter, add edits show no improvement.

Qualitative results support the router's better preservation of background and identity in style, attribute, and background edits compared with the baseline or any single static operation.

Figure 7: Qualitative comparison grid: AttnRouter maintains source structure in edits where baseline fails.

Discussion

The findings confirm that MMDiT fundamentally alters training-free editing—requiring manipulation of tightly co-attended source and noise streams rather than orthogonal self-/cross-attention levers. KVInject's same-forward blending exploits source token semantic alignment in the edit context, eliminating failure modes associated with neutral prompt inversion. The per-category routing paradigm outperforms any single operator, revealing operational heterogeneity across edit types.

Negative results, including MasaCtrl underperformance and K/V rescaling collapse, show that several UNet-era heuristics do not transfer directly to this MMDiT. AttnRouter remains effective even with an imperfect classifier, which suggests a practical path to test-time deployment.

Avenues for future research include regionally masked KVInject variants, per-step $\alpha$ scheduling, and differentiable router heads, which could enable finer adaptation and further improvement. Extension to other MMDiT backbones and evaluation on additional benchmarks would test how far these findings generalize.

Conclusion

The paper provides a systematic study of training-free image editing on an MMDiT backbone, introducing KVInject for single-forward attention blending and AttnRouter for category-aware operation routing. Structured ablations localize editing utility to specific layers and steps, and per-category routing improves the composite metric by 6.4% with oracle categories, with automatic CLIP classification closely approximating oracle performance. The results give a practical starting point for attention manipulation in MMDiT-based image editing.
