Masked Cross-Image Attention Sharing

Updated 4 July 2026

Masked cross-image attention sharing enables controlled feature transfer between images by modifying attention with explicit masks.
It is applied in domains like few-shot medical image segmentation, zero-shot appearance transfer, and scene synthesis to preserve semantic correspondence.
Empirical results show improved metrics such as Dice scores and PSNR, evidencing the benefits of iterative refinement and precise masking.

Masked cross-image attention sharing denotes a class of mechanisms in which attention is computed across two or more images, while an explicit mask constrains which features, regions, or statistics are allowed to participate in the exchange. In the literature, the idea appears in several distinct forms: CAT-Net for few-shot medical image segmentation uses a Cross Masked Attention Transformer to enable mutual interaction between support and query while restricting attention to foreground information (Lin et al., 2023); zero-shot appearance transfer replaces self-attention with cross-image attention between structure and appearance images and supplements it with masked AdaIN (Alaluf et al., 2023); example-guided scene synthesis introduces Masked Spatial-Channel Attention to align semantically unaligned scenes through cross-attention and feature masking (Zheng et al., 2019); and InstantFamily applies masked cross-attention to route multiple identity embeddings to distinct spatial regions in diffusion-based generation (Kim et al., 2024). Across these formulations, the recurring objective is to avoid indiscriminate information sharing—background redundancy, semantically incompatible transfer, or identity mixing—while preserving correspondence-aware transfer.

1. Conceptual scope and problem setting

The term does not denote a single canonical operator. Rather, it refers to a family of designs that modify ordinary self-attention so that one image can query, enhance, or synthesize another image under explicit constraints. In CAT-Net, the setting is 1-way 1-shot few-shot medical image segmentation, with a support image $I^s$ and mask $M^s$, and a query image $I^q$; the goal is to transfer useful anatomical and foreground information between them through mutual interaction instead of support-to-query transfer only (Lin et al., 2023). In zero-shot appearance transfer, the setting is a pair of images, one specifying structure and the other specifying appearance, and the model uses cross-image attention during the denoising process of a pretrained latent diffusion model to combine the two without optimization or training (Alaluf et al., 2023).

In scene synthesis, the problem is even less aligned: the exemplar image and the target label map may be structurally uncorrelated and semantically unaligned. The Masked Spatial-Channel Attention module is introduced precisely because a plain global style encoder or unmasked attention would tend to mix incompatible semantics, such as applying sky texture to snow or grass to water (Zheng et al., 2019). In multi-ID image generation, the failure mode is different again: standard cross-attention allows multiple identity conditions to bleed across regions, so masked cross-attention is used to assign each identity to a specific face region while text remains a global condition (Kim et al., 2024).

A plausible synthesis is that masked cross-image attention sharing is best understood as a correspondence-control strategy. The cross-image component determines from where information is drawn; the masking component determines what may be transferred and where it may act.

2. Attention-theoretic formulation

A common starting point is ordinary attention. InstantFamily states the standard form as

$\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

where $Q$ comes from noisy image latent features and $K,V$ come from conditioning tokens (Kim et al., 2024). In standard diffusion self-attention, the keys and values come from the same image or latent. Zero-shot appearance transfer makes the cross-image substitution explicit: the output branch keeps the structure queries, but uses the appearance branch’s keys and values, yielding

$\Delta \phi^{\text{cross}} = \text{softmax}\left(\frac{Q_{out}\cdot K_{app}^T}{\sqrt{d}}\right)V_{app}.$

This changes attention from an intra-image operation into an inter-image correspondence mechanism (Alaluf et al., 2023).
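A minimal NumPy sketch of this substitution follows. It assumes single-head attention without learned projections, and the array names (`q_out`, `k_app`, `v_app`) and sizes are illustrative rather than taken from the paper. The only change relative to self-attention is where the keys and values come from.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention; returns the output and the weights."""
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v, weights

rng = np.random.default_rng(0)
n_out, n_app, d = 6, 5, 8  # token counts and feature width (illustrative)
q_out, k_out, v_out = (rng.normal(size=(n_out, d)) for _ in range(3))
k_app, v_app = (rng.normal(size=(n_app, d)) for _ in range(2))

# Ordinary self-attention: queries, keys, and values from the same image.
self_out, self_w = attention(q_out, k_out, v_out)

# Cross-image attention: keep the output queries, borrow keys and values
# from the appearance image.
cross_out, cross_w = attention(q_out, k_app, v_app)
```

Each row of `cross_w` is a distribution over appearance tokens, so every output position is rebuilt as a weighted mix of appearance values.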

CAT-Net instantiates the same idea in a different domain. Before cross-image interaction, support and query are each passed through self-attention, producing flattened features $X^q \in \mathbb{R}^{HW\times D}$ and $X^s \in \mathbb{R}^{HW\times D}$. The paper defines

$A(Q, K)=\dfrac{QK^T}{\sqrt{d}} \qquad\text{and}\qquad O=\text{softmax}(A)V.$

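For the support branch, the cross-attention matrix is then computed by applying $A$ across the two images, with one operand projected from the support features $X^s$ and the other from the query features $X^q$.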

The intended effect is explicit matching between support and query in feature space, followed by bidirectional enhancement of both branches rather than a one-way prototype transfer (Lin et al., 2023).

MSCA is architecturally different but conceptually similar. It is a two-stage cross-attention module with an explicit masking step in between. At each scale, it first computes spatial attention over the exemplar scene and uses the resulting attention maps to aggregate exemplar content into a set of region vectors.

It then reassembles masked exemplar content into a target-aligned feature map through channel attention conditioned on the target label features (Zheng et al., 2019). This suggests that cross-image attention sharing need not be a single $Q$-$K$-$V$ substitution; it can also be a multi-stage attention pipeline that separates region extraction, masking, and redistribution.
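The aggregation and redistribution stages can be sketched as two softmax-weighted sums. This is a simplified illustration under stated assumptions, not the MSCA implementation: the logits are random stand-ins for learned projections, and the function names, shapes, and the number of regions `k` are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def aggregate_regions(exemplar_feat, spatial_logits):
    """Stage 1: each of k regions attends over exemplar positions.

    exemplar_feat: (n_e, c), spatial_logits: (k, n_e) -> regions: (k, c)
    """
    return softmax(spatial_logits, axis=1) @ exemplar_feat

def redistribute(regions, channel_logits):
    """Stage 2: each target position mixes the k region vectors.

    regions: (k, c), channel_logits: (n_t, k) -> aligned: (n_t, c)
    """
    return softmax(channel_logits, axis=1) @ regions

rng = np.random.default_rng(0)
n_e, n_t, c, k = 12, 9, 4, 3                 # illustrative sizes
exemplar_feat = rng.normal(size=(n_e, c))
spatial_logits = rng.normal(size=(k, n_e))   # assumed to come from exemplar features
channel_logits = rng.normal(size=(n_t, k))   # assumed to come from target label features

regions = aggregate_regions(exemplar_feat, spatial_logits)
aligned = redistribute(regions, channel_logits)  # target-aligned feature map
```

In the full module a masking step sits between the two calls; it is omitted here to isolate the two attention stages.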

3. The masking operation and its variants

The defining property of the topic is not merely that information flows across images, but that the flow is constrained. In CAT-Net, the constraints are foreground masks derived from segmentation, which are applied to the cross-attention matrix to produce a masked cross-attention map.

The paper states that the binary query mask is expanded and flattened to limit the foreground region in the attention map, and that the support mask, the query mask, and, in iterative inference, a dilated query mask are used to restrict attention to foreground, construct prototypes from foreground-only support pixels, and refine the next iteration’s query features (Lin et al., 2023). The double-threshold strategy, which relies on two threshold values, is specifically used so that uncertain foreground pixels are not discarded too early.
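One common way to restrict cross attention to foreground is to set the logits of background keys to negative infinity before the softmax. The sketch below uses that technique to illustrate mutual support-query interaction; it is an assumption-laden illustration and not CAT-Net's exact formulation. It uses identity projections and a residual update, and the names `x_s`, `x_q`, `m_s`, and `m_q` are placeholders for flattened features and binary masks.

```python
import numpy as np

def masked_softmax(logits, key_mask):
    """Softmax over keys, excluding keys where key_mask is False."""
    if not key_mask.any():
        raise ValueError("key_mask must keep at least one key")
    masked = np.where(key_mask[None, :], logits, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(masked)  # exp(-inf) is exactly 0 for excluded keys
    return e / e.sum(axis=-1, keepdims=True)

def masked_cross_attention(x_a, x_b, mask_b):
    """Enhance x_a with foreground-only content from x_b (residual update)."""
    logits = x_a @ x_b.T / np.sqrt(x_a.shape[-1])
    weights = masked_softmax(logits, mask_b)
    return x_a + weights @ x_b, weights

rng = np.random.default_rng(0)
x_s = rng.normal(size=(6, 4))  # flattened support features (toy size)
x_q = rng.normal(size=(6, 4))  # flattened query features (toy size)
m_s = np.array([True, True, False, False, True, False])   # support foreground
m_q = np.array([False, True, True, False, False, True])   # current query estimate

# Mutual interaction: each branch is enhanced by the other's foreground.
x_s_new, w_s = masked_cross_attention(x_s, x_q, m_q)
x_q_new, w_q = masked_cross_attention(x_q, x_s, m_s)
```

Background positions of the other image receive exactly zero weight, so they cannot contribute to the enhanced features.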

MSCA uses masking at a different stage. After spatially aggregating the exemplar region vectors, it gates them using global semantic context from both label maps.

The masking function is a 2-layer MLP followed by a sigmoid, and the operation is explicitly compared to Squeeze-and-Excitation, but here it suppresses exemplar features that are semantically irrelevant to the target scene (Zheng et al., 2019). This is a form of semantic gating rather than literal spatial foreground masking.
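A small sketch of this kind of gating is shown below, assuming a ReLU hidden layer and one gate per region-vector entry. The weights are random stand-ins for trained parameters, and the context vector, layer sizes, and function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate_regions(regions, context, w1, b1, w2, b2):
    """Gate region vectors with a 2-layer MLP followed by a sigmoid.

    regions: (k, c) aggregated exemplar vectors
    context: (m,) global semantic context (assumed pooled from both label maps)
    """
    hidden = np.maximum(context @ w1 + b1, 0.0)           # first layer + ReLU
    gates = sigmoid(hidden @ w2 + b2).reshape(regions.shape)
    return regions * gates, gates                          # values in [0, 1] scale each entry

rng = np.random.default_rng(0)
k, c, m, h = 3, 4, 8, 16                                   # illustrative sizes
regions = rng.normal(size=(k, c))
context = rng.normal(size=m)
w1, b1 = 0.1 * rng.normal(size=(m, h)), np.zeros(h)
w2, b2 = 0.1 * rng.normal(size=(h, k * c)), np.zeros(k * c)

gated, gates = gate_regions(regions, context, w1, b1, w2, b2)
```

Because every gate lies between 0 and 1, gating can only attenuate an exemplar feature, which matches the role of suppressing content that is irrelevant to the target scene.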

InstantFamily masks the cross-attention score matrix directly. It constructs a combined attention mask from a text mask and one mask per face. During training, each face mask is a binary mask over the image space and is resized to each of the four attention resolutions at which it is applied. The face region is expanded by a 25% margin around the detected face box (Kim et al., 2024). The mask therefore serves as a region-gated routing mechanism: text conditions remain global, while each face token group is restricted to its intended spatial region.
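The routing behaviour can be sketched as a boolean mask over the score matrix, with text columns always allowed and face columns allowed only inside that face's expanded box. This is an illustration under assumptions rather than InstantFamily's code: the margin is taken to mean 25% of the box width and height on each side, masks are built directly at the attention resolution, and all function names, box coordinates, and token counts are hypothetical.

```python
import numpy as np

def expand_box(box, margin, height, width):
    """Grow (x0, y0, x1, y1) by `margin` of its size on every side, clipped to the image."""
    x0, y0, x1, y1 = box
    dx, dy = margin * (x1 - x0), margin * (y1 - y0)
    return max(0, x0 - dx), max(0, y0 - dy), min(width, x1 + dx), min(height, y1 + dy)

def box_mask(box, height, width):
    """Boolean (height, width) mask of pixels whose centers fall inside the box."""
    x0, y0, x1, y1 = box
    ys, xs = np.mgrid[0:height, 0:width]
    return (xs + 0.5 >= x0) & (xs + 0.5 <= x1) & (ys + 0.5 >= y0) & (ys + 0.5 <= y1)

def build_attention_mask(face_boxes, n_text, tokens_per_face, height, width, margin=0.25):
    """Columns: text tokens (global), then one token group per face (regional)."""
    cols = [np.ones((height * width, n_text), dtype=bool)]
    for box in face_boxes:
        region = box_mask(expand_box(box, margin, height, width), height, width)
        cols.append(np.repeat(region.reshape(-1, 1), tokens_per_face, axis=1))
    return np.concatenate(cols, axis=1)

def masked_cross_attention(q, k, v, mask):
    """Scaled dot-product attention with disallowed scores set to -inf."""
    logits = np.where(mask, q @ k.T / np.sqrt(q.shape[-1]), -np.inf)
    logits -= logits.max(axis=-1, keepdims=True)
    w = np.exp(logits)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v, w

rng = np.random.default_rng(0)
height = width = 8
n_text, tokens_per_face, d = 4, 2, 16
face_boxes = [(1, 1, 3, 3), (5, 4, 7, 6)]   # two hypothetical face boxes
mask = build_attention_mask(face_boxes, n_text, tokens_per_face, height, width)

q = rng.normal(size=(height * width, d))     # image latent queries
k = rng.normal(size=(mask.shape[1], d))      # text + face keys
v = rng.normal(size=(mask.shape[1], d))
out, w = masked_cross_attention(q, k, v, mask)
```

Since the text columns are never masked, every spatial position keeps at least one valid key, and positions outside all face regions fall back to text conditioning alone.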

In zero-shot appearance transfer, the most literal “masked” component is not the cross-image attention operator itself but the normalization pathway. AdaIN is applied to align the feature statistics of the output with those of the appearance image, and because AdaIN is sensitive to object size, the method restricts it to the foreground object using an object mask produced with the unsupervised self-segmentation method from Patashnik et al. (Alaluf et al., 2023). This shows that “masking” in masked cross-image attention sharing can apply to attention logits, to attended features, or to image-statistics alignment coupled to attention-driven synthesis.
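A sketch of masked AdaIN follows, using the standard AdaIN definition (normalize with the content statistics, then rescale with the style statistics) and computing per-channel means and standard deviations over foreground pixels only. Leaving background pixels untouched is an assumption made for this illustration, as are the array names and toy masks.

```python
import numpy as np

def masked_stats(feat, mask, eps=1e-5):
    """Per-channel mean and std over foreground pixels.

    feat: (c, h, w), mask: (h, w) boolean
    """
    fg = feat[:, mask]                       # (c, n_foreground)
    return fg.mean(axis=1), fg.std(axis=1) + eps

def masked_adain(out_feat, app_feat, out_mask, app_mask):
    """Match foreground statistics of out_feat to those of app_feat."""
    mu_o, sd_o = masked_stats(out_feat, out_mask)
    mu_a, sd_a = masked_stats(app_feat, app_mask)
    normalized = (out_feat - mu_o[:, None, None]) / sd_o[:, None, None]
    restyled = normalized * sd_a[:, None, None] + mu_a[:, None, None]
    # Assumption: only foreground pixels are replaced; background is kept.
    return np.where(out_mask[None], restyled, out_feat)

rng = np.random.default_rng(0)
c, h, w = 3, 6, 6
out_feat = rng.normal(size=(c, h, w))
app_feat = 2.0 * rng.normal(size=(c, h, w)) + 1.0
out_mask = np.zeros((h, w), dtype=bool)
out_mask[1:4, 1:4] = True                    # small object in the output
app_mask = np.zeros((h, w), dtype=bool)
app_mask[2:6, 0:3] = True                    # larger object in the appearance image

result = masked_adain(out_feat, app_feat, out_mask, app_mask)
```

Computing the statistics inside the masks keeps a large background area from dominating the mean and variance when the two objects differ in size.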

4. Architectural realizations across domains

The mechanism is instantiated within distinct system architectures, each tailored to a different transfer problem.

Paper	Core architecture	Masked sharing placement
CAT-Net (Lin et al., 2023)	MIFE, CMAT, iterative refinement	Foreground-constrained support-query interaction
Zero-shot appearance transfer (Alaluf et al., 2023)	Stable Diffusion denoising with inversion	Decoder self-attention replaced at selected layers and timesteps
MSCA (Zheng et al., 2019)	Feature extraction, feature alignment, SPADE synthesis	Multi-scale spatial attention, feature masking, channel attention
InstantFamily (Kim et al., 2024)	SD1.5, ControlNet, face encoder, OpenPose	Masked cross-attention in UNet and ControlNet

CAT-Net contains three main components: Mask Incorporated Feature Extraction, the Cross Masked Attention Transformer, and an iterative refinement framework. The support mask is pooled with the support feature and concatenated back; the query mask prediction from the current stage is also concatenated with query features; a simple classifier gives an initial query mask; and CMAT is then applied repeatedly so that the predicted query mask from one iteration is fed into the next (Lin et al., 2023). The final model uses four CMAT blocks as a trade-off between accuracy and efficiency.

The diffusion-based appearance-transfer formulation is selective in both depth and time. Standard self-attention in the U-Net decoder is replaced at two resolutions, and in the appendix the injection is applied only during timesteps 10 to 70 for the first resolution and 10 to 90 for the second (Alaluf et al., 2023). Outside those intervals, the layer behaves like standard self-attention. The same work also describes an optional structure injection variant in which, in some intervals, the output keys and values are replaced with those of the structure image rather than the appearance image.

MSCA is applied at all scales because both global color tone and local appearances are informative for style-constrained synthesis. The aligned exemplar features and target semantic features are then fed into a SPADE-based synthesis module at each scale (Zheng et al., 2019). This multi-scale deployment allows coarse scene attributes and fine-grained local texture to be aligned within the same framework.

InstantFamily is built on Stable Diffusion 1.5 (SD1.5), ControlNet, a face encoder, and an OpenPose-based pose control image. Its multimodal embedding stack concatenates the text embedding with the stacked multi-ID face embeddings, where each face contributes both a global feature and a local feature that are projected into a face token sequence (Kim et al., 2024). The masked cross-attention mechanism is then applied in both UNet and ControlNet.

5. Training and inference behavior

The training and inference behavior of these systems reveals that masked cross-image attention sharing is often embedded in a larger control loop rather than used as a single isolated operator. CAT-Net defines iterative refinement as a recurrence over CMAT applications. Each iteration enhances support and query features using masked cross attention, produces a query prediction and a new dilated mask, and feeds that dilated mask into the next round (Lin et al., 2023). A plausible implication is that masking becomes progressively more informative as the query estimate becomes more reliable.

MSCA addresses the lack of paired exemplar-target data with patch-based self-supervision. Two non-overlapping patches are sampled from the same scene, and the network is trained with self-reconstruction and cross-reconstruction objectives from four synthesized outputs (Zheng et al., 2019). This training protocol uses only semantically parsed images and does not require video data.

The zero-shot appearance-transfer method does not train at all. It inverts both input images into latent space using edit-friendly DDPM inversion, initializes the output latent with the structure latent, and denoises the appearance and output branches in parallel while replacing decoder self-attention layers with cross-image attention (Alaluf et al., 2023). To stabilize raw cross-image attention, it adds attention map contrasting and appearance guidance, so that denoising is biased toward appearance transfer while retaining the stabilizing effect of self-attention.
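The two stabilizers can be illustrated with generic operations. The sketch below assumes a mean-centered contrast stretch for the attention maps and a classifier-free-guidance-style extrapolation between two noise predictions; these forms, the parameter names `beta` and `alpha`, and the toy values are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def contrast(attn, beta):
    """Stretch attention weights away from their row mean, then clip to [0, 1].

    beta > 1 sharpens the map; beta == 1 leaves it unchanged.
    """
    mean = attn.mean(axis=-1, keepdims=True)
    return np.clip((attn - mean) * beta + mean, 0.0, 1.0)

def guided_noise(eps_self, eps_cross, alpha):
    """Extrapolate from the self-attention prediction toward the cross-image one."""
    return eps_self + alpha * (eps_cross - eps_self)

attn = np.array([[0.1, 0.2, 0.7]])           # one toy attention row
sharp = contrast(attn, beta=2.0)             # weight concentrates on the strongest key

rng = np.random.default_rng(0)
eps_self = rng.normal(size=(4, 4))           # stand-in noise predictions
eps_cross = rng.normal(size=(4, 4))
eps = guided_noise(eps_self, eps_cross, alpha=2.0)
```

After clipping, a row no longer has to sum to one, so an implementation may choose to renormalize; that choice is left open here.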

InstantFamily is trained with a fixed number of faces, random stacking order of face embeddings, and pose and mask extraction from multi-face images (Kim et al., 2024). At inference time, the identity image or images and the pose control image can come from different sources, and the model is zero-shot for new identities. The paper explicitly states that it can handle more identities than it was trained with, which the authors attribute to the scalable embedding stack and masking strategy.

6. Empirical evidence, limitations, and recurring interpretations

The empirical record across these papers treats masking not as an auxiliary detail but as the component that makes cross-image sharing usable. CAT-Net reports Dice scores on three public datasets. In Setting 2, the method achieves 70.88% on Abd-CT, 75.22% on Abd-MRI, and 79.36% on Card-MRI (Lin et al., 2023). Its ablation study shows S $\rightarrow$ Q at 66.72 Dice, Q $\rightarrow$ S at 65.98 Dice, and S $\leftrightarrow$ Q at 68.62 Dice; with iterative refinement these become 67.68, 66.54, and 70.88, respectively. This supports the claim that mutual masked interaction plus iteration is superior to one-way transfer or non-iterative inference.
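The differences implied by these ablation figures can be worked out directly, in Dice points. The gain from iterative refinement in each direction is

$67.68 - 66.72 = 0.96 \;(\text{S} \rightarrow \text{Q}), \qquad 66.54 - 65.98 = 0.56 \;(\text{Q} \rightarrow \text{S}), \qquad 70.88 - 68.62 = 2.26 \;(\text{S} \leftrightarrow \text{Q}),$

and the gain of the bidirectional variant over the better one-way variant is

$68.62 - 66.72 = 1.90 \;(\text{without iteration}), \qquad 70.88 - 67.68 = 3.20 \;(\text{with iteration}).$

On these numbers, iteration helps the bidirectional variant most, and the margin between bidirectional and one-way transfer widens once iteration is enabled.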

MSCA’s ablation table similarly shows that the full model outperforms both a global-average-pooling variant and a version without feature masking. On the duplicating setting, the full method reaches 16.50 PSNR / 0.40 SSIM / 0.420 LPIPS / 84.93 FID (Zheng et al., 2019). The “MSCA w/o att” ablation performs dramatically worse, and the “MSCA w/o fm” ablation performs noticeably worse than the full model, which the paper interprets as evidence that both attention and masking are essential for semantically coherent transfer across arbitrary scenes.

In zero-shot appearance transfer, the reported findings are more strongly qualitative but remain consistent with the same pattern. Baseline cross-image attention alone gives semantic transfer but artifacts; contrast improves locality and reduces artifacts; AdaIN improves color and style alignment; and appearance guidance improves final quality further (Alaluf et al., 2023). Quantitatively, the work evaluates structure preservation via IoU of masks and appearance fidelity via Gram-matrix distance, and reports a strong balance between the two, competitive overall performance, and user preference for appearance fidelity and overall quality.

InstantFamily reports improvements in both single-ID and multi-ID preservation. Its single-ID identity preservation is 0.799 ± 0.086, and for the proposed multi-ID metric it improves from FastComposer: 1.392 ± 0.319 to InstantFamily: 1.620 ± 0.153 (Kim et al., 2024). The paper attributes this to reduced identity mixing, better separation across multiple subjects, and better control over composition.

Several misconceptions are corrected by these results. Masked cross-image attention sharing is not simply global style transfer; in these papers it is correspondence-aware and often part-by-part. It is not necessarily a one-way mechanism; CAT-Net explicitly studies S $\rightarrow$ Q, Q $\rightarrow$ S, and S $\leftrightarrow$ Q, and the best performance comes from the bidirectional variant (Lin et al., 2023). It is also not restricted to masking attention matrices alone: MSCA masks aggregated exemplar features, and the appearance-transfer method masks AdaIN statistics rather than the attention operator itself (Zheng et al., 2019; Alaluf et al., 2023).

The limitations are likewise domain-specific. Zero-shot appearance transfer notes that if two objects do not share meaningful semantics, transfer is harder; performance depends on inversion quality; and the random seed can affect editability and final result (Alaluf et al., 2023). InstantFamily reports that pose estimation errors can lead to bad anatomy, faces near image edges can be problematic, identity mixing is improved but not eliminated, and self-attention may still contribute to mixing (Kim et al., 2024). These limitations suggest that masking reduces spurious sharing but does not eliminate all failure modes arising from weak correspondence, imperfect control signals, or competing latent dynamics.