
PACMAC: Attention-Conditioned Masking in Transformers

Updated 10 February 2026
  • PACMAC is a family of techniques that uses transformer attention maps to guide selective masking, refining input focus without extra network complexity.
  • In vision tasks, it employs a two-stage unsupervised domain adaptation strategy with attention consistency, achieving significant accuracy gains on benchmarks like DomainNet.
  • For summarization, PACMAC applies inference-time masking to restrict attention to salient tokens, enhancing ROUGE scores and mitigating non-salient copying.

Attention-Conditioned Masking (PACMAC) encompasses a family of techniques that exploit transformer attention distributions to inform dynamic input masking or selective content biasing. Distinct PACMAC variants have been proposed for different modalities and tasks, including unsupervised domain adaptation for self-supervised vision transformers (ViTs) and inference-time content selection in encoder–decoder transformers for abstractive summarization. These approaches leverage the structure of learned attention maps—either to probe consistency under attentive region masking for self-training, or to block attention to non-salient source positions to force sharper content selection—yielding robust empirical gains with minimal architectural or optimization complexity.

1. PACMAC in Vision Transformers for Unsupervised Domain Adaptation

As introduced for self-supervised ViT adaptation, PACMAC is a two-stage unsupervised domain adaptation algorithm for models pretrained with self-supervised learning (SSL) objectives. It targets transfer learning settings in which a domain shift separates labeled source data from unlabeled target data (Prabhu et al., 2022).

1.1 Two-Stage Adaptation Algorithm

  • Stage 1: In-domain SSL “warm-up”. The Vision Transformer encoder is further pretrained on the union of source and target images using an SSL loss. This loss can reconstruct masked patches (e.g., MAE) or enforce invariance between augmented views (e.g., DINO). The goal is to align the SSL pretext objective with the requirements of the downstream classification task across both domains:

$$\mathcal{L}_{IDP} = \mathbb{E}_{x_S \sim P_S}[\mathcal{L}_{SSL}(m(x_S), x_S)] + \mathbb{E}_{x_T \sim P_T}[\mathcal{L}_{SSL}(m(x_T), x_T)]$$

  • Stage 2: Attention-Conditioned Masking Consistency. For each target image $x_T$, the per-patch self-attention map is computed from the final transformer layer. The average attention (over heads) from the classification token to each patch produces a probability distribution $a_T$ over spatial locations. Using a user-defined mask ratio $mr$ and committee size $k$, $k$ binary masks are constructed greedily such that each mask blocks out a different subset of the patches with highest cumulative attention, preserving $(1-mr)N$ patches per masked view. For each mask, the classifier prediction is computed. A target image is deemed “reliable”, and thus eligible for self-training, if it is either highly confident ($\max_y p_\Theta(y \mid x_T) > T$) or prediction-consistent across all masked views (i.e., the predicted label under each masking agrees with the unmasked prediction). Only such reliable target images are pseudo-labeled and used in the self-training loss alongside the source classification objective.
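The attention pooling and committee construction in Stage 2 can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names, the nested-list input layout, and the exact assignment rule are assumptions. In particular, it takes one reading of the mask description, in which the top k·(1 − mr)·N patches are dealt round-robin so that each view keeps (1 − mr)·N visible patches and the visible sets are disjoint.

```python
from typing import List


def cls_attention_distribution(attn_heads: List[List[float]]) -> List[float]:
    """Average CLS-to-patch attention over heads, then normalize to sum to 1.

    attn_heads[h][j] is the final-layer attention from the classification
    token to patch j in head h (hypothetical input layout).
    """
    num_heads = len(attn_heads)
    num_patches = len(attn_heads[0])
    mean = [sum(head[j] for head in attn_heads) / num_heads
            for j in range(num_patches)]
    total = sum(mean)
    return [value / total for value in mean]


def committee_masks(a_t: List[float], mask_ratio: float, k: int) -> List[List[bool]]:
    """Build k keep-masks; masks[i][j] is True where view i keeps patch j."""
    n = len(a_t)
    keep = round((1.0 - mask_ratio) * n)  # visible patches per masked view
    if k * keep > n:
        raise ValueError("k * (1 - mask_ratio) * N must not exceed N")
    # Patch indices ordered from most to least attended.
    order = sorted(range(n), key=lambda j: a_t[j], reverse=True)
    masks = [[False] * n for _ in range(k)]
    # Assumption: deal the top k * keep patches round-robin across the views.
    for rank, j in enumerate(order[: k * keep]):
        masks[rank % k][j] = True
    return masks


# Toy example: 2 heads, 8 patches, mask ratio 0.75, committee of 2.
heads = [
    [0.30, 0.05, 0.20, 0.05, 0.10, 0.10, 0.15, 0.05],
    [0.10, 0.05, 0.40, 0.05, 0.10, 0.10, 0.15, 0.05],
]
a_t = cls_attention_distribution(heads)
masks = committee_masks(a_t, mask_ratio=0.75, k=2)
for view in masks:
    print([j for j, kept in enumerate(view) if kept])
```

With these toy values the first view keeps patches 2 and 6 and the second keeps patches 0 and 4, so the most-attended patches are split across the committee rather than all shown to one view.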

1.2 Core Equations and Pseudocode

  • Attention Distribution:

$$\hat{a}_T^{(h)} = A_{0,1:N}^{(h)}, \quad \hat{a}_T = \frac{1}{M} \sum_{h=1}^M \hat{a}_T^{(h)}, \quad a_T = \frac{\hat{a}_T}{\sum_j \hat{a}_T^j}$$

  • Mask Construction: Patches are sorted by $a_T$; the top $(1-mr)N$ are assigned round-robin to the $k$ masks.
  • Reliability Condition:

$$r(x_T) = \begin{cases} 1, & \text{if } \big(\forall j,\; \arg\max_y p(y \mid m_j(x_T)) = \arg\max_y p(y \mid x_T)\big) \vee \big(\max_y p(y \mid x_T) > T\big) \\ 0, & \text{otherwise} \end{cases}$$

  • Overall Loss:

$$\mathcal{L}_{PACMAC} = \mathbb{E}_{(x_S, y_S) \sim P_S}[\mathcal{L}_{CE}(f(x_S), y_S)] + \alpha\, \mathbb{E}_{x_T \sim P_T}[r(x_T)\, \mathcal{L}_{CE}(f(m_k(x_T)), \hat y_T)]$$

  • Key Hyperparameters: mask ratio $mr$ (fraction of masked patches, e.g., 0.75), committee size $k$ (typically 2–3), confidence threshold $T$ (e.g., 0.5), self-training weight $\alpha$ (e.g., 0.1).
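The reliability condition and the overall loss can be made concrete on a toy batch. The sketch below is illustrative only: it operates on precomputed class probabilities instead of a real classifier, all function names are hypothetical, and it assumes that the self-training term is evaluated on a single masked view (here the last committee member, one reading of $m_k$) and averaged over all target examples, with unreliable ones contributing zero.

```python
import math
from typing import List, Sequence, Tuple


def argmax(probs: Sequence[float]) -> int:
    """Index of the largest probability."""
    return max(range(len(probs)), key=lambda i: probs[i])


def is_reliable(p_full: Sequence[float],
                p_views: Sequence[Sequence[float]],
                threshold: float) -> bool:
    """r(x_T): consistent across all masked views OR confident on the full image."""
    y_hat = argmax(p_full)
    consistent = all(argmax(p) == y_hat for p in p_views)
    confident = max(p_full) > threshold
    return consistent or confident


def cross_entropy(probs: Sequence[float], label: int) -> float:
    """Cross-entropy of a predicted distribution against a hard label."""
    return -math.log(probs[label])


def pacmac_loss(source: List[Tuple[List[float], int]],
                target: List[Tuple[List[float], List[List[float]]]],
                alpha: float, threshold: float) -> float:
    """Source cross-entropy plus alpha-weighted self-training on reliable targets."""
    source_term = sum(cross_entropy(p, y) for p, y in source) / len(source)
    target_sum = 0.0
    for p_full, p_views in target:
        if is_reliable(p_full, p_views, threshold):
            # Assumption: train on the last committee view against the pseudo-label.
            target_sum += cross_entropy(p_views[-1], argmax(p_full))
    return source_term + alpha * target_sum / len(target)


source = [([0.5, 0.5], 0)]
target = [
    ([0.6, 0.4], [[0.7, 0.3], [0.8, 0.2]]),    # consistent, so reliable
    ([0.45, 0.55], [[0.6, 0.4], [0.3, 0.7]]),  # inconsistent and not confident
]
loss = pacmac_loss(source, target, alpha=0.1, threshold=0.9)
print(loss)
```

Only the first target example passes the reliability check here, so the second contributes nothing to the self-training term.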

This approach removes the need for adversarial alignment or explicit clustering, integrating tightly with native ViT attention and masking structures.

2. Task-Specific PACMAC in Transformer Summarization

An alternative instantiation of attention-conditioned masking—here called attention head masking—addresses inference-time content selection in encoder–decoder transformers for abstractive summarization (Cao et al., 2021). In this context, masking is used not for self-consistency checking but to enforce the model’s focus on salient input tokens during generation.

2.1 Masking Mechanism and Integration

  • Saliency Determination: A binary saliency vector $s \in \{0,1\}^N$, with one entry per source token, is obtained from either oracle alignment (LCS) or an external RoBERTa+MLP tagger. From $s$, a logit mask $m_i$ is created: $m_i = 0$ if $s_i = 1$ (attendable), $m_i = -\infty$ otherwise.
  • Masked Attention Calculation: In selected heads, the mask matrix $\widetilde{M}^{(\ell,h)}$ is added to the attention logits before the softmax, forcibly zeroing attention to non-salient positions.
  • Head Selection: Heads are ranked by relative ROUGE gain under oracle masking; typically, 12–16 heads concentrated in the upper decoder layers are chosen for masking.
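The masking mechanism above amounts to adding 0 or negative infinity to the attention logits before the softmax, and only in the selected heads. The following plain-Python sketch shows this for one decoding step in a single layer. The function names, the per-head list layout, and the choice of which head is masked are assumptions made for illustration, not the interface of any particular summarization library.

```python
import math
from typing import List, Sequence, Set

NEG_INF = float("-inf")


def saliency_logit_mask(s: Sequence[int]) -> List[float]:
    """m_i = 0 for salient tokens (s_i = 1), -inf for non-salient ones."""
    return [0.0 if s_i == 1 else NEG_INF for s_i in s]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax; -inf logits receive exactly zero weight."""
    peak = max(logits)
    if peak == NEG_INF:
        raise ValueError("at least one source position must remain attendable")
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def masked_cross_attention(logits_per_head: Sequence[Sequence[float]],
                           s: Sequence[int],
                           masked_heads: Set[int]) -> List[List[float]]:
    """Attention weights per head, masking only the heads in masked_heads."""
    mask = saliency_logit_mask(s)
    weights = []
    for h, logits in enumerate(logits_per_head):
        if h in masked_heads:
            # Add the logit mask before the softmax in selected heads only.
            logits = [x + m for x, m in zip(logits, mask)]
        weights.append(softmax(logits))
    return weights


# Two heads with identical logits over three source tokens; token 1 is non-salient.
logits = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
weights = masked_cross_attention(logits, s=[1, 0, 1], masked_heads={0})
print(weights[0])  # masked head: zero weight on token 1
print(weights[1])  # unmasked head: unchanged softmax
```

Because the mask is added to the logits and not multiplied into the weights, the remaining salient positions are automatically renormalized to sum to one.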

2.2 Decoding Procedure

During decoding, the masking described above is applied at each step for relevant heads. The process includes saliency tagging, mask matrix construction, and modified (masked) attention computation within the beam decoding loop. No change is made to training objectives; masking is strictly applied at inference.

3. Empirical Results and Benchmarks

PACMAC demonstrates consistent empirical advantages across diverse benchmarks. In vision domain adaptation (Prabhu et al., 2022):

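| Benchmark | Method | MAE Initialization | DINO Initialization |
|---|---|---|---|
| OfficeHome | PACMAC | 66.8% | 69.7% |
| OfficeHome | SENTRY | 65.5% | 69.5% |
| OfficeHome | Source Only | 59.5% | 65.7% |
| DomainNet | PACMAC | 81.6% | 81.0% |
| DomainNet | SENTRY | 80.5% | 80.4% |
| DomainNet | Source Only | 70.1% | 76.0% |

VisDA (a single value is reported per method):

| Benchmark | Method | Accuracy |
|---|---|---|
| VisDA | PACMAC | 81.0% |
| VisDA | SENTRY | 76.0% |
| VisDA | Source Only | 68.9% |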

In summarization (Cao et al., 2021), masking yields significant ROUGE-1/2/L improvements on CNN/DailyMail and NYT when benchmarked against BART, with masked models retaining or exceeding informativeness and faithfulness in human judgments. Masked BART models achieved full-data performance after exposure to as little as 10%–20% of training data and improved cross-domain generalization when the selector was trained in the target domain.

4. Theoretical Rationale and Mode of Operation

PACMAC leverages the observation that model attention maps encode information about task-relevant regions—object-centric in images, salient tokens in text. By systematically masking high-attention regions and measuring prediction stability, PACMAC for vision enables reliable pseudo-label selection even under substantial domain shift. Conversely, in summarization, applying hard attention masks at inference sharply constrains the model to operate on source segments deemed relevant, reducing drift and non-salient copying, particularly benefiting extractive or mixed extractive–abstractive summarization styles.

A plausible implication is that PACMAC mechanisms directly leverage the interpretability and flexibility afforded by transformer attention maps, bridging the gap between architecture-native supervision (SSL masking) and downstream adaptation or control without requiring auxiliary networks or complex adversarial objectives.

5. Limitations and Future Directions

Attention-conditioned masking, while effective, has modality-specific limitations. In highly abstractive summarization domains (e.g., XSum), forcibly masking to copy-style saliency can be detrimental or neutral since generation diverges from extractive alignment (Cao et al., 2021). In vision, extreme mask ratios or large consistency committees may induce prediction instability, reducing the reliable target set. Both variants currently rely on externally determined saliency (in summarization) or simple self-attention pooling (in vision) that could potentially be improved via adaptive, learnable, or jointly optimized masking policies.

Future work identified includes dynamic per-example head selection, soft masking schedules, and closer integration or joint training of mask selector and primary model, possibly extending PACMAC applicability to more generative or multimodal scenarios (Cao et al., 2021).

6. Relationship to Prior and Contemporary Methods

PACMAC for vision is distinguished from alignment and self-training-based domain adaptation methods by its use of ViT’s native masking/attention mechanisms in both pre-adaptation and consistency probing, entirely bypassing adversarial losses (as in CDAN, MCC) or explicit clustering. In summarization, PACMAC differs from traditional training-time alignment by introducing inference-only masking, reshaping content selection in trained models without the need for retraining.

Notably, both streams of PACMAC research were introduced in the context of transformers—ViTs for vision (Prabhu et al., 2022) and encoder–decoder models for text (Cao et al., 2021)—where access to interpretable, multi-head attention maps enables effective conditional masking policies aligned with the transformer’s computational structure.
