SA-TE: Self-Attention with Text+Embeddings
- SA-TE is an approach that integrates multi-head self-attention maps with pooled Transformer embeddings to capture both semantic content and token alignment patterns.
- It employs 2D convolutional filters to process aggregated attention tensors, producing task-informed features for effective counterfactual detection and span prediction.
- The modular pipeline allows joint fine-tuning with task-specific heads, leveraging both deep contextual embeddings and structural attention cues for robust natural language understanding.
Self-Attention with Text+Embeddings (SA-TE) refers to architectural strategies that integrate self-attention weights from pre-trained Transformer models with pooled contextual embeddings, producing fused representations for downstream tasks. The canonical use-case, as exemplified by Patil and Baths in the SemEval-2020 Task 5 system, targets counterfactual detection via joint exploitation of multi-head attention maps and tokenwise encodings, employing subsequent convolutional and fusion operations (Patil et al., 2020). This paradigm tightly binds distributional semantics captured by contextual embeddings and the structural, token-alignment patterns revealed by attention, resulting in enriched features for sequence-level reasoning and span prediction.
1. Core Architectural Elements
The SA-TE model pipeline systematically incorporates both contextual embeddings and multi-head self-attention weights extracted from multiple upper layers of a pre-trained Transformer encoder. For an input sequence of n tokens (after wordpiece tokenization, with boundary tokens [CLS] and [SEP]), the L-layer Transformer produces per-layer hidden states and associated self-attention tensors. SA-TE specifically collects the final three layers’ outputs, forming:
- Token embeddings: E = concat(H^(L−2), H^(L−1), H^(L)), of shape n × 3d
- Stacked attention maps: A, of shape H′ × n × n, where H′ = 3H for H heads per layer
The token embeddings are pooled (typically via the [CLS] token or mean pooling), yielding e_pool. Simultaneously, the raw attention tensor A is processed by a 2D-CNN (with m filters of spatial extent k × k) to produce convolved features F, which are then flattened and projected through a linear block to obtain e_att. The two vectors are concatenated and normalized, z′ = LayerNorm(concat(e_pool, e_att)), before being passed to classification or regression heads (Patil et al., 2020).
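A minimal NumPy shape check illustrates how the last three layers' outputs are collected; the sizes below (n = 8 tokens, d = 16, H = 12 heads) are illustrative stand-ins, not the paper's actual dimensions:

```python
import numpy as np

# Illustrative sizes only: n tokens, hidden size d, H heads, 3 selected layers.
rng = np.random.default_rng(0)
n, d, H = 8, 16, 12
hidden = [rng.standard_normal((n, d)) for _ in range(3)]  # H^(L-2), H^(L-1), H^(L)
attn = [rng.random((H, n, n)) for _ in range(3)]          # per-layer head maps

E = np.concatenate(hidden, axis=-1)   # token embeddings, shape n x 3d
A = np.concatenate(attn, axis=0)      # stacked attention maps, shape 3H x n x n
e_pool = E[0]                         # [CLS]-position pooling over concatenated states
print(E.shape, A.shape, e_pool.shape)
```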
2. Detailed Processing Pipeline
The SA-TE sequence comprises the following operations:
- Tokenization and Encoding: Raw text is tokenized to subword units, augmented with special classification/separation tokens. Tokens are forwarded through a pre-trained Transformer, generating hidden states and attention maps at each layer.
- Feature Extraction:
- Embeddings: For each selected layer, hidden representations are concatenated to form a comprehensive embedding tensor.
- Attention: For all heads in the selected layers, attention matrices are aggregated into a 3D tensor.
- 2D-CNN Processing: The attention tensor is processed with a set of 2D convolutional filters covering all channels (attention heads across layers), yielding feature maps that are passed through ReLU (optionally batch-norm and dropout) and flattened.
- Fusion and Normalization: The convolved (attention-derived) and pooled (embedding-derived) features are concatenated and normalized, forming a fused feature vector.
- Prediction: Task-specific heads are attached:
- For binary classification (counterfactuality): a sigmoid-activated dense layer with binary cross-entropy loss.
- For span regression (antecedent/consequent detection): a linear head predicting 4-dim span vectors, optimized via smooth L1 loss. (Patil et al., 2020)
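The two loss functions named above can be sketched directly in NumPy; these are standard formulations of binary cross-entropy and smooth L1 (Huber) loss, with an assumed beta of 1.0 for the smooth L1 transition point:

```python
import numpy as np

def bce(y_hat, y_true, eps=1e-12):
    """Binary cross-entropy on a sigmoid-activated score."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -(y_true * np.log(y_hat) + (1 - y_true) * np.log(1 - y_hat))

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 (Huber) loss, averaged over the 4-dim span vector."""
    diff = np.abs(pred - target)
    per_dim = np.where(diff < beta, 0.5 * diff**2 / beta, diff - 0.5 * beta)
    return per_dim.mean()

print(round(bce(0.9, 1.0), 4))   # small loss for a confident correct prediction
print(smooth_l1(np.array([3., 10., 12., 20.]),
                np.array([3., 11., 12., 18.])))
```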
3. Practical Implementation and Fine-Tuning
The base architecture is kept fixed during joint or staged fine-tuning. Fine-tuning proceeds in two stages:
- Stage 1: Attach the binary classifier head; fine-tune on counterfactual detection until convergence using AdamW with weight decay.
- Stage 2: Replace the classifier with the span regression head; continue fine-tuning on the span detection objective. No auxiliary losses beyond span loss are applied.
Hyperparameters, such as learning rates and dropout rates, are selected via development-set tuning. Early stopping, validation splits, and conventional hardware (e.g., multiple GPUs) are used for efficient convergence (Patil et al., 2020).
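The two-stage schedule with early stopping can be sketched as follows; `train_step` and its "dev loss" are toy stand-ins for the actual training code, which the paper does not publish in this form:

```python
import numpy as np

rng = np.random.default_rng(1)

def train_step(head, lr=1e-2):
    head["W"] -= lr * (head["W"] - 1.0)           # placeholder gradient step
    return float(np.abs(head["W"] - 1.0).mean())  # proxy for a dev-set loss

def fine_tune(head, patience=3, max_epochs=100):
    best, waited = np.inf, 0
    for _ in range(max_epochs):
        loss = train_step(head)
        if loss < best - 1e-6:
            best, waited = loss, 0
        else:
            waited += 1
            if waited >= patience:                # early stopping on dev split
                break
    return best

cls_head = {"W": rng.standard_normal((1, 8))}     # Stage 1: binary classifier
stage1_loss = fine_tune(cls_head)
reg_head = {"W": rng.standard_normal((4, 8))}     # Stage 2: span-regression head
stage2_loss = fine_tune(reg_head)
print(stage1_loss < 1.0, stage2_loss < 1.0)
```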
4. Attention Analysis and Linguistic Interpretability
Qualitative analysis of attention-weight matrices after training reveals head specialization:
- Heads consistently attend to auxiliary modals signaling counterfactuality (e.g., “could”, “would”).
- Heads exhibit high attention to conditional conjunctions (“if”, “but”).
- Heads specialize in numeric values or named entities, typically relevant to the consequent clause.
- Heads focus on punctuation proximate to span boundaries.
These specializations are confirmed by both inspection of matrix values and visualization (e.g., token-level attention heatmaps), providing insight into the emergent linguistic structure extracted by the SA-TE model (Patil et al., 2020).
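A simple probe of this kind can be reproduced with NumPy; the code below is an illustrative measurement (not the paper's analysis code) of how much attention mass each head places on counterfactual trigger tokens, compared against a uniform baseline:

```python
import numpy as np

# Toy example sentence; trigger set follows the modals/conjunctions noted above.
tokens = ["[CLS]", "if", "he", "would", "have", "left", ".", "[SEP]"]
triggers = sorted(i for i, t in enumerate(tokens) if t in {"if", "would", "could"})

rng = np.random.default_rng(0)
n, H = len(tokens), 4
A = rng.random((H, n, n))
A /= A.sum(axis=-1, keepdims=True)    # each row is an attention distribution

# Mean attention mass on trigger positions, per head; a specialized head would
# score well above the uniform baseline len(triggers)/n.
mass = A[:, :, triggers].sum(axis=-1).mean(axis=-1)
print(mass.shape, round(len(triggers) / n, 3))
```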
5. Comparative Context and Related Designs
SA-TE’s approach to integrating self-attention with text embeddings stands in contrast to canonical meta-embedding designs such as “Duo” (2003.01371), which fuse multiple pretrained word embeddings via a self-attentive mechanism. While Duo meta-embedding achieves state-of-the-art performance in text classification and extends to machine translation by embedding the fusion principle at every multi-head attention site, SA-TE operates at the level of Transformer feature fusion rather than explicit meta-embedding of disjoint sources.
Distinctive elements of SA-TE include explicit convolutional processing of stacked self-attention maps to derive task-informed features, and the direct fusion with high-dimensional contextual embeddings. A plausible implication is that this architectural motif is well-suited for structured reasoning tasks (e.g., counterfactual detection, span annotation) where attention patterns encode relevant semantic-syntactic dependencies not fully captured by tokenwise vector aggregation.
6. Pseudocode and Operational Outline
The SA-TE algorithmic process, as published, is summarized below:
Inputs: raw text; pre-trained Transformer of L layers; conv filters {W_conv, b_conv};
linear heads {W_f, b_f}, {W_cls, b_cls}, {W_reg, b_reg}.
Outputs: binary score ŷ (Task1) or spans ŝ (Task2).
1. Tokenize text → tokens[1..n], add [CLS], [SEP].
2. Compute hidden states and attentions:
for ℓ in 1..L:
obtain H^(ℓ)[1..n] and attention weights {A^(ℓ,h)}_{h=1..H}
Stack last-three layers:
E ← concat(H^(L−2),H^(L−1),H^(L)) # shape n×3d
A ← stack({A^(L−2,h),A^(L−1,h),A^(L,h)}) # shape H'×n×n
3. Pool transformer embeddings:
e_pool ← H^(L)[CLS] # or mean‐pool over n tokens
4. Convolutional feature extraction:
F ← Conv2D(A, W_conv, b_conv)
F ← ReLU(F) # shape m×(n−k+1)×(n−k+1)
f_flat ← flatten(F)
e_att ← ReLU(W_f·f_flat + b_f) # shape 3d
5. Fuse and normalize:
z ← concat(e_pool, e_att) # shape 6d
z' ← LayerNorm(z)
6. Task-specific head:
if Task1:
ŷ ← sigmoid(W_cls·z' + b_cls)
loss ← BCE(ŷ, y_true)
else if Task2:
ŝ ← ReLU(W_reg·z' + b_reg) # 4-dim vector
loss ← SmoothL1(ŝ, s_true)
7. Backpropagate loss, update all parameters via AdamW.
Return ŷ or ŝ.
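The outline above can be turned into a runnable NumPy sketch of a single forward pass; shapes and weights below are illustrative and randomly initialized (the paper's hyperparameters and trained parameters are not reproduced), and the linear projection maps to d rather than 3d for compactness:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d_valid(A, W, b):
    """Naive valid-mode 2D convolution: A (C,n,n), W (m,C,k,k), b (m,)."""
    C, n, _ = A.shape
    m, _, k, _ = W.shape
    out = np.empty((m, n - k + 1, n - k + 1))
    for f in range(m):
        for i in range(n - k + 1):
            for j in range(n - k + 1):
                out[f, i, j] = np.sum(A[:, i:i+k, j:j+k] * W[f]) + b[f]
    return out

def sa_te_forward(H_last, A, params, task=1):
    """One SA-TE forward pass following the pseudocode steps 3-6."""
    e_pool = H_last[0]                                    # step 3: [CLS] pooling
    F = np.maximum(conv2d_valid(A, params["W_conv"], params["b_conv"]), 0)  # step 4
    e_att = np.maximum(params["W_f"] @ F.reshape(-1) + params["b_f"], 0)
    z = np.concatenate([e_pool, e_att])                   # step 5: fuse
    z = (z - z.mean()) / np.sqrt(z.var() + 1e-5)          # LayerNorm
    if task == 1:                                         # step 6: binary score
        return 1 / (1 + np.exp(-(params["W_cls"] @ z + params["b_cls"])))
    return np.maximum(params["W_reg"] @ z + params["b_reg"], 0)  # 4-dim spans

# Illustrative sizes: n=8 tokens, d=16, C=36 channels (3 layers x 12 heads),
# m=4 filters of spatial extent k=3.
n, d, C, m, k = 8, 16, 36, 4, 3
flat = m * (n - k + 1) ** 2
params = {
    "W_conv": rng.standard_normal((m, C, k, k)) * 0.1, "b_conv": np.zeros(m),
    "W_f": rng.standard_normal((d, flat)) * 0.1, "b_f": np.zeros(d),
    "W_cls": rng.standard_normal(2 * d) * 0.1, "b_cls": 0.0,
    "W_reg": rng.standard_normal((4, 2 * d)) * 0.1, "b_reg": np.zeros(4),
}
H_last = rng.standard_normal((n, d))
A = rng.random((C, n, n))
A /= A.sum(axis=-1, keepdims=True)    # rows behave like attention distributions

y_hat = sa_te_forward(H_last, A, params, task=1)
s_hat = sa_te_forward(H_last, A, params, task=2)
print(0.0 < y_hat < 1.0, s_hat.shape)
```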
7. Significance and Applicability
The SA-TE strategy demonstrates that coupling the rich, context-aware semantic representations of Transformer models with heterogeneous information encoded by multi-head attention matrices can substantially enhance performance on tasks demanding causal and counterfactual reasoning, as well as structured span prediction. The architectural modularity of SA-TE allows universal adaptation to Transformer backbones with little overhead, leveraging both the direct semantic content of embeddings and the relational/structural information of self-attention weights (Patil et al., 2020). This suggests broader applicability to other natural language understanding tasks where explicit exploitation of attention patterns is desirable.