
Order-Level Attention (OLA)

Updated 14 November 2025
  • Order-Level Attention (OLA) is a mechanism that generalizes traditional self-attention by incorporating higher-order (e.g., pairwise, triplet) interactions among tokens.
  • OLA decomposes the attention rollout across model layers to reveal latent structures and enrich context aggregation in deep neural architectures.
  • Practical applications of OLA include enhancing transformer performance, enabling cross-model adapter transfer, and providing diagnostic insights into model behavior.

Order-Level Attention (OLA), also termed higher-order attention, encompasses a spectrum of mechanisms for modeling and analyzing feature interactions among tokens in deep neural architectures. OLA generalizes conventional (first-order) attention by capturing not only linear but also higher-order (e.g., pairwise, triplet, and beyond) interactions, and includes methods for decomposing or parameterizing these interactions both within models and across different model instances. As such, OLA provides both a principled modeling tool for richer context aggregation and a diagnostic for uncovering shared latent structures across pre-trained LLMs.

1. Formal Definition and Mathematical Framework

Conventional self-attention mechanisms, as exemplified by the Transformer architecture, operate at first order: each query $q \in \mathbb{R}^d$ computes alignment scores against keys $k_i \in \mathbb{R}^d$ using a linear compatibility function (e.g., dot product), followed by a softmax-weighted sum over value vectors $v_i \in \mathbb{R}^d$. This linear interaction is limited to additive (first-order) relationships between query and key.
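Concretely, the first-order mechanism computes

$$\mathrm{Attn}(q, \{k_i\}, \{v_i\}) = \sum_i \alpha_i v_i, \qquad \alpha_i = \mathrm{softmax}_i\!\left(\frac{q^\top k_i}{\sqrt{d}}\right),$$

so the query and each key interact only through a single inner product.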

Order-Level Attention generalizes this by introducing explicit higher-order interactions. For the second-order case, a Bilinear Attention Block maps the interaction to

$$B_i^k = \mathrm{ReLU}(W_k k_i) \odot \mathrm{ReLU}(W_q^k q) \in \mathbb{R}^d,$$

where $W_k, W_q^k \in \mathbb{R}^{d \times d}$ and $\odot$ denotes element-wise multiplication. This term captures pairwise interactions (outer products in embedding space) between each dimension of $q$ and $k_i$. Stacking multiple such blocks extends this to higher $p$-order interactions; optionally, replacing ReLU with ELU and leveraging the Taylor expansion of exponentials yields a continuous, infinite-order feature fusion.

In an orthogonal but related approach, OLA may refer to an order-wise decomposition of the cumulative context-aggregation matrix in a transformer. The rollout of attention across $N$ layers,

$$\hat{A} = \prod_{i=1}^N \left(A^{(i)} + I\right),$$

with $A^{(i)}$ as the $i$th layer's attention matrix and $I$ as the identity, can be expanded into a sum over all possible paths grouped by order (number of attention steps). The order-$k$ OLA component is the mean of all $k$-step path products:

$$\hat{A}^{(k)} = \frac{1}{\binom{N}{k}} \sum_{1 \leq i_1 < \cdots < i_k \leq N} A^{(i_k)} A^{(i_{k-1})} \cdots A^{(i_1)}.$$

Here $\hat{A}^{(0)} = I$ (pure skip paths), while $\hat{A}^{(1)}$ corresponds to all single-layer attention applications, and so on up to $\hat{A}^{(N)}$.
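As a concrete check, with $N = 2$ the rollout expands as $\hat{A} = (A^{(2)} + I)(A^{(1)} + I) = I + A^{(1)} + A^{(2)} + A^{(2)} A^{(1)}$; grouping terms by the number of attention steps and normalizing gives $\hat{A}^{(0)} = I$, $\hat{A}^{(1)} = \tfrac{1}{2}(A^{(1)} + A^{(2)})$, and $\hat{A}^{(2)} = A^{(2)} A^{(1)}$, so that $\hat{A} = \sum_k \binom{N}{k} \hat{A}^{(k)}$.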

2. Algorithmic Realizations

Bilinear Attention Block Construction

For feature modeling, the Bilinear Attention Block proceeds as follows:

  • Query–Key Bilinear Interaction:
    • $B_i^k = \mathrm{ReLU}(W_k k_i) \odot \mathrm{ReLU}(W_q^k q)$
  • Contextual Attention Weighting:
    • $B_i^{\prime k} = \mathrm{ReLU}(W_B^k B_i^k)$
    • $b_i^s = W_b B_i^{\prime k}$, $\beta^s = \mathrm{softmax}(b^s)$
  • Channel-Wise Attention (Squeeze-Excitation):
    • $\bar B = \frac{1}{n} \sum_{i=1}^{n} B_i^{\prime k}$
    • $b^c = W_e \bar B$, $\beta^c = \sigma(b^c)$
  • Query–Value Bilinear Enhancement:
    • $B_i^v = \mathrm{ReLU}(W_v v_i) \odot \mathrm{ReLU}(W_q^v q)$
    • Output: $\hat{v} = \beta^c \odot \sum_{i=1}^n \beta^s_i B_i^v$

Stacking $N$ Bilinear Attention Blocks produces interactions up to order $2N$. For infinite-order interactions, replace $\mathrm{ReLU}$ with $\mathrm{ELU}$: products of the form $\exp(W_X X) \odot \exp(W_Y Y)$ expand into all possible interaction orders via the Taylor series of the exponential.
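For reference, the following is a minimal PyTorch sketch of a single Bilinear Attention Block implementing the steps above; the module and parameter names, the bias-free linear layers, and the `use_elu` switch are illustrative assumptions rather than the original implementation.

```python
import torch
import torch.nn as nn

class BilinearAttentionBlock(nn.Module):
    """Sketch of one Bilinear Attention Block (names are illustrative)."""

    def __init__(self, d: int, use_elu: bool = False):
        super().__init__()
        # ELU in place of ReLU approximates the infinite-order (exponential) fusion.
        self.act = nn.ELU() if use_elu else nn.ReLU()
        self.W_k = nn.Linear(d, d, bias=False)   # key projection
        self.W_qk = nn.Linear(d, d, bias=False)  # query projection (key side)
        self.W_B = nn.Linear(d, d, bias=False)   # contextual transform of B_i^k
        self.w_b = nn.Linear(d, 1, bias=False)   # spatial attention logits b_i^s
        self.W_e = nn.Linear(d, d, bias=False)   # channel (squeeze-excitation) weights
        self.W_v = nn.Linear(d, d, bias=False)   # value projection
        self.W_qv = nn.Linear(d, d, bias=False)  # query projection (value side)

    def forward(self, q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
        # q: (batch, d); K, V: (batch, n, d)
        B = self.act(self.W_k(K)) * self.act(self.W_qk(q)).unsqueeze(1)    # B_i^k, (batch, n, d)
        Bp = self.act(self.W_B(B))                                         # B_i'^k
        beta_s = torch.softmax(self.w_b(Bp).squeeze(-1), dim=-1)           # spatial weights, (batch, n)
        beta_c = torch.sigmoid(self.W_e(Bp.mean(dim=1)))                   # channel weights, (batch, d)
        Bv = self.act(self.W_v(V)) * self.act(self.W_qv(q)).unsqueeze(1)   # B_i^v, (batch, n, d)
        return beta_c * (beta_s.unsqueeze(-1) * Bv).sum(dim=1)             # v_hat, (batch, d)
```

Per the $2N$ rule above, stacking two such blocks would capture interactions up to fourth order.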

Efficient OLA Computation for Transformers

Given a suite of attention matrices $\{A^{(1)}, \ldots, A^{(N)}\}$ (each $L \times L$):

  1. For each $k$ in $0..K$ (with $K \leq 3$ in most empirical cases), enumerate all index subsets $C$ of size $k$ from $\{1, \ldots, N\}$.
  2. For each $C = \{i_1 < \cdots < i_k\}$, compute the product $P_C = A^{(i_k)} A^{(i_{k-1})} \cdots A^{(i_1)}$.
  3. Average over all such $P_C$ to form $\hat{A}^{(k)}$.

This computation is made tractable by using dynamic programming to build up products recursively.
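A direct (brute-force) realization of these three steps, assuming head-averaged per-layer attention matrices, might look like the sketch below; the function and argument names are illustrative.

```python
from itertools import combinations
from math import comb

import numpy as np

def ola_components(attn_mats, max_order=3):
    """Order-wise decomposition of attention rollout (brute-force sketch).

    attn_mats: list of per-layer attention matrices A^(1..N), each (L, L),
               e.g. already averaged over heads.
    Returns {k: A_hat^(k)} for k = 0 .. max_order.
    """
    N = len(attn_mats)
    L = attn_mats[0].shape[0]
    comps = {0: np.eye(L)}                      # order 0: pure skip paths
    for k in range(1, min(max_order, N) + 1):
        acc = np.zeros((L, L))
        for idx in combinations(range(N), k):   # all subsets i_1 < ... < i_k
            P = np.eye(L)
            for i in idx:                       # left-multiply so P = A^(i_k) ... A^(i_1)
                P = attn_mats[i] @ P
            acc += P
        comps[k] = acc / comb(N, k)             # mean over all C(N, k) paths
    return comps
```

A dynamic-programming variant, as noted above, would reuse partial products across overlapping subsets instead of recomputing each $P_C$ from scratch.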

3. Model Architectures and Integration Points

In joint intent detection and slot filling for spoken language understanding, the HAN architecture integrates order-level attention as follows:

  1. Self-Attentive Embedder: A shared BiLSTM yields hidden states $H$; label-attention layers derive intent ($H_I$) and slot ($H_S$) embeddings.
  2. Higher-Order Attention Encoder: A stack of $N$ sublayers, each containing two Bilinear Attention Blocks (intent $\rightarrow$ slot and slot $\rightarrow$ intent), each followed by a residual connection and layer normalization:

$$H_I^{(\ell)} = \mathrm{LN}\!\left(H_I^{(\ell-1)} + \hat{V}_I^{(\ell)}\right)$$

and analogously for $H_S^{(\ell)}$, with $\hat{V}_I^{(\ell)}$ the bilinear-block output for the intent stream.

  3. Dynamic Feature Fusion: Sigmoid gates $\alpha_I$, $\alpha_S$ fuse $H_I^{(N)}$ and $H_S^{(N)}$; the output is passed through position-wise feed-forward layers and normalization (a minimal gate sketch follows this list).
  4. Prediction Heads: For intent, max-pooling and a softmax classifier; for slot filling, a linear projection and a CRF.
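As an illustration of the fusion step in item 3, the sketch below computes gates from the concatenated intent and slot streams; this particular gate parameterization is an assumption for exposition, not the published formulation.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Hypothetical sketch of the dynamic feature fusion between streams."""

    def __init__(self, d: int):
        super().__init__()
        self.gate_i = nn.Linear(2 * d, d)  # produces alpha_I
        self.gate_s = nn.Linear(2 * d, d)  # produces alpha_S

    def forward(self, h_i: torch.Tensor, h_s: torch.Tensor):
        # h_i, h_s: (batch, seq_len, d) final intent / slot representations H_I^(N), H_S^(N)
        joint = torch.cat([h_i, h_s], dim=-1)
        alpha_i = torch.sigmoid(self.gate_i(joint))
        alpha_s = torch.sigmoid(self.gate_s(joint))
        fused_i = alpha_i * h_i + (1 - alpha_i) * h_s   # gated mix into the intent stream
        fused_s = alpha_s * h_s + (1 - alpha_s) * h_i   # gated mix into the slot stream
        return fused_i, fused_s
```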

Order-wise decomposition of attention rollout is used not only for probing but also for constructing cross-model adapters:

  • OLA Extraction: Compute $\hat{A}^{(1)}, \hat{A}^{(2)}, \ldots$ from the model's attention matrices.
  • Adapter Network (TOA): Concatenate the first- and second-order OLAs as input, apply a $1 \times 1$ convolution and axial transformer layers, and extract the diagonal elements for per-token task classification (see the sketch following this list). Unlike other adaptation approaches, the TOA requires no updates to target-model parameters.
  • Test-time Transfer: Given an unseen model, extract its OLA, feed through the trained adapter, and perform downstream tasks with no further fine-tuning.
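The data flow of the TOA can be summarized in a short sketch; here the axial transformer layers are replaced by a plain convolutional stage purely for brevity, and the class name, channel counts, and classification head are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TOASketch(nn.Module):
    """Skeleton of the Transferable OLA Adapter data flow (illustrative only)."""

    def __init__(self, hidden: int = 64, num_classes: int = 10):
        super().__init__()
        self.mix = nn.Conv2d(2, hidden, kernel_size=1)   # 1x1 conv over stacked OLA channels
        # Stand-in for the axial transformer layers used in the published adapter.
        self.body = nn.Sequential(
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(hidden, num_classes)       # per-token task classifier

    def forward(self, ola1: torch.Tensor, ola2: torch.Tensor) -> torch.Tensor:
        # ola1, ola2: (batch, L, L) first- and second-order OLA maps from the frozen target model
        x = torch.stack([ola1, ola2], dim=1)             # (batch, 2, L, L)
        h = self.body(self.mix(x))                       # (batch, hidden, L, L)
        diag = torch.diagonal(h, dim1=-2, dim2=-1)       # per-token features, (batch, hidden, L)
        return self.head(diag.transpose(1, 2))           # (batch, L, num_classes)
```

Because the adapter consumes only OLA maps, the target model's parameters are never touched, matching the zero-parameter transfer setting described above.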

4. Empirical Findings and Performance Analysis

Spoken Language Understanding (SLU) Benchmarks

On SNIPS (14,000 utterances, 7 intents) and ATIS (5,000 utterances, 17 intents):

| Model Variant | Overall Frame Acc. (SNIPS) | Overall Frame Acc. (ATIS) |
|---|---|---|
| BiLSTM + decoder | 85.9 % | 85.0 % |
| + Label-attn & shallow cat | 87.9 % | 85.9 % |
| + 1st-order attention | 88.1 % | 87.0 % |
| + Bilinear block (2nd) | 88.6 % | 87.3 % |
| + Dynamic fusion | 89.4 % | 87.6 % |
| + ELU (inf. order, HAN) | 90.4 % | 88.1 % |
| HAN + BERT | 93.5 % | 89.3 % |

Bilinear (second-order) and infinite-order attention consistently improve both slot F1 and intent accuracy. Integrating bilinear attention blocks into state-of-the-art joint models (Stack-Propagation, Co-Interactive) yields further consistent gains.

Ablation studies demonstrate increased robustness to learning rate variation and improved attention map sharpness (i.e., more focused keyword alignment).

Cross-LLM Commonality

Visualization and quantitative analysis of OLA matrices across models (BERT, RoBERTa, ELECTRA, Qwen2, Gemma2, LLaMA3) confirm that same-sentence, same-order OLAs are highly similar across models, despite architectural and pre-training variations.

Quantitatively:

  • Classification Proxy: ResNet-18 distinguishes sentence classes with >90 % accuracy using OLA maps and generalizes across models; raw rollout and norm-based approaches perform worse.
  • Image-retrieval Proxy: First-order OLA yields Hits@5 > 95 % in inter-model map retrieval, confirming latent alignment.

Downstream Transfer via TOA

Transferable OLA Adapter (TOA) demonstrates zero-parameter cross-model transfer:

| Task | Zero-shot Baseline | TOA (cross-LM) |
|---|---|---|
| RE (SemEval) | ~7 % | ~35 % acc. |
| NER (CoNLL) | ~5 % | ~50 % F1 |
| DP (UAS/LAS) | ≪10 % | ~60 % / 40 % |
| POS | <1 % | >70 % acc. |

TOA achieves ≥90 % of its self-test performance on many source–target pairs, without any model-specific fine-tuning.

5. Linguistic and Theoretical Insights

First-order OLA aligns closely with syntactic dependency structure, as shown by probing experiments: first-order maps alone yield >80 % UAS / >72 % LAS for masked language models (MLMs) in dependency parsing, compared to >60 % for causal language models (CLMs). Higher orders encode less syntax and more residual or background aggregation. Empirically, gains beyond second order tend to saturate, and deeper stacking of bilinear blocks leads to overfitting.

This suggests that the first order captures essential linguistic scaffolding, with diminishing returns for higher orders, especially in architectures with strong skip connections.

6. Broader Implications and Potential Applications

The OLA formalism uncovers a shared, latent scaffolding across pretrained transformers, independent of training corpus or architecture. Notable implications include:

  • Unified Model Probing: OLA acts as a model-agnostic feature for investigating shared and idiosyncratic properties across LLMs.
  • Model Similarity Metrics: OLA can serve as a basis for quantifying model proximity, potentially guiding checkpoint selection or domain adaptation strategies.
  • Privacy-Preserving Adaptation: Since TOA adapters see only OLA matrices—and not original embeddings or tokens—they offer privacy advantages in sensitive applications.
  • Dynamic Adapter Composition: Adapters trained on smaller or open models generalize to large, even closed-source models via OLA as the input interface, enabling rapid deployment without re-training.
  • Linguistic Diagnostics: OLA enables model-agnostic probing of syntax, and potentially semantics and discourse.

Broader implications include the formalization of context-aggregation patterns as a core axis of model similarity and the prospect of constructing universally applicable adapters or analytic tools for LLMs.

7. Limitations and Sensitivities

While OLA provides robust improvements, stacking more than two bilinear layers leads to overfitting, and infinite-order expansion with ELU must be regularized. Higher-order components rapidly lose alignment with interpretable linguistic phenomena. The transferability of OLA-based adapters is strongest for first- and second-order, while higher orders mainly encode model-internal aggregation bias. OLA-derived diagnostics rely on precise calculation of attention matrices, and attention-sink phenomena may confound high-order rollout for very deep models.

Overall, Order-Level Attention unifies families of higher-order neural interaction mechanisms and opens a promising direction for model-agnostic feature extraction, transfer learning, and linguistic probing.
