
Order-Level Attention (OLA)

Updated 14 November 2025
  • Order-Level Attention (OLA) is a mechanism that generalizes traditional self-attention by incorporating higher-order (e.g., pairwise, triplet) interactions among tokens.
  • OLA decomposes the attention rollout across model layers to reveal latent structures and enrich context aggregation in deep neural architectures.
  • Practical applications of OLA include enhancing transformer performance, enabling cross-model adapter transfer, and providing diagnostic insights into model behavior.

Order-Level Attention (OLA), also termed higher-order attention, encompasses a spectrum of mechanisms for modeling and analyzing feature interactions among tokens in deep neural architectures. OLA generalizes conventional (first-order) attention by capturing not only linear but also higher-order (e.g., pairwise, triplet, and beyond) interactions, and includes methods for decomposing or parameterizing these interactions both within models and across different model instances. As such, OLA provides both a principled modeling tool for richer context aggregation and a diagnostic for uncovering shared latent structures across pre-trained LLMs.

1. Formal Definition and Mathematical Framework

Conventional self-attention mechanisms, as exemplified by the Transformer architecture, operate at first order: each query $q \in \mathbb{R}^d$ computes alignment scores against keys $k_i \in \mathbb{R}^d$ using a linear compatibility function (e.g., dot product), followed by a softmax-weighted sum over value vectors $v_i \in \mathbb{R}^d$. This linear interaction is limited to additive (first-order) relationships between query and key.
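Concretely, the first-order mechanism computes

$$\mathrm{Attn}(q, \{k_i\}, \{v_i\}) = \sum_i \alpha_i v_i, \qquad \alpha_i = \mathrm{softmax}_i\!\left(\frac{q^\top k_i}{\sqrt{d}}\right),$$

so the query and each key interact only through a single inner product.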

Order-Level Attention generalizes this by introducing explicit higher-order interactions. For the second-order case, a Bilinear Attention Block maps the interaction to

$$B_i^k = \mathrm{ReLU}(W_k k_i) \odot \mathrm{ReLU}(W_q^k q) \in \mathbb{R}^d,$$

where $W_k, W_q^k \in \mathbb{R}^{d \times d}$ and $\odot$ denotes element-wise multiplication. This term captures pairwise interactions (outer products in embedding space) between each dimension of $q$ and $k_i$. Stacking multiple such blocks extends this to higher $p$-order interactions; optionally, replacing ReLU with ELU and leveraging the Taylor expansion of exponentials yields a continuous, infinite-order feature fusion.

In an orthogonal but related approach, OLA may refer to an order-wise decomposition of the cumulative context-aggregation matrix in a transformer. The rollout of attention across $N$ layers,

$$\hat{A} = \prod_{i=1}^N \left(A^{(i)} + I\right),$$

with $A^{(i)}$ as the $i$th layer's attention matrix and $I$ as the identity, can be expanded into a sum over all possible paths grouped by order (number of attention steps). The order-$k$ OLA component is the mean of all $k$-step path products:

$$\hat{A}^{(k)} = \frac{1}{\binom{N}{k}} \sum_{1 \leq i_1 < \cdots < i_k \leq N} A^{(i_k)} A^{(i_{k-1})} \cdots A^{(i_1)}.$$

Here $\hat{A}^{(0)} = I$ (pure skip paths), while $\hat{A}^{(1)}$ corresponds to all single-layer attention applications, and so on up to $\hat{A}^{(N)}$.
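As a concrete check, with $N = 2$ the rollout expands as $\hat{A} = (A^{(2)} + I)(A^{(1)} + I) = I + A^{(1)} + A^{(2)} + A^{(2)} A^{(1)}$; grouping terms by the number of attention steps and normalizing gives $\hat{A}^{(0)} = I$, $\hat{A}^{(1)} = \tfrac{1}{2}(A^{(1)} + A^{(2)})$, and $\hat{A}^{(2)} = A^{(2)} A^{(1)}$, so that $\hat{A} = \sum_k \binom{N}{k} \hat{A}^{(k)}$.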

2. Algorithmic Realizations

Bilinear Attention Block Construction

For feature modeling, the Bilinear Attention Block proceeds as follows:

  • Query–Key Bilinear Interaction:
    • $B_i^k = \mathrm{ReLU}(W_k k_i) \odot \mathrm{ReLU}(W_q^k q)$
  • Contextual Attention Weighting:
    • $B_i^{\prime k} = \mathrm{ReLU}(W_B^k B_i^k)$
    • $b_i^s = W_b B_i^{\prime k}$, $\beta^s = \mathrm{softmax}(b^s)$
  • Channel-Wise Attention (Squeeze-Excitation):
    • $\bar B = \frac{1}{n} \sum_{i=1}^{n} B_i^{\prime k}$
    • $b^c = W_e \bar B$, $\beta^c = \sigma(b^c)$
  • Query–Value Bilinear Enhancement:
    • $B_i^v = \mathrm{ReLU}(W_v v_i) \odot \mathrm{ReLU}(W_q^v q)$
    • Output: $\hat{v} = \beta^c \odot \sum_{i=1}^n \beta^s_i B_i^v$

Stacking $N$ Bilinear Attention Blocks produces interactions up to order $2N$. For infinite-order interactions, replace $\mathrm{ReLU}$ with $\mathrm{ELU}$: products of the form $\exp(W_X X) \odot \exp(W_Y Y)$ expand into all possible interaction orders via the Taylor series of the exponential.
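For reference, the following is a minimal PyTorch sketch of a single Bilinear Attention Block implementing the steps above; the module and parameter names, the bias-free linear layers, and the `use_elu` switch are illustrative assumptions rather than the original implementation.

```python
import torch
import torch.nn as nn

class BilinearAttentionBlock(nn.Module):
    """Sketch of one Bilinear Attention Block (names are illustrative)."""

    def __init__(self, d: int, use_elu: bool = False):
        super().__init__()
        # ELU in place of ReLU approximates the infinite-order (exponential) fusion.
        self.act = nn.ELU() if use_elu else nn.ReLU()
        self.W_k = nn.Linear(d, d, bias=False)   # key projection
        self.W_qk = nn.Linear(d, d, bias=False)  # query projection (key side)
        self.W_B = nn.Linear(d, d, bias=False)   # contextual transform of B_i^k
        self.w_b = nn.Linear(d, 1, bias=False)   # spatial attention logits b_i^s
        self.W_e = nn.Linear(d, d, bias=False)   # channel (squeeze-excitation) weights
        self.W_v = nn.Linear(d, d, bias=False)   # value projection
        self.W_qv = nn.Linear(d, d, bias=False)  # query projection (value side)

    def forward(self, q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
        # q: (batch, d); K, V: (batch, n, d)
        B = self.act(self.W_k(K)) * self.act(self.W_qk(q)).unsqueeze(1)    # B_i^k, (batch, n, d)
        Bp = self.act(self.W_B(B))                                         # B_i'^k
        beta_s = torch.softmax(self.w_b(Bp).squeeze(-1), dim=-1)           # spatial weights, (batch, n)
        beta_c = torch.sigmoid(self.W_e(Bp.mean(dim=1)))                   # channel weights, (batch, d)
        Bv = self.act(self.W_v(V)) * self.act(self.W_qv(q)).unsqueeze(1)   # B_i^v, (batch, n, d)
        return beta_c * (beta_s.unsqueeze(-1) * Bv).sum(dim=1)             # v_hat, (batch, d)
```

Per the $2N$ rule above, stacking two such blocks would capture interactions up to fourth order.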

Efficient OLA Computation for Transformers

Given a suite of attention matrices $\{A^{(1)}, \ldots, A^{(N)}\}$ (each $L \times L$):

  1. For each $k$ in $0..K$ (with $K \leq 3$ in most empirical cases), enumerate all index subsets $C$ of size $k$ from $\{1, \ldots, N\}$.
  2. For each $C = \{i_1 < \cdots < i_k\}$, compute the product $P_C = A^{(i_k)} A^{(i_{k-1})} \cdots A^{(i_1)}$.
  3. Average over all such $P_C$ to form $\hat{A}^{(k)}$.

This computation is made tractable by using dynamic programming to build up products recursively.
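A direct (brute-force) realization of these three steps, assuming head-averaged per-layer attention matrices, might look like the sketch below; the function and argument names are illustrative.

```python
from itertools import combinations
from math import comb

import numpy as np

def ola_components(attn_mats, max_order=3):
    """Order-wise decomposition of attention rollout (brute-force sketch).

    attn_mats: list of per-layer attention matrices A^(1..N), each (L, L),
               e.g. already averaged over heads.
    Returns {k: A_hat^(k)} for k = 0 .. max_order.
    """
    N = len(attn_mats)
    L = attn_mats[0].shape[0]
    comps = {0: np.eye(L)}                      # order 0: pure skip paths
    for k in range(1, min(max_order, N) + 1):
        acc = np.zeros((L, L))
        for idx in combinations(range(N), k):   # all subsets i_1 < ... < i_k
            P = np.eye(L)
            for i in idx:                       # left-multiply so P = A^(i_k) ... A^(i_1)
                P = attn_mats[i] @ P
            acc += P
        comps[k] = acc / comb(N, k)             # mean over all C(N, k) paths
    return comps
```

A dynamic-programming variant, as noted above, would reuse partial products across overlapping subsets instead of recomputing each $P_C$ from scratch.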

3. Model Architectures and Integration Points

In joint intent detection and slot filling for spoken language understanding, the HAN architecture integrates order-level attention as follows:

  1. Self-Attentive Embedder: A shared BiLSTM yields hidden states $H$; label-attention layers derive intent ($H_I$) and slot ($H_S$) embeddings.
  2. Higher-Order Attention Encoder: A stack of $N$ sublayers, each containing two Bilinear Attention Blocks (intent $\rightarrow$ slot and slot $\rightarrow$ intent), each followed by a residual connection and layer normalization:

$$H_I^{(\ell)} = \mathrm{LN}\!\left(H_I^{(\ell-1)} + \hat{V}_I^{(\ell)}\right)$$

and analogously for $H_S^{(\ell)}$, with $\hat{V}_I^{(\ell)}$ the bilinear-block output for the intent stream.

  3. Dynamic Feature Fusion: Sigmoid gates $\alpha_I$, $\alpha_S$ fuse $H_I^{(N)}$ and $H_S^{(N)}$; the output is passed through position-wise feed-forward layers and normalization (a minimal gate sketch follows this list).
  4. Prediction Heads: For intent, max-pooling and a softmax classifier; for slot filling, a linear projection and a CRF.
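As an illustration of the fusion step in item 3, the sketch below computes gates from the concatenated intent and slot streams; this particular gate parameterization is an assumption for exposition, not the published formulation.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Hypothetical sketch of the dynamic feature fusion between streams."""

    def __init__(self, d: int):
        super().__init__()
        self.gate_i = nn.Linear(2 * d, d)  # produces alpha_I
        self.gate_s = nn.Linear(2 * d, d)  # produces alpha_S

    def forward(self, h_i: torch.Tensor, h_s: torch.Tensor):
        # h_i, h_s: (batch, seq_len, d) final intent / slot representations H_I^(N), H_S^(N)
        joint = torch.cat([h_i, h_s], dim=-1)
        alpha_i = torch.sigmoid(self.gate_i(joint))
        alpha_s = torch.sigmoid(self.gate_s(joint))
        fused_i = alpha_i * h_i + (1 - alpha_i) * h_s   # gated mix into the intent stream
        fused_s = alpha_s * h_s + (1 - alpha_s) * h_i   # gated mix into the slot stream
        return fused_i, fused_s
```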

Order-wise decomposition of attention rollout is used not only for probing but also for constructing cross-model adapters:

  • OLA Extraction: Compute $\hat{A}^{(1)}, \hat{A}^{(2)}, \ldots$ from the model's attention matrices.
  • Adapter Network (TOA): Concatenate the first- and second-order OLAs as input, apply a $1 \times 1$ convolution and axial transformer layers, and extract the diagonal elements for per-token task classification (see the sketch following this list). Unlike other adaptation approaches, the TOA requires no updates to target-model parameters.
  • Test-time Transfer: Given an unseen model, extract its OLA, feed through the trained adapter, and perform downstream tasks with no further fine-tuning.
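The data flow of the TOA can be summarized in a short sketch; here the axial transformer layers are replaced by a plain convolutional stage purely for brevity, and the class name, channel counts, and classification head are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TOASketch(nn.Module):
    """Skeleton of the Transferable OLA Adapter data flow (illustrative only)."""

    def __init__(self, hidden: int = 64, num_classes: int = 10):
        super().__init__()
        self.mix = nn.Conv2d(2, hidden, kernel_size=1)   # 1x1 conv over stacked OLA channels
        # Stand-in for the axial transformer layers used in the published adapter.
        self.body = nn.Sequential(
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(hidden, num_classes)       # per-token task classifier

    def forward(self, ola1: torch.Tensor, ola2: torch.Tensor) -> torch.Tensor:
        # ola1, ola2: (batch, L, L) first- and second-order OLA maps from the frozen target model
        x = torch.stack([ola1, ola2], dim=1)             # (batch, 2, L, L)
        h = self.body(self.mix(x))                       # (batch, hidden, L, L)
        diag = torch.diagonal(h, dim1=-2, dim2=-1)       # per-token features, (batch, hidden, L)
        return self.head(diag.transpose(1, 2))           # (batch, L, num_classes)
```

Because the adapter consumes only OLA maps, the target model's parameters are never touched, matching the zero-parameter transfer setting described above.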

4. Empirical Findings and Performance Analysis

Spoken Language Understanding (SLU) Benchmarks

On SNIPS (14,000 utterances, 7 intents) and ATIS (5,000 utterances, 17 intents):

| Model Variant | Overall Frame Acc. (SNIPS) | Overall Frame Acc. (ATIS) |
|---|---|---|
| BiLSTM + decoder | 85.9 % | 85.0 % |
| + Label-attn & shallow cat | 87.9 % | 85.9 % |
| + 1st-order attention | 88.1 % | 87.0 % |
| + Bilinear block (2nd) | 88.6 % | 87.3 % |
| + Dynamic fusion | 89.4 % | 87.6 % |
| + ELU (inf. order, HAN) | 90.4 % | 88.1 % |
| HAN + BERT | 93.5 % | 89.3 % |

Bilinear (second-order) and infinite-order attention consistently improve both slot F1 and intent accuracy. Integrating bilinear attention blocks into state-of-the-art joint models (Stack-Propagation, Co-Interactive) yields further consistent gains.

Ablation studies demonstrate increased robustness to learning rate variation and improved attention map sharpness (i.e., more focused keyword alignment).

Cross-LLM Commonality

Visualization and quantitative analysis of OLA matrices across models (BERT, RoBERTa, ELECTRA, Qwen2, Gemma2, LLaMA3) confirm that same-sentence, same-order OLAs are highly similar across models, despite architectural and pre-training variations.

Quantitatively:

  • Classification Proxy: ResNet-18 distinguishes sentence classes with >90 % accuracy using OLA maps and generalizes across models; raw rollout and norm-based approaches perform worse.
  • Image-retrieval Proxy: First-order OLA yields Hits@5 > 95 % in inter-model map retrieval, confirming latent alignment.

Downstream Transfer via TOA

Transferable OLA Adapter (TOA) demonstrates zero-parameter cross-model transfer:

| Task | Zero-shot Baseline | TOA (cross-LM) |
|---|---|---|
| RE (SemEval) | ~7 % | ~35 % acc. |
| NER (CoNLL) | ~5 % | ~50 % F1 |
| DP (UAS/LAS) | ≪10 % | ~60 % / 40 % |
| POS | <1 % | >70 % acc. |

TOA achieves ≥90 % of its self-test performance on many source–target pairs, without any model-specific fine-tuning.

5. Linguistic and Theoretical Insights

First-order OLA aligns closely with syntactic dependency structure, as shown by probing experiments: first-order maps alone yield >80 % UAS / >72 % LAS for masked language models (MLMs) in dependency parsing, compared to >60 % for causal language models (CLMs). Higher orders encode less syntax and more residual or background aggregation. Empirically, gains beyond second order tend to saturate, and deeper stacking of bilinear blocks leads to overfitting.

This suggests that the first order captures essential linguistic scaffolding, with diminishing returns for higher orders, especially in architectures with strong skip connections.

6. Broader Implications and Potential Applications

The OLA formalism uncovers a shared, latent scaffolding across pretrained transformers, independent of training corpus or architecture. Notable implications include:

  • Unified Model Probing: OLA acts as a model-agnostic feature for investigating shared and idiosyncratic properties across LLMs.
  • Model Similarity Metrics: OLA can serve as a basis for quantifying model proximity, potentially guiding checkpoint selection or domain adaptation strategies.
  • Privacy-Preserving Adaptation: Since TOA adapters see only OLA matrices—and not original embeddings or tokens—they offer privacy advantages in sensitive applications.
  • Dynamic Adapter Composition: Adapters trained on smaller or open models generalize to large, even closed-source models via OLA as the input interface, enabling rapid deployment without re-training.
  • Linguistic Diagnostics: OLA enables model-agnostic probing of syntax, and potentially semantics and discourse.

Broader implications include the formalization of context-aggregation patterns as a core axis of model similarity and the prospect of constructing universally applicable adapters or analytic tools for LLMs.

7. Limitations and Sensitivities

While OLA provides robust improvements, stacking more than two bilinear layers leads to overfitting, and infinite-order expansion with ELU must be regularized. Higher-order components rapidly lose alignment with interpretable linguistic phenomena. The transferability of OLA-based adapters is strongest for first- and second-order, while higher orders mainly encode model-internal aggregation bias. OLA-derived diagnostics rely on precise calculation of attention matrices, and attention-sink phenomena may confound high-order rollout for very deep models.

Overall, Order-Level Attention unifies families of higher-order neural interaction mechanisms and opens a promising direction for model-agnostic feature extraction, transfer learning, and linguistic probing.
