Order-Level Attention (OLA)
- Order-Level Attention (OLA) is a mechanism that generalizes traditional self-attention by incorporating higher-order (e.g., pairwise, triplet) interactions among tokens.
- OLA decomposes the attention rollout across model layers to reveal latent structures and enrich context aggregation in deep neural architectures.
- Practical applications of OLA include enhancing transformer performance, enabling cross-model adapter transfer, and providing diagnostic insights into model behavior.
Order-Level Attention (OLA), also termed higher-order attention, encompasses a spectrum of mechanisms for modeling and analyzing feature interactions among tokens in deep neural architectures. OLA generalizes conventional (first-order) attention by capturing not only linear but also higher-order (e.g., pairwise, triplet, and beyond) interactions, and includes methods for decomposing or parameterizing these interactions both within models and across different model instances. As such, OLA provides both a principled modeling tool for richer context aggregation and a diagnostic for uncovering shared latent structures across pre-trained LLMs.
1. Formal Definition and Mathematical Framework
Conventional self-attention mechanisms, as exemplified by the Transformer architecture, operate at first order: each query computes alignment scores against keys using a linear compatibility function (e.g., the dot product), followed by a softmax-weighted sum over the value vectors. This linear interaction is limited to additive (first-order) relationships between query and key.
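For reference, the first-order (scaled dot-product) form reads

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

where $d_k$ is the key dimension.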
Order-Level Attention generalizes this by introducing explicit higher-order interactions. For the second-order case, a Bilinear Attention Block maps the query–key interaction to an element-wise product of nonlinearly projected queries and keys,

$$B = \mathrm{ReLU}(W_q Q) \odot \mathrm{ReLU}(W_k K),$$

where $W_q, W_k$ are learned projections and $\odot$ denotes element-wise multiplication. This term captures pairwise interactions (outer products in embedding space) between each dimension of $Q$ and $K$. Stacking multiple such blocks extends this to higher-order interactions, and, optionally, replacing ReLU with ELU and leveraging the Taylor expansion of the exponential yields a continuous, infinite-order feature fusion.
In an orthogonal but related approach, OLA may refer to an order-wise decomposition of the cumulative context-aggregation (attention rollout) matrix in a transformer. The rollout of attention across layers,

$$\tilde{A} = (A_L + I)(A_{L-1} + I)\cdots(A_1 + I),$$

with $A_\ell$ the $\ell$-th layer's attention matrix and $I$ the identity, can be expanded as a sum over all possible paths, grouped by order (the number of attention steps taken). The order-$k$ OLA component is the mean over all ordered products along length-$k$ layer subsets:

$$\mathrm{OLA}_k = \binom{L}{k}^{-1} \sum_{1 \le \ell_1 < \cdots < \ell_k \le L} A_{\ell_k} \cdots A_{\ell_1}.$$

Here $\mathrm{OLA}_0 = I$ (pure skip paths), $\mathrm{OLA}_1$ corresponds to all single-layer attention applications, and so on up to $\mathrm{OLA}_L$.
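As a small worked example of this decomposition (following the normalization above), for $L = 2$ layers the rollout expands as

$$(A_2 + I)(A_1 + I) = I + (A_1 + A_2) + A_2 A_1,$$

giving $\mathrm{OLA}_0 = I$, $\mathrm{OLA}_1 = \tfrac{1}{2}(A_1 + A_2)$, and $\mathrm{OLA}_2 = A_2 A_1$.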
2. Algorithmic Realizations
Bilinear Attention Block Construction
For feature modeling, the Bilinear Attention Block proceeds as follows:
- Query–Key Bilinear Interaction: project the query and each key, apply a nonlinearity (ReLU), and combine the projections element-wise to obtain bilinear features.
- Contextual Attention Weighting: map the bilinear features to scalar scores and normalize them with a softmax over positions, yielding spatial attention weights.
- Channel-Wise Attention (Squeeze-Excitation): pool the bilinear features across positions and apply a sigmoid-gated excitation to obtain channel-wise weights.
- Query–Value Bilinear Enhancement: form the analogous bilinear interaction between the projected query and each value.
- Output: aggregate the query–value features with the spatial weights and modulate the result channel-wise to produce the enhanced context vector.
Stacking $N$ Bilinear Attention Blocks produces interactions up to order $2N$. For infinite-order interactions, ReLU is replaced with ELU, leveraging the Taylor expansion of the exponential, $e^{x} = \sum_{n=0}^{\infty} x^{n}/n!$, which expands the fused features into all possible interaction orders.
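A minimal NumPy sketch of one such block, following the step names above; the specific projections (`Wq`, `Wk`, `Wv`, `Wb`, `Wc`, `wb`) and shapes are illustrative assumptions, not the reference implementation from the cited work:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class BilinearAttentionBlock:
    """Illustrative second-order (bilinear) attention block.

    q is (d,); keys K and values V are (m, d); all weight matrices are (d, d).
    """
    def __init__(self, d, seed=0):
        rng = np.random.default_rng(seed)
        init = lambda: rng.standard_normal((d, d)) / np.sqrt(d)
        self.Wq, self.Wk, self.Wv, self.Wb, self.Wc = (init() for _ in range(5))
        self.wb = rng.standard_normal(d) / np.sqrt(d)    # scoring vector

    def __call__(self, q, K, V):
        # 1. Query-key bilinear interaction: element-wise product of projections.
        Bk = relu(K @ self.Wk) * relu(q @ self.Wq)                    # (m, d)
        # 2. Contextual (spatial) attention weights over the m positions.
        beta_s = softmax(relu(Bk @ self.Wb) @ self.wb)                # (m,)
        # 3. Channel-wise attention via squeeze-excitation on pooled features.
        beta_c = sigmoid(relu(Bk @ self.Wb).mean(axis=0) @ self.Wc)   # (d,)
        # 4. Query-value bilinear enhancement.
        Bv = relu(V @ self.Wv) * relu(q @ self.Wq)                    # (m, d)
        # 5. Spatially weighted aggregation, modulated channel-wise.
        return beta_c * (beta_s @ Bv)                                 # (d,)

# Usage: one query attending over m = 5 positions with d = 8 channels.
rng = np.random.default_rng(1)
block = BilinearAttentionBlock(d=8)
out = block(rng.standard_normal(8), rng.standard_normal((5, 8)), rng.standard_normal((5, 8)))
print(out.shape)  # (8,)
```

Swapping `relu` for an ELU-based nonlinearity in steps 1 and 4 corresponds to the infinite-order variant described above.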
Efficient OLA Computation for Transformers
Given a suite of $L$ per-layer attention matrices ($n \times n$ each):
- For each order $k$ in $0..K$ (with small $K$ in most empirical cases), enumerate all index subsets $S \subseteq \{1, \dots, L\}$ of size $k$.
- For each subset $S = \{\ell_1 < \cdots < \ell_k\}$, compute the ordered product $A_{\ell_k} \cdots A_{\ell_1}$.
- Average over all such products to form $\mathrm{OLA}_k$.
This computation is made tractable by using dynamic programming to build up products recursively.
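Sketched below in NumPy, under the decomposition given in Section 1; the function name `ola_components` and the exact recurrence are illustrative assumptions rather than the reference implementation:

```python
import numpy as np
from math import comb

def ola_components(attn, K):
    """Return [OLA_0, ..., OLA_K] for a list of per-layer attention matrices.

    attn : list of L matrices, each (n, n), ordered bottom layer first.
    OLA_k is the mean over all size-k layer subsets of the ordered product
    A_{l_k} @ ... @ A_{l_1} (later layers applied on the left).
    """
    L, n = len(attn), attn[0].shape[0]
    # sums[k] accumulates the sum of ordered products over all size-k
    # subsets of the layers processed so far (dynamic programming).
    sums = [np.eye(n)] + [np.zeros((n, n)) for _ in range(K)]
    for A in attn:                      # add one layer at a time, bottom-up
        for k in range(K, 0, -1):       # descend so A enters each subset once
            sums[k] = sums[k] + A @ sums[k - 1]
    return [sums[k] / comb(L, k) for k in range(K + 1)]

# Usage: L = 4 random row-stochastic "attention" matrices over n = 6 tokens.
rng = np.random.default_rng(0)
attn = [a / a.sum(axis=1, keepdims=True) for a in rng.random((4, 6, 6))]
ola = ola_components(attn, K=2)
print([m.shape for m in ola])  # [(6, 6), (6, 6), (6, 6)]
```

Compared with enumerating all $\binom{L}{k}$ subsets explicitly, this recurrence needs only $O(LK)$ matrix products.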
3. Model Architectures and Integration Points
Higher-order Attention Network (HAN) (Chen et al., 2021)
HAN integrates order-level attention via the following architecture:
- Self-Attentive Embedder: a shared BiLSTM yields contextual hidden states, from which label-attention layers derive intent and slot embeddings.
- Higher-Order Attention Encoder: a stack of sublayers, each containing two Bilinear Attention Blocks (intent→slot and slot→intent), each followed by a residual connection and layer normalization; the slot-oriented stream is updated analogously to the intent-oriented stream.
- Dynamic Feature Fusion: sigmoid gates fuse the intent-aware and slot-aware features; the output passes through position-wise feed-forward layers and normalization (a minimal sketch of this gating follows the list).
- Prediction Heads: for intent, max-pooling followed by a softmax classifier; for slots, a linear projection followed by a CRF.
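A minimal sketch of the sigmoid-gated fusion step, assuming a concatenation-based gate over two equally sized streams; the parameterization (`Wg`, `bg`) is an assumption for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dynamic_fusion(h_intent, h_slot, Wg, bg):
    """Gate-weighted fusion of intent-aware and slot-aware features.

    h_intent, h_slot : (seq_len, d) feature streams.
    Wg : (2 * d, d) gate projection, bg : (d,) bias  -- assumed parameterization.
    """
    g = sigmoid(np.concatenate([h_intent, h_slot], axis=-1) @ Wg + bg)  # (seq_len, d)
    return g * h_intent + (1.0 - g) * h_slot                            # convex combination

# Usage: fuse two random streams of width d = 16 over a 10-token sequence.
rng = np.random.default_rng(0)
d = 16
fused = dynamic_fusion(rng.standard_normal((10, d)), rng.standard_normal((10, d)),
                       rng.standard_normal((2 * d, d)) / np.sqrt(2 * d),
                       np.zeros(d))
print(fused.shape)  # (10, 16)
```

The gate interpolates between the two streams per position and per channel.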
Order-Level Attention Analysis and Adapter Transfer (Liang et al., 7 Nov 2025)
Order-wise decomposition of attention rollout is used not only for probing but also for constructing cross-model adapters:
- OLA Extraction: compute $\mathrm{OLA}_1$, $\mathrm{OLA}_2$, … from the model's per-layer attention matrices.
- Adapter Network (TOA): concatenate the first- and second-order OLAs as input, apply convolutional and axial-transformer layers, and extract the diagonal elements for per-token task classification. Unlike other adaptation approaches, TOA requires no updates to the target model's parameters.
- Test-time Transfer: Given an unseen model, extract its OLA, feed through the trained adapter, and perform downstream tasks with no further fine-tuning.
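Schematically, the transfer pipeline can be sketched as below; the `adapter` interface, the dummy stand-in, and the restriction to first- and second-order components are illustrative assumptions, not the released implementation:

```python
import numpy as np
from math import comb

def toa_transfer(attn, adapter):
    """Zero-parameter cross-model transfer via OLA features (illustrative).

    attn    : list of L row-stochastic (n, n) attention matrices extracted
              from any source or target model.
    adapter : trained network mapping a (2, n, n) stack of first- and
              second-order OLAs to (n, n, num_classes) scores -- a
              hypothetical interface standing in for the TOA network.
    The target model's own parameters are never updated; only OLA matrices
    cross the model boundary.
    """
    L = len(attn)
    ola1 = sum(attn) / L                                         # first-order OLA
    ola2 = sum(attn[j] @ attn[i]                                 # second-order OLA
               for j in range(L) for i in range(j)) / comb(L, 2)
    features = np.stack([ola1, ola2], axis=0)                    # (2, n, n)
    scores = adapter(features)                                   # (n, n, num_classes)
    idx = np.arange(scores.shape[0])
    per_token = scores[idx, idx, :]                              # diagonal -> per-token scores
    return per_token.argmax(axis=-1)                             # (n,) predicted labels

# Usage with a dummy stand-in adapter over n = 8 tokens and 4 classes.
rng = np.random.default_rng(0)
attn = [a / a.sum(axis=1, keepdims=True) for a in rng.random((6, 8, 8))]
dummy_adapter = lambda f: rng.standard_normal((f.shape[1], f.shape[2], 4))
print(toa_transfer(attn, dummy_adapter))
```

Because only the OLA matrices are exchanged, the same trained adapter can in principle be applied to any model from which attention maps can be extracted.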
4. Empirical Findings and Performance Analysis
Spoken Language Understanding (SLU) Benchmarks
On SNIPS (14,000 utterances, 7 intents) and ATIS (5,000 utterances, 17 intents):
| Model Variant | Overall Frame Acc. (SNIPS) | Overall Frame Acc. (ATIS) |
|---|---|---|
| BiLSTM+decoder | 85.9 % | 85.0 % |
| + Label-attn & shallow cat | 87.9 % | 85.9 % |
| + 1st-order attention | 88.1 % | 87.0 % |
| + Bilinear block (2nd) | 88.6 % | 87.3 % |
| + Dynamic fusion | 89.4 % | 87.6 % |
| + ELU (inf. order, HAN) | 90.4 % | 88.1 % |
| HAN + BERT | 93.5 % | 89.3 % |
Bilinear (second-order) and infinite-order attention consistently improve both slot F1 and intent accuracy. Integrating bilinear attention blocks into state-of-the-art joint models (Stack-Propagation, Co-Interactive) yields further consistent gains.
Ablation studies demonstrate increased robustness to learning rate variation and improved attention map sharpness (i.e., more focused keyword alignment).
Cross-LLM Commonality
Visualization and quantitative analysis of OLA matrices across models (BERT, RoBERTa, Electra, Qwen2, Gemma2, LLaMA3) confirm that same-sentence, same-order OLAs are highly similar across models, despite architectural and pre-training variations.
Quantitatively:
- Classification Proxy: ResNet-18 distinguishes sentence classes with >90 % accuracy using OLA maps and generalizes across models; raw rollout and norm-based approaches perform worse.
- Image-retrieval Proxy: First-order OLA yields Hits@5 > 95 % in inter-model map retrieval, confirming latent alignment.
Downstream Transfer via TOA
Transferable OLA Adapter (TOA) demonstrates zero-parameter cross-model transfer:
| Task (metric) | Zero-shot Baseline | TOA (cross-LM) |
|---|---|---|
| RE, SemEval (accuracy) | ~7 % | ~35 % |
| NER, CoNLL (F1) | ~5 % | ~50 % |
| DP (UAS / LAS) | 10 % | 60 % / 40 % |
| POS (accuracy) | 1 % | 70 % |
TOA achieves 90 % of its self-test performance on many source-target pairs, without any model-specific fine-tuning.
5. Linguistic and Theoretical Insights
First-order OLA aligns closely with syntactic dependency structure: in probing experiments, first-order maps alone support dependency parsing with meaningful UAS/LAS for both MLMs and CLMs, though to differing degrees. Higher orders encode less syntax and more residual or background aggregation. Empirically, gains beyond second order tend to saturate, and deeper stacking of bilinear blocks leads to overfitting.
This suggests that the first order captures essential linguistic scaffolding, with diminishing returns for higher orders, especially in architectures with strong skip connections.
6. Broader Implications and Potential Applications
The OLA formalism uncovers a shared latent scaffolding across pretrained transformers, independent of training corpus or architecture. Notable implications include:
- Unified Model Probing: OLA acts as a model-agnostic feature for investigating shared and idiosyncratic properties across LLMs.
- Model Similarity Metrics: OLA can serve as a basis for quantifying model proximity, potentially guiding checkpoint selection or domain adaptation strategies.
- Privacy-Preserving Adaptation: Since TOA adapters see only OLA matrices—and not original embeddings or tokens—they offer privacy advantages in sensitive applications.
- Dynamic Adapter Composition: Adapters trained on smaller or open models generalize to large, even closed-source models via OLA as the input interface, enabling rapid deployment without re-training.
- Linguistic Diagnostics: OLA enables model-agnostic probing of syntax, and potentially semantics and discourse.
Broader implications include the formalization of context-aggregation patterns as a core axis of model similarity and the prospect of constructing universally applicable adapters or analytic tools for LLMs.
7. Limitations and Sensitivities
While OLA provides robust improvements, stacking more than two bilinear layers leads to overfitting, and infinite-order expansion with ELU must be regularized. Higher-order components rapidly lose alignment with interpretable linguistic phenomena. The transferability of OLA-based adapters is strongest for first- and second-order, while higher orders mainly encode model-internal aggregation bias. OLA-derived diagnostics rely on precise calculation of attention matrices, and attention-sink phenomena may confound high-order rollout for very deep models.
Overall, Order-Level Attention unifies families of higher-order neural interaction mechanisms and opens a promising direction for model-agnostic feature extraction, transfer learning, and linguistic probing.