Transferable OLA Adapter for Cross-LM Transfer

Updated 14 November 2025
  • Transferable OLA Adapter (TOA) is a structure-based method that uses universal order-level attention (OLA) to enable zero-shot transfer across language models without any training on the target model.
  • It extracts and normalizes first- and second-order OLA representations and processes them with axial transformers to support various downstream tasks.
  • Empirical results show significant performance gains in relation extraction, NER, dependency parsing, and POS tagging across diverse language model architectures.

The Transferable OLA Adapter (TOA) is a cross-language-model adapter that leverages universal order-level attention (OLA) representations to enable robust, zero-shot transfer of downstream task capabilities between distinct LLMs: once trained on a source model, it is applied to new models without further training. TOA exploits the observation that, despite differences in architecture and parameterization, LMs trained on similar objectives induce highly similar context-aggregation patterns when these are expressed as order-decomposed attention statistics. By abstracting away the source model and treating OLA as a shared, latent structural feature space, TOA enables adapters to generalize to unseen models without additional parameter updates or fine-tuning.

1. Order-Level Attention: Mathematical Definition and Properties

Order-Level Attention (OLA) is a family of context-aggregation matrices derived from the layer-wise compositions of a Transformer LM’s attention-weight matrices, decomposed by interaction “order.” Given $N$ layers and input length $L$, the averaged attention matrix at layer $i$ is $A^{(i)} \in \mathbb{R}^{L \times L}$, with $I$ the identity.

Standard attention rollout forms the aggregated matrix:

$$\hat{A} = \prod_{i=1}^{N} \left(A^{(i)} + I\right)$$

Expanding by the binomial theorem and collecting terms by order $k$ (the number of attention-matrix multiplications), the $k$-th-order OLA is:

$$\hat{A}^{(k)} = \frac{1}{\binom{N}{k}} \sum_{1 \leq i_1 < \cdots < i_k \leq N} A^{(i_k)} \cdots A^{(i_1)}$$

with $\hat{A}^{(0)} = I$. Thus, $\hat{A}^{(1)}$ is the mean single-layer attention, $\hat{A}^{(2)}$ averages all pairwise compositions, and so forth, up to $k = N$. The total rollout is:

$$\hat{A} = \sum_{k=0}^{N} \binom{N}{k} \hat{A}^{(k)}$$

Across LLMs with different layer or head counts, the same-order OLA matrices exhibit remarkable structural similarity, particularly at low orders (notably $k = 1$) (Liang et al., 7 Nov 2025).
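
To make the decomposition concrete, the sketch below computes the $k$-th-order OLA from a list of per-layer attention matrices by averaging all ordered $k$-fold compositions. It is a minimal illustration, assuming the matrices are supplied as square torch tensors (e.g., averaged per layer), and it is only practical for the low orders ($k = 1, 2$) that TOA uses, since the number of terms grows as $\binom{N}{k}$.

```python
from itertools import combinations
from math import comb

import torch

def order_level_attention(attn, k):
    """k-th-order OLA: mean of A^(i_k) @ ... @ A^(i_1) over all 1 <= i_1 < ... < i_k <= N.

    attn: list of N per-layer attention matrices, each of shape (L, L).
    """
    L = attn[0].shape[-1]
    if k == 0:
        return torch.eye(L)                            # zeroth order is the identity
    acc = torch.zeros(L, L)
    for layers in combinations(range(len(attn)), k):   # ascending indices i_1 < ... < i_k
        prod = torch.eye(L)
        for i in layers:                               # left-multiply so deeper layers sit leftmost
            prod = attn[i] @ prod
        acc += prod
    return acc / comb(len(attn), k)                    # normalize by the number of k-subsets
```

For $k = 1$ this reduces to the mean single-layer attention matrix, which is the order the cross-LM similarity results below emphasize.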

2. Cross-LM Commonality and Syntactic Encoding in OLA

For models trained on comparable objectives, same-order OLA matrices encode highly similar context-aggregation patterns, as validated by multiple metrics:

  • Qualitative: Visual heatmaps of $\hat{A}^{(k)}$ for the same sentence are aligned across CLMs (e.g., LLaMA, Qwen2, Gemma2) and MLMs (BERT, RoBERTa, ELECTRA), reflecting grammatical constructs such as subject-verb dependencies.
  • Classification: A ResNet-18 classifier trained to map OLA matrices to their source sentences achieves >90% cross-LM accuracy when using first-order ($k = 1$) OLA, far surpassing rollout, IRNL, and ALTI baselines.
  • Retrieval (SSIM): Hits@1 and Hits@5 rates (matching OLA heatmaps across LMs) reach 80–97%, highest for $k = 1$; a minimal retrieval sketch follows this list.
  • Controls: Shuffling or randomizing model parameters destroys OLA similarity, confirming that it is a property of the learned LM structure.
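
As an illustration of the retrieval metric above, the sketch below computes a Hits@1 rate between two LMs: each sentence's OLA heatmap from model A should retrieve, by SSIM, the same sentence's heatmap from model B. It assumes both lists hold equally sized, row-normalized NumPy arrays (values in $[0, 1]$); the paper's exact matching protocol may differ.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def hits_at_1(olas_a, olas_b):
    """Fraction of sentences whose OLA heatmap from LM A has, as its most
    SSIM-similar heatmap from LM B, the heatmap of the same sentence."""
    hits = 0
    for i, a in enumerate(olas_a):
        # row-normalized attention maps lie in [0, 1], hence data_range=1.0
        scores = [ssim(a, b, data_range=1.0) for b in olas_b]
        hits += int(np.argmax(scores) == i)
    return hits / len(olas_a)
```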

Furthermore, OLA encodes syntactic relationships: dependency parsers using only first-order OLA achieve >80% UAS / 72% LAS on MLMs and >60% UAS / 48% LAS on CLMs; higher $k$ reduces this syntactic mapping effect, and full rollout underperforms (Liang et al., 7 Nov 2025).

3. Architecture and Processing Pipeline of TOA

TOA consists of a feature extractor (adapter) that ingests OLA representations, and task-specific heads, configured as follows:

Input OLA Representation:

  • Concatenate the first- and second-order OLA into a $2 \times L \times L$ tensor.
  • Clip outlier values ($> \mu + 3\sigma$ per row).
  • Apply row-sum normalization.
  • Resize to $50 \times 50$ (to standardize input size).
  • Mask with a strict lower-triangular matrix to avoid superficial pattern exploitation (a preprocessing sketch follows this list).
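
The following is a minimal sketch of the preprocessing steps just listed, assuming the first- and second-order OLA arrive as $L \times L$ torch tensors; the exact ordering of operations and the bilinear resize are assumptions consistent with the list above.

```python
import torch
import torch.nn.functional as F

def preprocess_ola(ola_1, ola_2, size=50):
    """Stack first/second-order OLA, clip per-row outliers, row-normalize,
    resize to size x size, and apply a strict lower-triangular mask."""
    x = torch.stack([ola_1, ola_2])                        # 2 x L x L
    mu = x.mean(dim=-1, keepdim=True)
    sigma = x.std(dim=-1, keepdim=True)
    x = torch.minimum(x, mu + 3 * sigma)                   # clip row-wise outliers (> mu + 3*sigma)
    x = x / x.sum(dim=-1, keepdim=True).clamp_min(1e-8)    # row-sum normalization
    x = F.interpolate(x.unsqueeze(0), size=(size, size),   # resize to 50 x 50
                      mode="bilinear", align_corners=False).squeeze(0)
    mask = torch.tril(torch.ones(size, size), diagonal=-1) # strict lower triangle
    return x * mask                                        # 2 x size x size
```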

Adapter Backbone:

  • 1x1 convolution projects channels to 768 dimensions.
  • 5 axial-transformer layers (for CLMs) or 3 (for MLMs) process the $2 \times 50 \times 50$ tensor, yielding a $768 \times 50 \times 50$ feature map.
  • Extract the diagonal as $F_l \in \mathbb{R}^{768 \times 50}$ (sequence representation); see the sketch below.
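
The channel projection and diagonal readout can be sketched as below; the axial-transformer layers are omitted for brevity, and the random input stands in for the preprocessed OLA tensor from the sketch above.

```python
import torch
import torch.nn as nn

x = torch.rand(1, 2, 50, 50)                      # preprocessed first/second-order OLA (batch of 1)

proj = nn.Conv2d(in_channels=2, out_channels=768, kernel_size=1)  # 1x1 conv: 2 -> 768 channels
feat = proj(x)                                    # 1 x 768 x 50 x 50 feature map
# ... 3 (MLM) or 5 (CLM) axial-transformer layers would refine `feat` here ...
F_l = torch.diagonal(feat, dim1=-2, dim2=-1)      # 1 x 768 x 50: per-token sequence representation
```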

Task Heads:

  • Relation Extraction: Entity-based feature extraction, MLP, with cross-entropy loss.
  • Named Entity Recognition: BIO tagging, token-wise MLP with cross-entropy loss.
  • Dependency Parsing: Biaffine modules for head and relation prediction, with cross-entropy over heads and labels (a minimal biaffine sketch follows this list).
  • POS Tagging: Token-wise MLP.
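
As a concrete example of one head, the sketch below shows a biaffine arc scorer of the kind named above for dependency parsing; the hidden size, the bias handling, and the omission of the relation-label scorer are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class BiaffineArcHead(nn.Module):
    """Minimal biaffine arc scorer: scores every candidate syntactic head per token."""
    def __init__(self, dim=768, hidden=256):
        super().__init__()
        self.head_mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.dep_mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.U = nn.Parameter(torch.randn(hidden + 1, hidden) * 0.01)

    def forward(self, seq):                        # seq: B x T x dim token features
        h = self.head_mlp(seq)                     # token-as-head representations
        d = self.dep_mlp(seq)                      # token-as-dependent representations
        d = torch.cat([d, torch.ones_like(d[..., :1])], dim=-1)  # append bias feature
        return d @ self.U @ h.transpose(1, 2)      # B x T x T arc scores

# Usage with the diagonal features F_l (transposed to tokens-first):
#   scores = BiaffineArcHead()(F_l.transpose(1, 2))                        # 1 x 50 x 50
#   loss = nn.functional.cross_entropy(scores.flatten(0, 1), gold_heads.flatten())
```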

4. Zero-Training Cross-LM Transfer Protocol

The Transferable OLA Adapter is trained once on OLA representations from a source LM, over 15 epochs with learning rates in $\{1 \times 10^{-4},\ 3 \times 10^{-5},\ 1 \times 10^{-5}\}$. Once trained, it is frozen and directly applied to OLA representations from any target LM, with no parameter updates or task-specific adaptation.
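
A minimal sketch of this protocol is given below; the optimizer, the data-loader interface, and the cross-entropy objective are illustrative assumptions, and `adapter`/`head` stand for the modules described in Section 3.

```python
import torch
import torch.nn.functional as F

def train_once_then_transfer(adapter, head, source_loader, target_loader,
                             epochs=15, lr=1e-4):
    """Train adapter + task head on source-LM OLA inputs, then apply them
    frozen to OLA inputs extracted from a different, unseen target LM."""
    opt = torch.optim.Adam(list(adapter.parameters()) + list(head.parameters()), lr=lr)
    for _ in range(epochs):                        # e.g. 15 epochs on the source LM
        for ola, labels in source_loader:          # OLA tensors from the source LM
            loss = F.cross_entropy(head(adapter(ola)), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()

    adapter.eval()
    head.eval()
    preds = []
    with torch.no_grad():                          # zero-training transfer: no updates
        for ola, _ in target_loader:               # OLA tensors from the unseen target LM
            preds.append(head(adapter(ola)).argmax(dim=-1))
    return preds
```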

5. Empirical Results: Transferability and Performance

Setup:

  • Tested on CLMs (Qwen2, Gemma2, Llama3.2, Llama3.1) and MLMs (BERT, RoBERTa, ELECTRA) of various sizes.
  • Evaluated on: SemEval-2010 Task 8 Relation Extraction, CoNLL-2012 NER, UD English EWT Dependency Parsing, and CoNLL-2000 POS.

Cross-Model Transfer Results:

  • Relation extraction:
    • CLMs: prompt-based zero-shot 5–18% → TOA 18–35% (gain +17–30 points).
    • MLMs: zero-shot 0–7% → TOA 17–41%.
  • NER:
    • CLMs: zero-shot 1–54% → TOA 9–55%.
    • MLMs: zero-shot 0% → TOA 9–68% F1.
  • Dependency Parsing:
    • CLMs: zero-shot UAS ~7% → TOA 37–65%.
    • MLMs: ~1% → 58–81%.
  • POS:
    • CLMs: zero-shot 4–60% → TOA 34–74%.
    • MLMs: ~0.5% → 40–85%.

The performance cost of cross-LM transfer is minimal; e.g., for BERT-large relation extraction, self-transfer reaches 32.3% vs. 29.9% for cross-BERT transfer. TOA outperforms few-shot prompting and the CMC delta-LM method on all tasks. Ablations confirm that OLA similarity (and hence transferability) is robust to dataset and sequence length, and is lost only under parameter perturbation (Liang et al., 7 Nov 2025).

6. Interpretive Significance and Implications

The key finding is that OLAs, especially low-order terms, act as a latent universal language interface among Transformer LMs. Model-specific idiosyncrasies in architecture, depth, or parameterization are factored out at the OLA level, enabling a fixed adapter to generalize across families of models. A plausible implication is that practitioners can, by training adapters on OLA rather than hidden states or logits, rapidly transfer structured linguistic knowledge between models—without access to their weights or additional data. This strongly supports the use of OLA as a universal feature space for cross-model transfer.

Another implication is OLA’s direct mapping to syntactic structure. Since first-order OLA encodes syntactic parent-child dependencies, transfer via TOA carries both context-aggregation and shallow syntactic-parsing information. This clarifies why improvements are largest for structure-sensitive tasks (NER, dependency parsing) and less pronounced for tasks that depend only on very local context.

7. Constraints, Limitations, and Domain of Applicability

TOA is currently effective on sentence-level tasks with moderate-length inputs ($L \leq 50$ after resizing). Performance gains diminish as input length increases or for tasks requiring semantic knowledge not readily captured in attention patterns. Robustness is contingent on the presence of trained, non-pathological attention matrices; randomized or non-Transformer models do not share OLA similarities and are out of distribution.

TOA does not update target LM parameters or leverage target LM outputs, implying all downstream improvement is constrained to what is encoded in the OLA statistics. In scenarios where downstream tasks are highly model-dependent or require long-context reasoning (requiring high-order OLA terms more sensitive to model design), the benefits may attenuate.

Table: Summary of Cross-LM TOA Transfer Improvements

| Task | Baseline (zero-shot) | TOA transfer | Typical Δ (pts) |
|---|---|---|---|
| Relation Extraction | 0–18% | 17–41% | 17–30 |
| NER | 0–54% | 9–68% | 9–55 |
| Dependency Parsing | 1–7% | 37–81% | 36–80 |
| POS Tagging | 0.5–60% | 34–85% | 33–60 |

These results establish the empirical validity of TOA as a foundation for parameter-free, structure-based cross-language-model adaptation.

References

Liang et al., 7 Nov 2025.