
Adaptive SEKA: Dynamic Attention Steering

Updated 4 March 2026
  • Adaptive SEKA is a training-free, query-adaptive framework for attention steering in transformer models that dynamically assembles subspaces based on prompt semantics.
  • It extends SEKA by leveraging SVD-derived expert subspaces to modulate key embeddings without explicitly constructing the full attention matrix.
  • The method achieves state-of-the-art prompt-highlighting performance with minimal latency and memory overhead, ensuring compatibility with efficient kernels like FlashAttention.

Adaptive SEKA (AdaSEKA) is a training-free, query-adaptive framework for attention steering in transformer-based LLMs. Developed as an extension of Spectral Editing Key Amplification (SEKA), AdaSEKA enables dynamic modulation of attention by adaptively routing the model’s focus across multiple specialized subspaces according to the semantic intent of prompts. The method directly edits key embeddings prior to attention computation, circumventing the need for explicit construction of the full attention matrix and maintaining compatibility with memory- and latency-efficient implementations, notably FlashAttention. AdaSEKA achieves significant performance gains on standard prompt-highlighting tasks while incurring minimal latency and memory overhead (Li et al., 1 Mar 2026).

1. Foundations and Motivation

Attention steering addresses the control of transformer model focus, facilitating prompt highlighting wherein the model prioritizes specific user-designated text segments. Existing post-hoc attention-steering approaches typically require forming the complete $T \times T$ attention matrix, resulting in prohibitive memory and computational costs incompatible with efficient attention kernels. SEKA introduces a spectral decomposition-based solution that obviates these limitations by operating directly in key embedding space.

AdaSEKA generalizes SEKA to accommodate the diverse “styles of relevance” invoked by varying prompt types (e.g., instruction following, factual recall, bias correction). Instead of static subspace projections, AdaSEKA adaptively assembles dynamic projectors by combining outputs from multiple SVD-derived expert subspaces, based on a routing mechanism responsive to prompt semantics.

2. Technical Methodology

Spectral Editing Key Amplification (SEKA)

SEKA operates in two phases:

  1. Offline SVD Subspace Discovery: For each transformer layer $l$ and attention head $h$, triplets of key embeddings are extracted from contrastive prompts, corresponding to neutral, positive, and negative relevance contexts. Cross-covariance matrices $\Sigma^{+}_{l,h}$ and $\Sigma^{-}_{l,h}$ are constructed for positive and negative contexts, respectively. Singular value decompositions yield orthonormal projectors onto “relevance” ($P^{+}_{l,h}$) and “irrelevance” ($P^{-}_{l,h}$) subspaces, selected based on variance retention criteria.
  2. Inference-Time Application: Keys $k_j$ for highlighted tokens are affinely transformed as $k'_j = k_j + g^{+} P^{+}_{l,h} k_j + g^{-} P^{-}_{l,h} k_j$, amplifying or suppressing their attention scores relative to query vectors. This low-rank operation is both memory- and compute-efficient, never requiring explicit construction of the attention matrix.
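
The two phases above can be sketched in NumPy for a single layer and head. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the use of left singular vectors, and the cumulative singular-value rule used to pick the rank are all assumptions standing in for the "variance retention criteria" mentioned above.

```python
import numpy as np

def relevance_projector(neutral_keys, context_keys, gamma=0.95):
    """Offline phase (sketch): projector from a cross-covariance SVD.

    neutral_keys, context_keys: (n, d) key embeddings for paired prompts.
    gamma: fraction of singular-value mass to keep (assumed criterion).
    """
    n = neutral_keys.shape[0]
    cross_cov = neutral_keys.T @ context_keys / n    # (d, d)
    U, S, _ = np.linalg.svd(cross_cov)
    cumulative = np.cumsum(S) / np.sum(S)
    # Smallest rank whose cumulative mass reaches gamma.
    r = min(int(np.searchsorted(cumulative, gamma)) + 1, S.size)
    U_r = U[:, :r]
    return U_r @ U_r.T                               # orthogonal projector, (d, d)

def steer_keys(keys, highlight_mask, P_pos, P_neg, g_pos=0.5, g_neg=0.0):
    """Inference phase: k'_j = k_j + g+ P+ k_j + g- P- k_j for highlighted j."""
    edited = keys.copy()
    k = keys[highlight_mask]                         # (n_highlighted, d)
    # Projectors are symmetric, so the row-vector form k @ P equals (P k)^T.
    edited[highlight_mask] = k + g_pos * (k @ P_pos) + g_neg * (k @ P_neg)
    return edited

# Toy usage with synthetic keys (illustrative data only).
rng = np.random.default_rng(0)
neutral = rng.normal(size=(64, 16))
positive = neutral + rng.normal(size=(64, 16))
negative = neutral - rng.normal(size=(64, 16))
P_pos = relevance_projector(neutral, positive)
P_neg = relevance_projector(neutral, negative)

keys = rng.normal(size=(10, 16))
mask = np.zeros(10, dtype=bool)
mask[3:6] = True                                     # tokens 3..5 are highlighted
steered = steer_keys(keys, mask, P_pos, P_neg, g_pos=0.5, g_neg=0.0)
```

Only the key rows of highlighted tokens change, and the edit is a matrix product on the keys themselves, which is why no $T \times T$ attention matrix is ever formed.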

Adaptive SEKA (AdaSEKA)

AdaSEKA builds $M$ experts, each corresponding to an SVD-derived subspace bank computed from a task-specific dataset. At inference:

  • The query vector $q_{L,h}$ from the last prompt token is compared to the top-$K$ singular vectors of each expert via an alignment score $\alpha_{m,l,h}(q)$. This normalized score lies in $[-1,1]$ and modulates each expert’s contribution.
  • The dynamic projector $P^{\rm dyn}_{l,h}(q) = \sum_{m=1}^{M} \alpha_{m,l,h}(q)\, U^{+,m}_{l,h}{}_{:,1:K} \left(U^{+,m}_{l,h}{}_{:,1:K}\right)^{\top}$ is constructed.
  • Each highlighted key undergoes the transformation $k'_j = k_j + g\,P^{\rm dyn}_{l,h}(q)\, k_j$, effecting query-adaptive attention steering without any gradient-based optimization.

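Because every alignment score is normalized to $[-1,1]$, each expert's contribution to the dynamic projector is bounded, which keeps the routing stable and makes the per-expert weights directly interpretable. No further regularization is required beyond alignment normalization.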
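
The routing steps can be sketched as follows for one layer and head. The source does not spell out the alignment formula, so `alignment_score` below is a hypothetical choice (a singular-value-weighted mean of cosines between the query and each expert's top-$K$ singular vectors) picked only because it is guaranteed to lie in $[-1,1]$; the function names and the `(U, S)` expert representation are likewise assumptions.

```python
import numpy as np

def alignment_score(q, U_k, S_k):
    """Hypothetical alignment in [-1, 1]: weighted mean of cosines.

    q: (d,) query vector; U_k: (d, K) orthonormal columns; S_k: (K,) weights >= 0.
    """
    cosines = (U_k.T @ q) / (np.linalg.norm(q) + 1e-12)
    return float(np.sum(S_k * cosines) / np.sum(S_k))

def dynamic_projector(q, experts, K=5):
    """P_dyn(q) = sum_m alpha_m(q) * U_m[:, :K] @ U_m[:, :K].T"""
    d = q.shape[0]
    P = np.zeros((d, d))
    for U, S in experts:                  # one (U, S) pair per expert
        U_k, S_k = U[:, :K], S[:K]
        P += alignment_score(q, U_k, S_k) * (U_k @ U_k.T)
    return P

def steer_keys_adaptive(keys, highlight_mask, P_dyn, g=0.5):
    """k'_j = k_j + g * P_dyn k_j for highlighted tokens only."""
    edited = keys.copy()
    k = keys[highlight_mask]
    edited[highlight_mask] = k + g * (k @ P_dyn)   # P_dyn is symmetric
    return edited

# Toy usage: M = 4 synthetic experts with random orthonormal bases.
rng = np.random.default_rng(0)
d, M, K = 16, 4, 5
experts = []
for _ in range(M):
    basis, _ = np.linalg.qr(rng.normal(size=(d, d)))
    singular_values = np.sort(rng.uniform(0.1, 1.0, size=d))[::-1]
    experts.append((basis, singular_values))

q = rng.normal(size=d)                    # query of the last prompt token
alphas = [alignment_score(q, U[:, :K], S[:K]) for U, S in experts]
P_dyn = dynamic_projector(q, experts, K=K)
```

Because the projector is rebuilt from the query at inference time, changing the prompt changes the expert mixture without touching any model weights.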

3. Computational Efficiency and Implementation

SEKA and AdaSEKA are designed for high efficiency in both latency and memory usage:

| Method | Latency increase (Qwen3-8B) | Memory overhead |
|---|---|---|
| SEKA | +0.03 s | $O(H r d)$ (few MB) |
| AdaSEKA | +0.27 s | $O(L H M d K)$ (modest) |
| PASTA (post-hoc baseline) | +1.03 s | Doubles memory |

  • Standard attention costs $O(L H\,T^2 d)$ computation and $O(T^2)$ memory.
  • SEKA and AdaSEKA add only $O(H\,r\,d\,T)$ and $O(H\,M\,K\,d)$ additional operations, respectively, and minimal parameter storage.
  • AdaSEKA maintains compatibility with FlashAttention, in contrast to approaches like PASTA that require explicit attention matrices and are not practical for large-scale or optimized inference (Li et al., 1 Mar 2026).
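
To give a feel for the scale of these terms, the short sketch below counts stored entries for the $O(L H M d K)$ expert bank and for one explicit $T \times T$ attention matrix. The model shape used here (`L=36`, `H=8`, `d=128`), the sequence length, and the 2-byte element size are illustrative placeholders rather than figures from the paper, and constant factors are ignored.

```python
def attention_matrix_elems(T):
    """Entries in one explicit T x T attention matrix (single layer, single head)."""
    return T * T

def adaseka_bank_elems(L, H, M, d, K):
    """Entries in the expert bank: K basis vectors of size d per (layer, head, expert)."""
    return L * H * M * d * K

def to_mib(elems, bytes_per_elem=2):
    """Convert an element count to MiB, assuming 2-byte (half-precision) storage."""
    return elems * bytes_per_elem / 2**20

# Illustrative shape values (assumptions), with M = 4 and K = 5.
bank = adaseka_bank_elems(L=36, H=8, M=4, d=128, K=5)
attn = attention_matrix_elems(T=4096)

print(f"expert bank:            {to_mib(bank):.2f} MiB in total")
print(f"one explicit attention: {to_mib(attn):.2f} MiB per layer and head")
```

Under these assumed values the entire expert bank is smaller than a single materialized attention matrix for one head, which is consistent with the "modest" overhead reported in the table.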

4. Empirical Results

AdaSEKA and SEKA attain state-of-the-art results on established prompt-highlighting benchmarks (CounterFact, Bias in Bios, Pronoun Changing):

| Task (Qwen3-4B) | Baseline | SEKA | AdaSEKA |
|---|---|---|---|
| CounterFact Efficacy (%) | 57.70 | 99.02 | 98.90 |
| Bias in Bios Accuracy (%) | 82.94 | 91.02 | 91.86 |
| Pronoun Overlap (%) | 95.76 | 95.18 | 94.54 |
| All-pronouns (%) | 93.88 | -- | 92.08 |

  • In lost-in-the-middle recall, SEKA applied to central passage segments inverts the standard U-shaped recall profile, elevating central exact match from $\sim 0.3$ to $0.5$–$0.65$. PASTA does not effect a comparable improvement.
  • Ablation of adaptive routing (i.e., using only a single static expert) results in 5–10 point drops in performance, demonstrating the importance of dynamic subspace combination (Li et al., 1 Mar 2026).

5. Hyperparameter Recommendations

Empirical tuning yields robust configurations across tasks and architectures:

  • Number of experts ($M$): 4 (corresponding to, e.g., factual recall, instruction following, multi-hop QA, and bias correction) is found to balance diversity and overhead.
  • Subspace dimension ($K$): $K=5$ is sufficient; performance saturates beyond $K\approx 8$.
  • Variance threshold ($\gamma$): $0.85$–$0.99$ (typically $0.95$) for subspace selection.
  • Head selection threshold ($\delta$): $0.1$–$0.3$ for optimal head participation (30–60 heads/model).
  • Steering gain ($g$, $g^+$, $g^-$): gains in $[0.2, 1.5]$ avoid instability; $g\approx 0.5$ is robust for AdaSEKA.
  • Contrastive sample size: Only 50–100 samples per projector are required for stable SVD estimation.

These settings collectively ensure AdaSEKA achieves effective, interpretable, and computationally efficient attention steering (Li et al., 1 Mar 2026).
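
The recommendations above could be collected into a small configuration object like the one below. This is a hypothetical sketch: the class and field names are invented for illustration, the `head_threshold` default of 0.2 is simply the midpoint of the reported range, and the validation bounds mirror the ranges listed above rather than any published API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AdaSekaConfig:
    """Hypothetical container for the recommended AdaSEKA settings."""
    num_experts: int = 4               # M
    subspace_dim: int = 5              # K
    variance_threshold: float = 0.95   # gamma, reported range 0.85-0.99
    head_threshold: float = 0.2        # delta, reported range 0.1-0.3 (midpoint assumed)
    gain: float = 0.5                  # g, reported stable range [0.2, 1.5]
    contrastive_samples: int = 100     # 50-100 samples per projector

    def __post_init__(self):
        if self.num_experts < 1 or self.subspace_dim < 1:
            raise ValueError("num_experts and subspace_dim must be positive")
        if not 0.0 < self.variance_threshold <= 1.0:
            raise ValueError("variance_threshold must lie in (0, 1]")
        if not 0.2 <= self.gain <= 1.5:
            raise ValueError("gain outside the range reported as stable")

default_config = AdaSekaConfig()
stronger_config = AdaSekaConfig(gain=1.0, subspace_dim=8)
```

Rejecting out-of-range gains at construction time is a design choice made for this sketch, so that an unstable setting fails early instead of silently degrading generation.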

6. Impact and Significance

AdaSEKA establishes a principled, lightweight paradigm for prompt-conditioned attention steering in LLMs. Its compatibility with memory-efficient attention kernels such as FlashAttention, coupled with training-free operation and strong empirical performance, addresses central challenges in controllable language modeling. Because AdaSEKA provides interpretable routing among multiple semantically specialized subspaces, it offers a mechanism to tune model focus in a task- and prompt-adaptive manner, with implications for bias mitigation, information retrieval, and multi-style response generation.

A plausible implication is that AdaSEKA’s modularity and minimal overhead make it adaptable to a broad range of transformer architectures and control desiderata. Empirical findings suggest further exploration of expert subspace diversity and adaptive routing criteria may enhance generalization and task-specific controllability (Li et al., 1 Mar 2026).
