Spectral Editing Key Amplification (SEKA)
- SEKA is a training-free method enabling precise attention steering in transformers by editing key embeddings via spectral decomposition for focused prompt highlighting.
- It leverages low-rank projectors derived from SVD to amplify relevant and suppress irrelevant token contributions, maintaining compatibility with memory-efficient kernels like FlashAttention.
- Adaptive SEKA (AdaSEKA) extends this approach with query-adaptive multi-expert routing to dynamically tailor attention steering across various semantic tasks.
Spectral Editing Key Amplification (SEKA) is a training-free method for attention steering in transformer architectures. SEKA intervenes directly in the key embedding space using spectral decomposition techniques, allowing for low-rank, geometrically interpretable control of attention focus without modifying model weights or requiring full-matrix attention computation. This approach is particularly compatible with memory-efficient attention kernels such as FlashAttention. An adaptive extension, AdaSEKA, further introduces dynamic, query-adaptive multi-expert routing for context-specific highlighting and robust control across multiple semantic tasks (Li et al., 1 Mar 2026).
1. Motivation and Core Principles
SEKA is designed to enable efficient, precise, and geometrically grounded steering of transformer attention in scenarios such as prompt highlighting, where model focus must be directed to user-specified tokens. The principal motivations for SEKA are:
- Training-Free Steering: SEKA alters only key embeddings at inference time, avoiding fine-tuning or parameter addition.
- Pre-Attention Intervention: It modifies the key vectors before the scaling and softmax steps, so the full attention matrix never has to be materialized, which is essential for memory-efficient kernels (e.g., FlashAttention).
- Low-Rank Control: Empirical analysis reveals that changes in token relevance under different prompts concentrate in a low-dimensional subspace of key embeddings. SEKA exploits spectral decomposition to target those directions.
- Model Compatibility: The method operates solely on query (Q) and key (K) embeddings, leaving downstream value and MLP computations untouched, so the model’s behavior changes only through the steered attention.
2. Mathematical Foundations
The SEKA method is grounded in spectral analysis of key embedding dynamics across “neutral,” “relevant,” and “irrelevant” prompt conditions. The main steps are as follows:
- Cross-Covariance and SVD: For each layer $\ell$ and attention head $h$, three sets of key embeddings $K^{0}$, $K^{+}$, and $K^{-}$ are collected under neutral, relevant, and irrelevant prompts, respectively. Cross-covariance matrices $\Omega_+$ and $\Omega_-$, both in $\mathbb{R}^{d_k \times d_k}$, are computed and decomposed via singular value decomposition (SVD).
- Rank Selection Criteria: Using a variance retention threshold $\gamma$, the smallest ranks $k_+$ (counting the largest singular values of $\Omega_+$) and $k_-$ (counting the smallest singular values of $\Omega_-$) are selected such that the respective cumulative singular values exceed $\gamma$ times the total.
- Projector Construction: Two projectors are defined: $P_+$ (relevant directions) and $P_-$ (irrelevant directions).
- Key Editing Operation: At inference, the key vector $k$ is edited as $k' = k + g_+ P_+ k + g_- P_- k$, where $g_+$ and $g_-$ are scalar gains (with $g_-$ typically set to suppress irrelevant directions). The altered key vector amplifies or suppresses attention logits, which is geometrically interpreted as boosting the relevant component while controlling the irrelevant one.
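The effect of the key edit on attention logits can be made explicit with a short expansion. This is a sketch under stated assumptions rather than a derivation taken from the source: it assumes the additive edit $k' = k + g_+ P_+ k + g_- P_- k$, orthogonal projectors (symmetric and idempotent), and the standard scaled dot-product logit; the query $q$ and key dimension $d_k$ are notation introduced here.

```latex
\begin{aligned}
\frac{q^\top k'}{\sqrt{d_k}}
  &= \frac{q^\top k}{\sqrt{d_k}}
   + g_+ \frac{q^\top P_+ k}{\sqrt{d_k}}
   + g_- \frac{q^\top P_- k}{\sqrt{d_k}} \\
  &= \frac{q^\top k}{\sqrt{d_k}}
   + g_+ \frac{(P_+ q)^\top (P_+ k)}{\sqrt{d_k}}
   + g_- \frac{(P_- q)^\top (P_- k)}{\sqrt{d_k}}
\end{aligned}
```

The second line uses $P^\top P = P$ for an orthogonal projector. Under these assumptions the logit shift depends only on the parts of $q$ and $k$ that lie inside each subspace, so a key with no component in the relevant subspace is left unchanged by the $g_+$ term.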
3. Algorithmic Implementation and FlashAttention Compatibility
Offline Preparation
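- Assemble a small set (approximately 50–200) of contrastive synthetic prompts, generating triplets $(K^{0}, K^{+}, K^{-})$ of neutral, relevant, and irrelevant key embeddings for each layer-head pair.
- Compute $\Omega_+$, $\Omega_-$ and their SVDs, determine $k_+$ and $k_-$ according to $\gamma$, and construct $P_+$, $P_-$.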
- Evaluate each head’s average relevance-sensitivity score; retain only those heads whose score exceeds a threshold $\delta_{\min}$, restricting steering to empirically relevance-sensitive heads.
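The offline steps above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation, and several choices are assumptions made here: the cross-covariance is taken between neutral keys and the relevant (or irrelevant) keys, the left singular vectors are used as the basis, the irrelevant projector uses the trailing `k_neg` directions with `k_neg` passed explicitly rather than derived from the threshold, and head selection is omitted. All function names are hypothetical, and the key matrices are random stand-ins for embeddings collected from a real model.

```python
import numpy as np

def cross_covariance(K_a, K_b):
    """Cross-covariance (d x d) between two (n x d) key matrices."""
    return K_a.T @ K_b / K_a.shape[0]

def rank_for_threshold(singular_values, gamma):
    """Smallest rank whose leading singular values sum to at least gamma * total."""
    cumulative = np.cumsum(singular_values) / np.sum(singular_values)
    rank = int(np.searchsorted(cumulative, gamma)) + 1
    return min(rank, len(singular_values))

def projector(U, columns):
    """Orthogonal projector onto the span of the selected columns of U."""
    basis = U[:, columns]
    return basis @ basis.T

def build_projectors(K_neutral, K_relevant, K_irrelevant, gamma, k_neg):
    """Return (P_pos, P_neg, k_pos) for one layer-head pair (assumed construction)."""
    U_pos, s_pos, _ = np.linalg.svd(cross_covariance(K_neutral, K_relevant))
    U_neg, _, _ = np.linalg.svd(cross_covariance(K_neutral, K_irrelevant))
    k_pos = rank_for_threshold(s_pos, gamma)
    d = U_pos.shape[0]
    P_pos = projector(U_pos, slice(0, k_pos))       # leading directions
    P_neg = projector(U_neg, slice(d - k_neg, d))   # trailing directions
    return P_pos, P_neg, k_pos

# Synthetic stand-in data: 128 prompts, key dimension 16.
rng = np.random.default_rng(0)
K_neutral = rng.normal(size=(128, 16))
K_relevant = K_neutral + 0.1 * rng.normal(size=(128, 16))
K_irrelevant = K_neutral + 0.1 * rng.normal(size=(128, 16))

P_pos, P_neg, k_pos = build_projectors(
    K_neutral, K_relevant, K_irrelevant, gamma=0.9, k_neg=2
)
```

Because each projector is built from orthonormal singular vectors, it is symmetric and idempotent, and its trace equals the selected rank. These properties make a convenient sanity check before the projectors are stored.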
Inference Workflow
- At each attention call in layer $\ell$ and head $h$, intercept the key tensor $K$ and a binary mask $m$ that marks highlighted tokens.
- For each position $i$ with $m_i = 1$, apply the editing: $k_i' = k_i + g_+ P_+ k_i + g_- P_- k_i$.
- Pass the edited $K'$ to FlashAttention. As only selected key vectors are edited and no full attention matrix is constructed, SEKA maintains full compatibility and efficiency with memory-optimized attention implementations.
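The inference workflow above amounts to a masked update of the key tensor before it reaches the attention kernel. The NumPy sketch below illustrates this for a single head, assuming the additive edit $k' = k + g_+ P_+ k + g_- P_- k$ and symmetric projectors. The function name `edit_keys`, the toy projectors, and the gain values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def edit_keys(K, mask, P_pos, P_neg, g_pos, g_neg):
    """Return a copy of K (T x d) with the edit applied only at masked positions."""
    K_edited = K.copy()
    selected = K[mask]  # rows of highlighted tokens, shape (num_highlighted, d)
    # Projectors are symmetric, so the row form k @ P equals (P @ k) transposed.
    K_edited[mask] = selected + g_pos * selected @ P_pos + g_neg * selected @ P_neg
    return K_edited

# Toy setup: the "relevant" subspace is axis 0, the "irrelevant" one is axis 3.
d = 4
P_pos = np.diag([1.0, 0.0, 0.0, 0.0])
P_neg = np.diag([0.0, 0.0, 0.0, 1.0])

K = np.ones((3, d))                      # three tokens with identical keys
mask = np.array([False, True, False])    # only the middle token is highlighted
K_edited = edit_keys(K, mask, P_pos, P_neg, g_pos=1.0, g_neg=-0.5)

# Logits for a query aligned with the relevant direction, before and after.
q = np.array([1.0, 0.0, 0.0, 0.0])
logits_before = K @ q / np.sqrt(d)
logits_after = K_edited @ q / np.sqrt(d)
```

Only the highlighted row changes, and the attention computation itself is untouched: the edited keys can be handed to any attention kernel as an ordinary key tensor.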
4. Adaptive SEKA: Query-Adaptive Multi-Expert Extension
AdaSEKA generalizes SEKA for scenarios demanding semantic adaptivity, introducing a modular “expert bank.” Each expert learns a distinct relevance subspace from a specific task (e.g., factual recall, instruction following).
Procedure Overview
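- Offline: For each expert $e$, independently replicate the SEKA SVD routine to obtain its singular vectors $U_+^{(e)}$ and singular values $\Sigma_+^{(e)}$.
- Inference:
- Extract the current query vector $q$ at each head.
- Calculate alignment scores $\alpha_e$ as the normalized sum of projections of $q$ onto the expert’s top singular vectors, weighted by singular values.
- Construct the dynamic projector: $P_{\mathrm{dyn}} = \sum_{e} \alpha_e \, U_+^{(e)} U_+^{(e)\top}$, with each $U_+^{(e)}$ restricted to its top singular vectors.
- Modify each highlighted key as $k' = k + g \, P_{\mathrm{dyn}} \, k$.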
SEKA is a special case of AdaSEKA with a single expert ($E = 1$) and a fixed routing weight ($\alpha_1 = 1$). The method flexibly routes among expert subspaces, enabling contextually aware attention steering without supervision or manual task assignment.
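The routing procedure above can be illustrated with a small NumPy sketch. The exact scoring and normalization used by AdaSEKA are not spelled out here, so the sketch makes its own assumptions: each expert's raw score is the singular-value-weighted sum of absolute projections of the query onto its top singular vectors, and scores are normalized to sum to one across experts. The function names, the `top_k` parameter, and the two toy experts are hypothetical.

```python
import numpy as np

def alignment_scores(q, experts, top_k):
    """Assumed routing rule: weighted projection magnitudes, normalized over experts."""
    raw = np.array([
        np.sum(s[:top_k] * np.abs(U[:, :top_k].T @ q))
        for U, s in experts
    ])
    return raw / raw.sum()  # assumes q is not orthogonal to every expert

def dynamic_projector(alpha, experts, top_k):
    """Score-weighted sum of each expert's top-k subspace projector."""
    d = experts[0][0].shape[0]
    P = np.zeros((d, d))
    for a, (U, _) in zip(alpha, experts):
        basis = U[:, :top_k]
        P += a * basis @ basis.T
    return P

# Two toy experts in 4 dimensions: expert A's top direction is axis 0,
# expert B's top direction is axis 1.
I4 = np.eye(4)
expert_a = (I4, np.ones(4))
expert_b = (I4[:, [1, 0, 2, 3]], np.ones(4))
experts = [expert_a, expert_b]

q = np.array([1.0, 0.0, 0.0, 0.0])   # query aligned with expert A
alpha = alignment_scores(q, experts, top_k=1)
P_dyn = dynamic_projector(alpha, experts, top_k=1)

k = np.array([1.0, 1.0, 0.0, 0.0])
g = 2.0
k_edited = k + g * P_dyn @ k          # single-gain edit of a highlighted key
```

In this toy case the query lies entirely in expert A's direction, so all routing weight goes to expert A and only the first coordinate of the key is amplified.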
5. Empirical Benchmarks and Performance
Comprehensive benchmarking on Qwen3-4B and Qwen3-8B, using FlashAttention, demonstrates the efficiency and effectiveness of SEKA and AdaSEKA relative to strong baselines such as PASTA. Evaluation metrics and results include:
| Benchmark | Metric | SEKA | AdaSEKA (if different) | PASTA |
|---|---|---|---|---|
| CounterFact | ES | 99.02% | — | 97.16% |
| CounterFact | PS | 98.61% | — | 96.03% |
| Bias in Bios | Top-1 Accuracy | 91.02% | — | 89.58% |
| Pronouns Changing | P.Score | 95.18% | Up to 99.52% | 95.82% |
| Lost-in-the-Middle | Mid-context EM | ≈0.60 | — | ≈0.20 |
- SEKA achieves near-perfect Efficacy and Paraphrase Scores on CounterFact and strong results on Bias in Bios.
- AdaSEKA’s multi-expert mechanism yields pronounced gains, notably in the Pronouns Changing benchmark (up to 99.52%).
- Overhead remains minimal: SEKA incurs +0.03 s/sample and +0.03 GB memory. PASTA imposes +1.03 s/sample and +23 GB memory, precluding FlashAttention compatibility. AdaSEKA adds ≈+0.27 s/sample.
Ablation studies confirm the necessity of learned spectral projectors and head selection. Random projections decrease performance by 5–10 points, and the elimination of SVD and filtering steps leads to catastrophic performance collapse (e.g., Pronoun A.P.Score drops from 90.52 to 36.95).
6. Practical Guidance for Deployment
For effective integration of SEKA and AdaSEKA, several practitioner recommendations arise:
- Hyperparameter Selection:
- Variance threshold $\gamma$: 0.85–0.99
- Head-selection threshold $\delta_{\min}$: 0.10–0.25
- Gains $g_+$ and $g_-$: 0.2–2.5 (with $g_-$ typically near zero)
- AdaSEKA requires only and one gain
- Data Requirements: 50–100 synthetic contrastive samples are generally sufficient to compute stable projections; additional data primarily reduces variance, not peak accuracy.
- Integration Steps:
- Precompute and store $P_+$ and $P_-$ for each selected layer-head pair $(\ell, h)$.
- Register a hook at key vector construction or as input to FlashAttention.
- Mask and edit only the key vectors for highlighted tokens during generation.
- For long-context workloads, steer only in mid-to-late layers and for heads meeting the head-selection threshold $\delta_{\min}$ to prevent excessive amplification.
- Multi-task settings benefit from AdaSEKA’s adaptive expert routing without manual subspace assignment.
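The hyperparameter ranges listed above can be kept in a small configuration object that flags values outside the suggested bounds. The sketch below is one possible arrangement, not part of SEKA itself: the class name `SekaConfig`, the field names, and the default values are all hypothetical, and only the ranges come from the guidance above.

```python
from dataclasses import dataclass

# Suggested ranges from the guidance above; the keys are hypothetical field names.
RANGES = {
    "gamma": (0.85, 0.99),           # variance retention threshold
    "head_threshold": (0.10, 0.25),  # head-selection threshold
    "gain": (0.2, 2.5),              # steering gain
}

@dataclass(frozen=True)
class SekaConfig:
    gamma: float = 0.90           # illustrative default, not from the source
    head_threshold: float = 0.15  # illustrative default, not from the source
    gain: float = 1.0             # illustrative default, not from the source

    def out_of_range(self):
        """Names of fields whose values fall outside the suggested ranges."""
        return [
            name for name, (low, high) in RANGES.items()
            if not low <= getattr(self, name) <= high
        ]

default_issues = SekaConfig().out_of_range()
aggressive_issues = SekaConfig(gamma=0.5, gain=4.0).out_of_range()
```

Returning the offending field names instead of raising keeps the check advisory, which fits ranges that are recommendations rather than hard limits.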
SEKA thus provides an immediately deployable, interpretable, and computationally efficient paradigm for attention steering in contemporary LLMs, especially where hardware and latency constraints preclude the use of full-rank or retraining-based post-hoc steering methods (Li et al., 1 Mar 2026).