
Relative Positional Bias in Neural Models

Updated 11 March 2026
  • Relative positional bias is the systematic influence of the relative arrangement of sequence elements on model output, independent of their fixed positions.
  • Empirical studies reveal that this bias can degrade performance by up to 40% on retrieval, QA, and classification tasks in long-context models.
  • Mitigation strategies such as adapter-based fine-tuning, calibrated attention, and refined positional encodings help improve fairness and generalization.

Relative positional bias refers to systematic, content-independent preferences or performance shifts by neural sequence models—most notably Transformers—arising from the relative arrangement or spacing of elements within an input. This phenomenon manifests across modalities, architectures, and task paradigms, impacting both core model inference (retrieval, reasoning, QA, classification) and secondary usage (embedding, judging, ranking). A comprehensive understanding of relative positional bias is critical for both theoretical insight and for the design of architectures and evaluation protocols optimized for long and variable-length inputs.

1. Formal Definitions and Measurement

Relative positional bias is distinguished from absolute positional bias. While absolute positional bias quantifies a model's tendency to favor elements at certain fixed locations (e.g., "beginning," "middle," "end") (Veseli et al., 10 Aug 2025, Mikhail et al., 22 May 2025, Amor et al., 2023), relative positional bias is defined by the functional dependence of model output on the mutual distances or arrangements between items, independent of their fixed coordinates (Tian et al., 2024, Shinoda et al., 2022).

For a set of $n$ relevant items placed within a context of length $L$, the relative distance at spacing level $\ell$ can be quantified as:

\text{Distance}(\ell) = \frac{L}{n-1} \cdot \frac{\ell-1}{N-1}

where $\ell$ indexes spacing levels among $N$ configurations (Tian et al., 2024). Relative positional bias is then measured by the degradation of a performance metric $M(\ell)$ (e.g., recall, accuracy) as $\ell$ increases:

\mathrm{bias}_{\mathrm{rel}}(\ell) = 1 - \frac{M(\ell)}{M(1)}

This definition generalizes to settings involving multi-piece retrieval, QA, classification, and beyond.
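
As a concrete illustration, the sketch below computes the spacing distances and the resulting bias curve directly from the two definitions above (Python; the function names and the recall figures are illustrative, not drawn from the cited work):

```python
import numpy as np

def spacing_distance(level: int, n_items: int, n_levels: int, context_len: int) -> float:
    """Distance(l) = L/(n-1) * (l-1)/(N-1): mean pairwise spacing at level l (1 = densest)."""
    return (context_len / (n_items - 1)) * (level - 1) / (n_levels - 1)

def relative_bias(metric_by_level: np.ndarray) -> np.ndarray:
    """bias_rel(l) = 1 - M(l)/M(1): degradation relative to the densest layout."""
    return 1.0 - metric_by_level / metric_by_level[0]

# Hypothetical recall at N = 5 spacing levels for n = 4 items in an L = 8000-token context.
recall = np.array([0.92, 0.85, 0.74, 0.66, 0.55])
bias = relative_bias(recall)
for lvl in range(1, 6):
    d = spacing_distance(lvl, n_items=4, n_levels=5, context_len=8000)
    print(f"level {lvl}: distance {d:7.1f} tokens, bias_rel = {bias[lvl - 1]:.2f}")
```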

Pairwise and listwise evaluations of relative positional bias commonly employ position-swapping protocols, with metrics such as Position Consistency (PC) and Preference Fairness (PF) (Shi et al., 2024, Labruna et al., 30 Jun 2025). For instance, in binary-choice formats:

\mathrm{PC} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{I}\left[C_i^{(1)} = C_i^{(2)}\right]

|\mathrm{PF}| \in [0, 1]

where $C_i^{(1)}$ and $C_i^{(2)}$ are the model's choices under the original and swapped prompt positions (Labruna et al., 30 Jun 2025).
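
A minimal sketch of these swap-based metrics, assuming choices are recorded by candidate identity: PC follows the definition above, while the signed preference score is one plausible PF-style instantiation, since the exact PF formula is not reproduced here:

```python
from typing import Sequence

def position_consistency(choice_orig: Sequence[str], choice_swapped: Sequence[str]) -> float:
    """PC = (1/N) * sum_i I[C_i^(1) == C_i^(2)]; choices name the candidate itself
    (not the slot), so a consistent judge picks the same candidate in both orders."""
    pairs = list(zip(choice_orig, choice_swapped))
    return sum(a == b for a, b in pairs) / len(pairs)

def position_preference(first_slot_wins: Sequence[bool]) -> float:
    """Signed slot-preference score in [-1, 1] (so its magnitude lies in [0, 1],
    matching the stated range of |PF|): +1 = always picks the first-listed candidate."""
    rate = sum(first_slot_wins) / len(first_slot_wins)
    return 2.0 * rate - 1.0

# Toy run: which candidate the judge picked before and after swapping slots.
orig = ["cand_A", "cand_B", "cand_A", "cand_A"]
swapped = ["cand_A", "cand_A", "cand_B", "cand_A"]
print(position_consistency(orig, swapped))             # 0.5
print(position_preference([True, True, True, False]))  # 0.5
```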

In extractive QA, the relative offset $d$ is computed as the signed distance from the answer span to the nearest context–question overlapping token, capturing the model's exploitation of relative linguistic cues (Shinoda et al., 2022).
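
A rough sketch of how such an offset might be computed over tokenized inputs; the sign convention and the zero-overlap fallback here are assumptions rather than the paper's exact procedure:

```python
from typing import List

def relative_offset(context_tokens: List[str], question_tokens: List[str],
                    answer_start: int) -> int:
    """Signed offset d from the answer span's start to the nearest context token
    that also occurs in the question; positive means the answer follows the cue."""
    q_vocab = set(question_tokens)
    overlap = [i for i, tok in enumerate(context_tokens) if tok in q_vocab]
    if not overlap:
        return 0  # no lexical overlap: fallback convention assumed here
    nearest = min(overlap, key=lambda i: abs(i - answer_start))
    return answer_start - nearest

context = "the capital of france is paris".split()
question = "what is the capital of france ?".split()
print(relative_offset(context, question, answer_start=5))  # 1: "paris" sits right after "is"
```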

2. Empirical Manifestations Across Tasks and Models

Relative positional bias is pervasive, influencing performance in long-context LLMs (Tian et al., 2024), token classification (Amor et al., 2023), extractive QA (Shinoda et al., 2022), financial decision-making (Dimino et al., 25 Aug 2025), document embedding (Schuhmacher et al., 23 Jan 2026), multilingual QA (Mikhail et al., 22 May 2025), and model-judging (Shi et al., 2024). Salient empirical patterns include:

  • Lost-in-the-middle (LiM) effect: A U-shaped absolute bias is prominent as long as the relevant snippet occupies up to half the context window; as the context saturates, primacy bias collapses, recency bias remains, and a pure distance-based gradient emerges (Veseli et al., 10 Aug 2025).
  • Spacing of relevant segments: In multi-piece retrieval, models show sharp performance drops as average pairwise distances increase. Models that are robust to absolute LiM effects still suffer when multiple relevant pieces are distributed (Tian et al., 2024).
  • Pairwise preferences: Binary-choice QA and judger systems display elevated positional bias—favoring a position regardless of content—especially under high uncertainty, as measured by PF and PC (Labruna et al., 30 Jun 2025, Shi et al., 2024).
  • Mechanistic loci: Attribution studies in financial LLMs identify mid-to-late attention layers and specific heads as the "bias engines" responsible for relative position effects (Dimino et al., 25 Aug 2025).
  • Cross-linguistic regularities: Multilingual LLMs may exhibit model-driven preference for late or early positions, but not necessarily in line with the language’s dominant word-order or structure (Mikhail et al., 22 May 2025).

Typical quantitative effects span a performance degradation of up to 40% in recall for sparsely distributed information over long inputs (Tian et al., 2024), accuracy/F1 drops of 3–9% when token-classification targets appear at out-of-distribution positions (Amor et al., 2023), and PF values up to 0.75 in high-uncertainty, human-preference tasks (Labruna et al., 30 Jun 2025).

3. Architectural Origins: Masks, Kernels, and Encodings

The Transformer architecture is intrinsically permutation-invariant; sequential inductive bias is imparted through explicit position-encoding or biasing mechanisms (Angelotti, 2023, 2502.01951). Core architectural contributors include:

  • Causal Masks: Introduce a directional bias (favoring early tokens) due to increasing context aggregation depth; deeper models amplify earliest positions exponentially (2502.01951).
  • Relative Positional Encodings (several of these decay kernels are compared in the sketch after this list):
    • Decay Masks (ALiBi): Linear decay $a_{i,j} = -m(j-i)$, shifting attention to more recent positions. The ALiBi bias generalizes to longer contexts and reduces depth-amplified primacy (2502.01951, Gao, 2024).
    • Rotary PE (RoPE): Imparts a quadratic per-layer decay with relative distance $d$. The decay is milder than ALiBi's and sustains longer-range dependencies, but remains subject to aggregate masking effects (2502.01951, Amor et al., 2023).
    • HyPE: Hyperbolic function-based bias $b_{i,j} = -\tau \sinh(\mu(j-i))$, approximating ALiBi at small $\mu$/$\tau$ and implemented efficiently via concatenated low-dimensional augmentations to the Q/K matrices; fully differentiable and compatible with fused attention kernels (Angelotti, 2023).
    • MEP: Multiple-kernel mixtures (exponential, Gaussian, log-polynomial), with log-mixture bias $b(d) = \log\left[\sum_i \alpha_i K_i(d)\right]$. This produces gentler long-range decay, facilitating length extrapolation (Gao, 2024).
    • FastRPB: Uses a learnable Toeplitz kernel over relative positions, computed as an FFT-based convolution in $O(N \log N)$ time and $O(N)$ memory, decoupled from any particular attention implementation (Zubkov et al., 2022).
  • Embedding Pooling and Attention Calibration: Encoder-pooling schemes using a special token (e.g., <s>) are subject to front-loaded attention distributions, yielding overrepresentation of early segments in downstream representations. Post-hoc basketized attention calibration at inference can nearly eliminate aggregate position bias (Schuhmacher et al., 23 Jan 2026).
  • Data Distribution: Token classification tasks reveal that the imbalance in position of class-positive tokens during pretraining predisposes models to systematic bias, independent of attention design (Amor et al., 2023).
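
For intuition, the sketch below evaluates three of the additive bias kernels above over relative distance $d = i - j \ge 0$; all slopes, scales, and mixture weights are illustrative placeholders, not the papers' trained values:

```python
import numpy as np

def alibi_bias(d: np.ndarray, m: float = 0.25) -> np.ndarray:
    """ALiBi-style linear decay, -m * d, over relative distance d >= 0."""
    return -m * d

def hype_bias(d: np.ndarray, tau: float = 0.1, mu: float = 0.01) -> np.ndarray:
    """HyPE-style hyperbolic bias, -tau * sinh(mu * d): near-linear (~ -tau*mu*d)
    while mu*d is small, then decaying much more steeply at long range."""
    return -tau * np.sinh(mu * d)

def mep_bias(d: np.ndarray) -> np.ndarray:
    """MEP-style log-mixture b(d) = log(sum_i alpha_i K_i(d)); an exponential and a
    Gaussian kernel with made-up weights stand in for the paper's learned mixture."""
    k_exp = np.exp(-0.02 * d)            # exponential kernel
    k_gauss = np.exp(-(d / 200.0) ** 2)  # Gaussian kernel
    return np.log(0.7 * k_exp + 0.3 * k_gauss + 1e-12)

d = np.arange(0, 513, 64, dtype=float)
for name, fn in [("ALiBi", alibi_bias), ("HyPE", hype_bias), ("MEP", mep_bias)]:
    # In attention, these biases are simply added to the pre-softmax logits.
    print(f"{name:>5}: " + " ".join(f"{b:9.2f}" for b in fn(d)))
```

Printing a few distances side by side shows the kernels' differing long-range behavior; in a real model each bias is added head-wise to the pre-softmax attention logits.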

4. Mitigation Strategies and Debiasing Methods

A spectrum of effective debiasing methods is established in the literature:

  • Data-level Augmentation: Random position permutation (Zhang et al., 2024), random position shifting (Amor et al., 2023), context perturbation, and balanced batch sampling expose models to wider positional distributions.
  • Adapter-based and Efficient Fine-tuning: Position-Aware Parameter Efficient Fine-Tuning (PAPEFT) uses a lightweight, learnable location-encoding adapter to inject position-uniformity, sharply reducing fluctuation while maintaining or increasing accuracy (Zhang et al., 2024).
  • Ensemble and Product-of-Experts: For extractive QA, integrating an explicit position-only expert in ensemble with the main model using product or learned-mixin gating can force the model to rely on semantic content and mitigate over-reliance on superficial positional statistics (Shinoda et al., 2022).
  • Head-level Regularization: Attributing and penalizing key attention heads ("bias engines") identified via head ablation or direct logit attribution, especially in high-stakes domains such as finance (Dimino et al., 25 Aug 2025).
  • Inference-time Calibration: Rebalancing token-level attention weights post-hoc (e.g., a basketized equal-mass scheme) without retraining, used effectively for long-document pooling (Schuhmacher et al., 23 Jan 2026); a minimal sketch appears at the end of this section.
  • Prompt and Evaluation Design: Balanced randomization of candidate ordering at inference or evaluation, and aggregation over swapped orders, directly cancels systematic position effects in ranking and judging tasks (Shi et al., 2024, Labruna et al., 30 Jun 2025).
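
As an illustration of the last point, here is a minimal sketch of swap-averaged pairwise judging; the `judge` callable and prompt format are hypothetical stand-ins for an LLM judge, and random tie-breaking is just one possible aggregation rule:

```python
import random

def swap_averaged_verdict(judge, question: str, answer_a: str, answer_b: str) -> str:
    """Query a pairwise judge under both candidate orders so that any systematic
    slot preference cancels; `judge(prompt)` returns 'first' or 'second'."""
    v1 = judge(f"Q: {question}\n1) {answer_a}\n2) {answer_b}\nWhich answer is better?")
    v2 = judge(f"Q: {question}\n1) {answer_b}\n2) {answer_a}\nWhich answer is better?")
    a_wins_order1 = v1 == "first"
    a_wins_order2 = v2 == "second"  # answer A occupies slot 2 after the swap
    if a_wins_order1 and a_wins_order2:
        return "A"
    if not a_wins_order1 and not a_wins_order2:
        return "B"
    return random.choice(["A", "B"])  # order-dependent verdicts: break the tie at random

# A deliberately biased judge that always prefers slot 1 yields a coin flip here,
# rather than a spurious systematic win for whichever answer is listed first.
print(swap_averaged_verdict(lambda prompt: "first", "2+2?", "4", "5"))
```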

Empirically, these interventions lower positional fluctuation by 54–59%, boost accuracy by 57–64%, and mitigate F1 and agreement drop in out-of-distribution or adversarial settings (Zhang et al., 2024, Shinoda et al., 2022, Amor et al., 2023).
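
The inference-time calibration entry above can be sketched as follows: an equal-mass basket scheme over a pooling token's attention weights, where the basket count and renormalization details are assumptions rather than the cited method's exact procedure:

```python
import numpy as np

def basket_calibrated_attention(attn: np.ndarray, n_baskets: int = 8) -> np.ndarray:
    """Rescale a pooling token's attention so each contiguous position basket
    carries equal total mass, flattening aggregate front-loading post hoc."""
    out = attn.astype(float).copy()
    bounds = np.linspace(0, len(out), n_baskets + 1, dtype=int)
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        mass = out[lo:hi].sum()
        if mass > 0:
            out[lo:hi] *= (1.0 / n_baskets) / mass
    return out / out.sum()

# Front-loaded attention over 512 positions, before vs. after calibration.
raw = np.exp(-0.01 * np.arange(512)); raw /= raw.sum()
cal = basket_calibrated_attention(raw)
print(raw[:64].sum(), raw[-64:].sum())  # ~0.47 vs ~0.005: early segments dominate
print(cal[:64].sum(), cal[-64:].sum())  # ~0.125 each: baskets equalized
```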

5. Analysis of Interactions and Contextual Effects

Positional bias is shaped both by architecture and by context/task variables:

  • Uncertainty Amplification: Bias is magnified exponentially as task uncertainty increases, being minimal when a ground-truth answer is clear and maximal in subjective or ambiguous settings (Labruna et al., 30 Jun 2025). Tightly matched answer pairs or highly subjective judgment tasks are especially vulnerable (Shi et al., 2024).
  • Capacity and Scaling: Larger model scale attenuates positional effects, but does not fully eliminate them, suggesting intrinsic biases persist even in high-capacity models (Dimino et al., 25 Aug 2025, Tian et al., 2024).
  • Prompt Framing: Explicit positional guidance (e.g., highlighting the “correct” context) often decreases accuracy and exacerbates bias, while prompt-agnostic or randomized ordering minimizes it (Mikhail et al., 22 May 2025, Shi et al., 2024).
  • Cross-lingual Variability: Different LLMs exhibit top vs. bottom preference as a function of model, with some architectures (e.g., Qwen2.5-7B) favoring late-context positions, challenging the assumption of universal primacy (Mikhail et al., 22 May 2025).
  • Retrieval–Reasoning Cascade: Retrieval performance determines reasoning bias; when relevant facts are successfully retrieved, position-biased reasoning effects vanish (Veseli et al., 10 Aug 2025).

6. Implications for Model Evaluation, Benchmarking, and Design

A rigorous approach to characterization and mitigation of relative positional bias carries several immediate implications:

  • Benchmarking Recommendations: Evaluation should report position-dependent performance as a function of both absolute and relative input length ($L_{\text{rel}}$), include multi-piece relative benchmarks, and assess metrics across varied positions and spacings (Veseli et al., 10 Aug 2025, Tian et al., 2024).
  • Long-Context Generalization: Position-encoding strategies should be explicitly designed for extrapolation, favoring bias kernels or mixtures with slow, smooth decay (e.g., HyPE, MEP) (Angelotti, 2023, Gao, 2024).
  • Judging and Candidate Ranking: Adopting swap-averaging and multi-judge aggregation are critical for fair, reproducible results in model evaluation, especially for ambiguous, low-quality-gap instances (Shi et al., 2024).
  • Embedding Fairness: For tasks relying on document-level pooling, attention calibration to ensure uniform segment representation is necessary for downstream coverage and retrieval equivalence (Schuhmacher et al., 23 Jan 2026).
  • Continual Monitoring: Production deployments in sensitive domains (e.g., financial LLMs) should continually audit bias metrics such as $\Delta_{i,c}$, PC, and PF under evolving inputs and prompt structures (Dimino et al., 25 Aug 2025).

Advanced mitigation may require integrating fairness objectives into pretraining or fine-tuning, and designing architectures capable of data-dependent or hybrid masking.


Relative positional bias is a central, multi-faceted phenomenon in contemporary neural sequence models, with measurable and sometimes severe consequences for performance, generalizability, and fairness. Comprehensively understanding its mathematical foundations, empirical characteristics, architectural origins, and mitigation strategies is essential for both theoretical analysis and the practical deployment of robust long-context, multi-item, and multilingual AI systems (Angelotti, 2023, Tian et al., 2024, 2502.01951, Zhang et al., 2024, Shinoda et al., 2022, Amor et al., 2023, Schuhmacher et al., 23 Jan 2026, Mikhail et al., 22 May 2025, Shi et al., 2024, Labruna et al., 30 Jun 2025, Gao, 2024, Zubkov et al., 2022).
