Bilevel Positional Encoding (BiPE)
- BiPE is a hierarchical positional encoding framework that disentangles local (intra-segment) and global (inter-segment) positions in sequence data.
- It combines absolute positional encoding within segments (capturing short-range dependencies) with relative positional encoding across segments (capturing longer-range, inter-segment relationships).
- Empirical results show BiPE significantly improves extrapolation performance on tasks such as arithmetic reasoning and language modeling while maintaining in-distribution performance.
Bilevel Positional Encoding (BiPE) is a positional encoding framework designed to enhance Transformer-based models’ ability to extrapolate to much longer sequences than those encountered in training. By leveraging the intrinsic modular segmentation present in natural and structured sequences—such as sentences in text or steps in mathematical proofs—BiPE disentangles local, intra-segment position (“where am I within my segment?”) from global, inter-segment position (“which segment am I in, and how far are two segments apart?”). This separation is operationalized through a combination of absolute positional encoding (APE) within segments and relative positional encoding (RPE) across segments, yielding a representation that aligns with the hierarchical structure of sequence data and substantially improves length extrapolation, without compromising within-distribution performance (He et al., 2024).
1. Motivation and Conceptual Foundation
Standard APE schemes (e.g., Vaswani et al., 2017) assign a unique learned or parameterized vector to each token index $i$ up to the training length $L_{\text{train}}$. However, this approach does not enable generalization to positions $i > L_{\text{train}}$, as those indices are never observed during training. RPE methods, such as RoPE, ALiBi, and T5-RPE, encode only the distance between tokens, thereby mitigating but not eliminating extrapolation failures. These limitations often necessitate ad hoc fine-tuning interventions (e.g., YaRN, positional interpolation) and can result in collapsed attention patterns at large distances.
Empirical analyses of linguistic and structural data (e.g., PG-19) reveal that sequence segmentation (into sentences, paragraphs, or code blocks) persists across scales: the distribution of tokens per segment remains stable as document length increases, while the number of segments grows linearly. This suggests that "length extrapolation" in practice equates to extrapolating the number of segments rather than the intra-segment position. BiPE therefore posits that intra-segment position can be encoded absolutely (a local anchor for syntax and short-range dependencies), while inter-segment relationships are better captured via RPE, providing a natural match to hierarchical sequence organization.
2. Mathematical Construction
Given a sequence $x_1, \dots, x_T$ partitioned into consecutive segments $S_1, \dots, S_K$, with token $x_t$ belonging to segment $S_{k(t)}$, each token is assigned:
- $k(t)$: the segment (inter-segment) index,
- $p(t)$: its local, intra-segment position within $S_{k(t)}$.
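The indexing above can be computed directly from the token stream once segment boundaries are known (Section 4 notes that boundaries are full stops and newlines). The function below is an illustrative sketch, not the authors' implementation; whether a separator token belongs to the segment it closes is treated here as a design choice.

```python
# Minimal sketch: derive the BiPE indices k(t) (segment) and p(t) (intra-segment
# position) from a token sequence, assuming separator tokens mark boundaries.
def bipe_positions(tokens, separators=(".", "\n")):
    """Return (segment_index, intra_position) lists, one entry per token."""
    seg_ids, intra_pos = [], []
    k, p = 0, 0
    for tok in tokens:
        seg_ids.append(k)
        intra_pos.append(p)
        if tok in separators:   # separator closes the current segment ...
            k, p = k + 1, 0     # ... and the next token starts a new one
        else:
            p += 1
    return seg_ids, intra_pos

tokens = ["The", "cat", "sat", ".", "It", "purred", "."]
print(bipe_positions(tokens))
# -> ([0, 0, 0, 0, 1, 1, 1], [0, 1, 2, 3, 0, 1, 2])
```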
2.1 Intra-Segment Encoding
For the local position $p(t)$ (bounded by a maximum segment length $L_{\text{seg}}$), a standard absolute positional encoding is used, such as the sinusoidal encoding $\mathrm{PE}(p, 2j) = \sin\!\big(p / 10000^{2j/d}\big)$, $\mathrm{PE}(p, 2j+1) = \cos\!\big(p / 10000^{2j/d}\big)$ with model dimension $d$, or a learned embedding table $E_{\text{intra}} \in \mathbb{R}^{L_{\text{seg}} \times d}$.
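A minimal NumPy sketch of this intra-segment table (sinusoidal variant; `max_seg_len` and `d` are illustrative values):

```python
# Sketch: sinusoidal APE table indexed by the local position p(t)
# rather than the absolute token index.
import numpy as np

def sinusoidal_table(max_seg_len: int, d: int) -> np.ndarray:
    """Return a (max_seg_len, d) table; d is assumed even."""
    pos = np.arange(max_seg_len)[:, None]          # (L_seg, 1)
    idx = np.arange(0, d, 2)[None, :]              # (1, d/2)
    angles = pos / (10000.0 ** (idx / d))
    table = np.zeros((max_seg_len, d))
    table[:, 0::2] = np.sin(angles)
    table[:, 1::2] = np.cos(angles)
    return table

# Intra-segment encodings for the example positions p(t) = [0, 1, 2, 3, 0, 1, 2]:
intra_pe = sinusoidal_table(max_seg_len=512, d=64)[[0, 1, 2, 3, 0, 1, 2]]
```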
2.2 Inter-Segment Encoding
The segment index $k(t)$ is encoded with a relative positional encoding:
- RoPE: Rotates query/key representations by an angle proportional to the segment index. For a query $q_m$ and key $k_n$ of tokens in segments $k(m)$ and $k(n)$, the rotated vectors are $R_{\Theta,k(m)} q_m$ and $R_{\Theta,k(n)} k_n$, resulting in attention scores that depend only on the segment offset: $\langle R_{\Theta,k(m)} q_m,\, R_{\Theta,k(n)} k_n \rangle = q_m^{\top} R_{\Theta,\,k(n)-k(m)} k_n$.
- ALiBi: Adds a scalar bias proportional to the segment distance, $-m_h\,|k(i)-k(j)|$ with per-head slope $m_h$, to the query-key dot products (a minimal sketch of this variant follows the list).
- T5-RPE: Learns a scalar bias per head for each (bucketed) segment-index difference, added to the attention logits.
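As a concrete illustration of the inter-segment component, the sketch below implements the ALiBi-style variant over segment indices. The slope schedule is the standard ALiBi geometric schedule, and the `scale` argument stands in for the slope rescaling mentioned in Section 4; both are assumptions rather than the paper's exact configuration.

```python
# Sketch: ALiBi-style bias computed over segment-index differences k(i) - k(j)
# instead of token offsets; the returned bias is added to the attention logits.
import torch

def segment_alibi_bias(seg_ids: torch.Tensor, num_heads: int, scale: float = 1.0) -> torch.Tensor:
    """Return an additive attention bias of shape (num_heads, T, T)."""
    # Standard ALiBi per-head slopes: 2^(-8h/num_heads), h = 1..num_heads.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    dist = (seg_ids[:, None] - seg_ids[None, :]).abs().float()   # |k(i) - k(j)|
    return -scale * slopes[:, None, None] * dist[None, :, :]

seg_ids = torch.tensor([0, 0, 0, 0, 1, 1, 1])
bias = segment_alibi_bias(seg_ids, num_heads=8)   # (8, 7, 7); zero within a segment
```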
2.3 Combined Encoding
The overall input representation at position $t$ is $h_t^{(0)} = E_{\text{tok}}(x_t) + E_{\text{intra}}\big(p(t)\big)$, with the inter-segment rotations or biases applied during self-attention using the segment indices. Equivalently, the overall positional signal is the pair $\big(E_{\text{intra}}(p(t)),\ \mathrm{RPE}(k(i)-k(j))\big)$, with additive (sum or concatenation) fusion at the embedding stage for the intra-segment component and application within attention for the inter-segment component (He et al., 2024).
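Putting the two levels together, a minimal single-head attention sketch (the tensor names and the ALiBi-style bias are assumptions carried over from the sketches above):

```python
# Sketch: single-head attention with BiPE. The intra-segment APE enters at the
# embedding stage; the inter-segment signal enters as a bias on the logits.
import torch
import torch.nn.functional as F

def bipe_attention(tok_emb, intra_pe, seg_bias, Wq, Wk, Wv):
    """tok_emb, intra_pe: (T, d); seg_bias: (T, T) for this head; Wq/Wk/Wv: (d, d)."""
    h = tok_emb + intra_pe                                  # intra-segment APE
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    logits = (q @ k.T) / q.shape[-1] ** 0.5 + seg_bias      # inter-segment RPE bias
    return F.softmax(logits, dim=-1) @ v
```

Only the inputs to the attention logits change; as noted in Section 4, the rest of the Transformer block is untouched.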
3. Theoretical Analysis
The analysis demonstrates that BiPE yields superior representational and parameter efficiency in hierarchical settings. The Bi-level Non-deterministic Finite Automaton (Bi-NFA) framework models input with a two-level structure: local state transitions within segments and jumps to new segment blocks via a special separator symbol.
- Flat APE (Theorem 3.1): Simulating the flattened NFA, whose state set is the product $Q_{\text{intra}} \times Q_{\text{inter}}$, requires an embedding dimension scaling with $|Q_{\text{intra}}| \cdot |Q_{\text{inter}}|$ for a Transformer with flat APE; with fewer dimensions, distinct transition functions would collide in the embedding space, losing expressivity.
- BiPE for Bi-NFA (Theorem 3.2): BiPE can represent a Bi-NFA with intra-segment state set $Q_{\text{intra}}$ and inter-segment state set $Q_{\text{inter}}$ using an embedding dimension scaling with $|Q_{\text{intra}}| + |Q_{\text{inter}}|$, exploiting local composition (intra-segment) and global transitions (inter-segment) hierarchically. Since typically $|Q_{\text{intra}}| + |Q_{\text{inter}}| \ll |Q_{\text{intra}}| \cdot |Q_{\text{inter}}|$, this can be much more efficient than a flat embedding approach.
This analysis suggests BiPE’s hierarchical decomposition confers substantive sample and parameter efficiency when representing structured, segmented sequence data.
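As a hypothetical numerical illustration of this counting argument (the state-set sizes below are made up, and the product-versus-sum comparison follows the reading above):

```latex
% Hypothetical sizes: |Q_intra| = 10 local states, |Q_inter| = 20 segment-level states.
\[
\underbrace{|Q_{\mathrm{intra}}| \cdot |Q_{\mathrm{inter}}|}_{\text{flat APE}} = 10 \times 20 = 200
\qquad \text{vs.} \qquad
\underbrace{|Q_{\mathrm{intra}}| + |Q_{\mathrm{inter}}|}_{\text{BiPE}} = 10 + 20 = 30 .
\]
```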
4. Integration in Transformer Architectures
BiPE is designed for compatibility with standard Transformer architectures:
- Input Embeddings: Standard token embedding plus the intra-segment encoding $E_{\text{intra}}(p(t))$; no change to the embedding layer except this addition.
- Attention Mechanism:
- With RoPE, rotary matrices are applied per token based on its segment index, i.e., $R_{\Theta,k(t)}$ is applied to queries and keys (see the sketch after this list).
- With ALiBi, a segment-level distance bias is added to attention scores.
- All other model components (layer normalizations, MLPs, residuals) remain unchanged.
- Hyperparameters:
- Segmentation: Detected through natural boundaries—full stops (“.”) and newlines (“\n”).
- Intra-segment length $L_{\text{seg}}$: Bounded by the longest expected segment and far smaller than the full context length, requiring only a small lookup or learned table.
- Embedding dimension $d$: Same as the base model's hidden size.
- Attention head biases: For ALiBi, the per-head slopes $m_h$ are scaled to amplify the segment-level bias in BiPE-ALiBi.
- No additional normalization layers or structural changes are introduced (He et al., 2024).
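The sketch below illustrates the BiPE-RoPE integration point referenced above: the standard rotary transform applied to queries and keys, but indexed by the segment id $k(t)$ instead of the absolute token position (function and variable names are illustrative).

```python
# Sketch: rotary position embedding driven by segment indices (BiPE-RoPE style).
import torch

def rope_by_segment(x: torch.Tensor, seg_ids: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate x of shape (T, d) by angles set by seg_ids of shape (T,); d assumed even."""
    d = x.shape[-1]
    inv_freq = base ** (-torch.arange(0, d, 2).float() / d)     # (d/2,)
    angles = seg_ids.float()[:, None] * inv_freq[None, :]       # (T, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# q_rot, k_rot = rope_by_segment(q, seg_ids), rope_by_segment(k, seg_ids)
```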
5. Empirical Results
BiPE’s effectiveness is demonstrated on an array of language and arithmetic modeling benchmarks, with particular focus on length extrapolation performance.
| Task/Setting | Baseline Perf. | BiPE-ALiBi | BiPE-RoPE |
|---|---|---|---|
| Arithmetic (hidden=48) | <70% (sin, RoPE, ALiBi) | 97% | 95% |
| LM Extrapolation (PG-19, L=8192) | RoPE: 158 ppl; ALiBi: 28.6 ppl | 25.2 ppl | ≈19.7 ppl (at L=4096) |
| YaRN fine-tuning (L=20K; RoPE collapses at ~11K) | — | — | 10 ppl |
| SCROLLS (avg. score) | RoPE: 18.38 | 18.34 | 22.36 (+3.98) |
| SCROLLS + YaRN | 23.01 (RoPE+YaRN) | — | 24.53 |
- Arithmetic (chain-of-thought): BiPE variants significantly outperform the alternatives (e.g., 97% for BiPE-ALiBi vs. <70% for standard ALiBi).
- Language Modeling Length Extrapolation: BiPE-ALiBi and BiPE-RoPE reduce perplexity markedly at extreme lengths, maintaining performance where RoPE collapses.
- SCROLLS Benchmark: BiPE-RoPE achieves a +3.98 average score improvement over RoPE; with YaRN finetuning, BiPE further raises scores.
- In-Distribution NLP (RACE, WinoGrande, TruthfulQA, PIQA, HellaSwag, MMLU): BiPE is on par with standard APE/RPE baselines.
- Ablation Studies: Removing intra- or inter-segment encoding sharply degrades extrapolation; fixed-length windows as segments consistently underperform natural boundaries.
These results underscore the practical advantages of BiPE in both zero-shot and fine-tuning-driven long-context settings (He et al., 2024).
6. Limitations and Practical Considerations
BiPE entails minimal additional computational or memory overhead beyond a lookup table for the intra-segment encoding and the standard RPE operations. Sensitivity to segment definition is critical: relying on natural sequence boundaries (e.g., sentences) yields robust performance, whereas arbitrary fixed-length segmentation is suboptimal. The required memory for the intra-segment table is small, on the order of $L_{\text{seg}} \times d$ entries.
Extensions such as hierarchical positional encoding with more than two granularity levels (e.g., sentence → paragraph → document) are suggested, as is the need for automatic segment detection (especially in domains like time-series or genomics that lack clear boundaries).
A plausible implication is that disentangling intra- and inter-segment positional structure represents a principled inductive bias for hierarchical data, with theoretical and empirical support for efficient, scalable modeling of long and modular sequences. Further research may explore automatic, data-driven segmentation and additional hierarchy levels (He et al., 2024).