Positional Encodings and Program Counter

Updated 6 May 2026

Positional encoding is a mechanism that assigns unique position signals to tokens, enabling Transformers to distinguish sequence order.
Empirical studies reveal that causal Transformers can infer positions internally, achieving competitive performance even without explicit encodings.
Dynamic approaches like CoPE use context-dependent gating to adjust position tracking, offering improved abstraction and flexibility.

Positional encoding is a foundational mechanism that lets neural sequence models, especially Transformers, distinguish between tokens arranged in ordered sequences. The challenge arises from the permutation equivariance of self-attention: absent an auxiliary signal, an unmasked Transformer has no structural cue for where a token sits within its context. This deficiency is typically addressed with positional encoding schemes, which either inject absolute positional signals or encode relative relationships. However, recent empirical and theoretical work shows that, under certain architectural and masking choices, Transformers can reconstruct position information on their own, even without explicit encodings, by exploiting intrinsic architectural features and learned dynamics. The resulting mechanisms are analogous to the “program counter” in traditional computation.

1. Traditional and Contemporary Approaches to Positional Encoding

Classical approaches to positional encoding in Transformer architectures fall into absolute and relative categories:

Absolute positional encodings (learned or sinusoidal) assign each token an index-dependent embedding vector $P(j)$, which is added to or concatenated with the token representation. This design ensures that every token receives a unique, deterministic position signature, enabling index-based tasks such as “attend to token $i$” (Golovneva et al., 2024).
Relative positional encodings endow the attention mechanism with information not about absolute positions, but about the offset between tokens. This can encode invariances, such as being able to generalize from “next” to “previous” irrespective of the anchor position, achieved by incorporating $\text{relbias}(i-j)$ within the attention score computation.
ALiBi, RoPE, and other advanced schemes introduce inductive biases tailored for improved extrapolation to longer contexts or for rotational invariance.
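
The two classical families above can be sketched in a few lines of NumPy. This is a minimal illustration, not any paper's reference code: `sinusoidal_encoding` uses the standard sine/cosine construction for $P(j)$, and `relative_bias` is a hypothetical helper that builds an ALiBi-style linear penalty as one possible $\text{relbias}(i-j)$. The base `10000.0` and the `slope` value are assumptions chosen for the example.

```python
import numpy as np

def sinusoidal_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Absolute encoding P(j): one d_model-dimensional vector per position j."""
    assert d_model % 2 == 0, "this sketch assumes an even embedding size"
    positions = np.arange(seq_len)[:, None]                        # (seq_len, 1)
    freqs = 1.0 / (10000.0 ** (np.arange(0, d_model, 2) / d_model))
    angles = positions * freqs                                     # (seq_len, d_model/2)
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angles)   # even channels
    enc[:, 1::2] = np.cos(angles)   # odd channels
    return enc

def relative_bias(seq_len: int, slope: float = 0.5) -> np.ndarray:
    """ALiBi-style relbias(i - j) = -slope * (i - j) for j <= i, -inf for j > i."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    bias = -slope * (i - j).astype(float)
    bias[j > i] = -np.inf           # causal mask: no attention to the future
    return bias

P = sinusoidal_encoding(seq_len=8, d_model=16)   # added to token embeddings
B = relative_bias(seq_len=4)                     # added to attention scores
```

The absolute table `P` gives every index its own vector, whereas `B` depends only on the offset $i-j$, so the same bias applies wherever the query sits in the sequence.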

Despite the dominance of explicit positional encoding schemes, recent works have systematically interrogated their necessity, showing that large causal Transformers trained without any overt positional encodings (“NoPos” models) track token positions internally with surprising accuracy (Haviv et al., 2022).

2. Emergence of Position Information in Causal Transformers

Empirical studies have demonstrated that causal (decoder-only, left-to-right) Transformers learn to internalize position information in the total absence of explicit encoding features.

For instance, the “NoPos” variant (i.e., a causal Transformer without any position-dependent input or bias) achieves near-parity in perplexity with position-aware baselines on both WikiText-103 and The Pile. Representative results for a 1.3B-parameter model on The Pile show: NoPos perplexity 13.10, Learned 13.05, Sinusoidal 12.93, and ALiBi 12.51. Moreover, scaling model size narrows the gaps further, and extending context length up to 2048 tokens yields persistent gaps of ≲0.2 in perplexity between NoPos and explicit methods (except for a modest ALiBi advantage at extreme lengths) (Haviv et al., 2022).

Layerwise probing with a frozen, fully trained model reveals that absolute position can be decoded from intermediate hidden states: after only four layers, mean absolute error in predicted positions falls below 50 tokens (in a 0–1023 range); by six layers, the error reaches ≈10, matching models with explicit encodings. Peak signal occurs mid-network (Layers 8–12), waning by the output.

These findings indicate that position information is reconstructed and maintained internally, even when never explicitly presented, solely by virtue of the causal mask and scale of training (Haviv et al., 2022).

3. Mechanisms Underlying Implicit Position Tracking (“Program Counter” Effect)

The primary driver of implicit position tracking in causal Transformers is the combination of masking and the network’s representational capacity:

In standard causal self-attention, the triangular mask $M_{ij}$ enforces that token $i$ can only attend to positions $j \leq i$, i.e., each position “sees” only itself and its predecessors.
This induces a gradient in available receptive field: token $i$’s representation is dependent on exactly $i$ predecessors, and attention distributions can, in principle, recognize this feature (Haviv et al., 2022).
By summing signals (such as a constant value across the unmasked region in a given attention head), a model can propagate a stable incrementor—or “counter”—that tracks how many tokens have been processed. Subsequent layers can normalize or refine this counter into a robust position signal, analogous to the program counter (PC) in classic von Neumann architectures (where PC increments by one per instruction fetch) (Haviv et al., 2022).
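
A toy version of this counter can be written down directly. The sketch below is an illustrative construction, not the mechanism a trained model necessarily learns: it assumes a head whose attention scores are all equal, and a value channel that is 1 on the first token and 0 elsewhere (the hypothetical `bos_flag`). Because softmax attention averages rather than sums, position $i$ then receives $1/(i+1)$, from which the count can be read off.

```python
import numpy as np

def causal_uniform_attention(values: np.ndarray) -> np.ndarray:
    """Causal attention with all-equal scores: row i averages values[0..i]."""
    n = values.shape[0]
    scores = np.zeros((n, n))
    scores[np.triu_indices(n, k=1)] = -np.inf      # mask out future positions
    weights = np.exp(scores)                       # exp(-inf) == 0
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over the visible prefix
    return weights @ values

n = 6
bos_flag = np.zeros((n, 1))
bos_flag[0, 0] = 1.0                               # marker carried only by the first token

frac = causal_uniform_attention(bos_flag)[:, 0]    # position i sees 1 / (i + 1)
recovered = 1.0 / frac                             # number of tokens seen so far
```

Here `recovered` grows by one per token, which is the sense in which the causal mask alone can supply a program-counter-like signal for later layers to refine.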

Bidirectional (masked language model) architectures, lacking directional masking, fail to develop position tracking: without the asymmetry, tokens have no structural cue for sequence order, resulting in catastrophic perplexity when no positional signal is provided ($>140$, versus $\sim 4$ for the same architecture with positional encodings) (Haviv et al., 2022).

4. Geometric and Similarity-Based Position Induction

Recent theoretical and experimental work reveals that the geometry of representations in random and trained causal Transformers inherently encodes position information by adjacency of embedding similarity (Zuo et al., 2024):

In the first self-attention layer, the output vector for token $i$ is a convex combination of the first $i$ input embeddings, while token $i+1$’s output blends in the subsequent embedding $x_{i+1}$. This construction ensures that adjacent tokens yield output vectors with higher pairwise similarity than tokens separated by larger distances.
Cosine similarity matrices $S_{ij}$ exhibit a strong banded structure: nearest neighbors along the sequence have maximal similarity, decaying monotonically with increasing separation $|i-j|$.
This “adjacency” effect is universally observed, appearing immediately after random initialization and persisting after training, across all tested layer counts and embedding dimensionalities. Rowwise adjacency probability approaches 0.97–0.99 in early layers and remains high in deeper layers.
In effect, absolute position can be inferred by sorting a set of representation vectors according to their proximity under cosine similarity, reconstructing the original ordering without access to explicit position signals (Zuo et al., 2024).
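
The banded similarity structure is easy to reproduce in a toy setting. The sketch below assumes random Gaussian input embeddings and exactly uniform causal attention (a simplification of what a randomly initialized layer does), so each output is the running mean of the inputs seen so far. The sizes `n` and `d`, the seed, and the name `adjacent_rate` are choices made for this example, not quantities taken from Zuo et al.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 64, 256                                   # sequence length, embedding size
x = rng.standard_normal((n, d))                  # random input embeddings

# Uniform causal attention: output i is the mean of inputs 0..i,
# i.e. a convex combination of the visible prefix.
out = np.cumsum(x, axis=0) / np.arange(1, n + 1)[:, None]

# Cosine similarity between every pair of output vectors.
unit = out / np.linalg.norm(out, axis=1, keepdims=True)
sim = unit @ unit.T

# Banded structure: average similarity at offset 1 versus offset 8.
near = np.mean(np.diagonal(sim, offset=1))
far = np.mean(np.diagonal(sim, offset=8))

# For each row, check whether its most similar *other* row is a sequence neighbor.
masked = sim.copy()
np.fill_diagonal(masked, -np.inf)
nearest = masked.argmax(axis=1)
adjacent_rate = np.mean(np.abs(nearest - np.arange(n)) == 1)
```

In this idealized setup `near` exceeds `far` and `adjacent_rate` should come out close to 1, so chaining each vector to its nearest neighbors recovers the sequence order without any positional input.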

This intrinsic banded geometry functions as an implicit program counter, where the “distance to the diagonal” in similarity space corresponds to the distance along the sequence.

5. Contextual and Dynamic Generalizations: Contextual Position Encoding (CoPE)

While most positional encoding schemes unconditionally increment by one per token or per position, Contextual Position Encoding (CoPE) introduces a data-dependent, context-conditioned position counter (Golovneva et al., 2024):

CoPE defines a gating variable $g_{ij} = \sigma(q_i^\top k_j)$, deciding for each position whether a preceding token should increment the counter.
The effective “distance” $p_{ij}$ from $i$ to $j$ is computed by summing over selected $g_{ik}$ from $k=j$ to $k=i$, yielding both integer and fractional counters:

$p_{ij} = \sum_{k=j}^{i} g_{ik}$

Position embeddings are interpolated as $e[p_{ij}] = (p_{ij} - \lfloor p_{ij} \rfloor)\, e[\lceil p_{ij} \rceil] + (1 - p_{ij} + \lfloor p_{ij} \rfloor)\, e[\lfloor p_{ij} \rfloor]$, and these are injected into the attention computation:

$a_{ij} = \text{Softmax}\left(q_i^\top (k_j + e[p_{ij}])\right)$

In streaming form, the update adds one gate value in $(0, 1)$ to the running count at each step, which mirrors a dynamic program counter that increments conditionally on context and history.
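
A single-head NumPy sketch of this computation follows. It assumes the published CoPE formulation: sigmoid gates on query-key products, positions obtained by summing gates from the attended token up to the current one, and linear interpolation between integer position embeddings. The function name `cope_attention`, the clipping at `p_max`, and the random inputs are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def cope_attention(q: np.ndarray, k: np.ndarray, pos_emb: np.ndarray):
    """Causal attention weights with contextual positions.

    q, k:     (n, d) queries and keys for one head.
    pos_emb:  (p_max + 1, d), one embedding per integer position 0..p_max.
    Returns (weights, pos), both of shape (n, n).
    """
    n = q.shape[0]
    p_max = pos_emb.shape[0] - 1
    causal = np.tril(np.ones((n, n), dtype=bool))          # True where j <= i

    logits = q @ k.T                                       # q_i . k_j
    gates = np.where(causal, 1.0 / (1.0 + np.exp(-logits)), 0.0)

    # p_ij = sum of gates g_it for t = j..i (reverse cumulative sum per row).
    pos = np.cumsum(gates[:, ::-1], axis=1)[:, ::-1]
    pos = np.clip(pos, 0.0, p_max)                         # cap at the largest embedding

    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, p_max)
    frac = pos - lo

    # Interpolate q_i . e[p] between the two neighboring integer positions.
    pos_logits = q @ pos_emb.T                             # (n, p_max + 1)
    rows = np.arange(n)[:, None]
    interp = (1.0 - frac) * pos_logits[rows, lo] + frac * pos_logits[rows, hi]

    scores = np.where(causal, logits + interp, -np.inf)
    scores -= scores.max(axis=1, keepdims=True)            # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True), pos

rng = np.random.default_rng(0)
n, d, p_max = 5, 8, 4
q = rng.standard_normal((n, d))
k = rng.standard_normal((n, d))
e = rng.standard_normal((p_max + 1, d))
weights, pos = cope_attention(q, k, e)
```

If every gate were exactly 1, `pos[i, j]` would reduce to the token offset $i - j + 1$; gates below 1 let the counter skip tokens, which is how the same machinery can count words or sentences instead.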

CoPE permits position tracking not only at the token level but also at higher semantic units: the model can learn to count words, sentences, or structure-defined regions by setting the gate to increment only at appropriate points. Empirically, CoPE demonstrates improved capability on selective copy, counting, Flip-Flop, and generalization tasks compared to classical absolute and relative schemes, and matches or outperforms them on language and code modeling tasks.

6. Comparative Performance of Implicit and Explicit Position Encodings

The table below summarizes representative results for variant models on core tasks and datasets, with and without explicit position encoding:

Model Type	Dataset	Perplexity / Error Rate	Notes
NoPos (causal)	WikiText-103	20.97	Comparable to learned, sinusoidal, ALiBi
NoPos (causal)	The Pile	13.10	Performance gap shrinks with larger models
Bidirectional	The Pile	$>140$ (w/o PE)	Fails catastrophically without PE
Absolute PE	Flip-Flop	6.8% / 21.7% (ID/OOD)	Fails OOD generalization
CoPE	Flip-Flop	0.0% / 4.9% (ID/OOD)	Robust to length/structure shift
Absolute PE	WikiText-103	24.87	CoPE outperforms (23.46)
Relative PE	Counting vars	1.1% (1 var), 22.4% (5 vars)	CoPE lower error, better OOD generalization

Causal models remain robust without explicit encoding, whereas bidirectional models require a positional signal. Dynamic encodings like CoPE generalize best on challenging counting and abstraction tasks (Haviv et al., 2022; Golovneva et al., 2024).

7. Theoretical and Practical Implications

The convergence of empirical, geometric, and algorithmic perspectives supports the following synthesis:

Positional information emerges robustly and consistently in causal Transformers due to the inductive bias of sequential masking, large dimensionality, and recurrent averaging of contextual information.
The mask-induced program counter effect means that explicit positional encoding is helpful but not strictly necessary in many left-to-right modeling settings.
Practical implications include the possibility of training large GPT-style models without overt position encodings, with little or no loss of performance; this may streamline model design or reduce reliance on special initialization (Haviv et al., 2022, Zuo et al., 2024).
Dynamic schemes (such as CoPE) extend this paradigm, equipping models with program counters that operate not only at token granularity but across contextually defined abstractions, supporting flexible selection of what counts as “next” (Golovneva et al., 2024).

A plausible implication is that principled program counter mechanisms—whether implicit (via mask and geometry) or explicit (via contextualized gating)—are fundamental to endowing sequence models with index sensitivity, generalization ability, and abstraction power. These findings inform the ongoing evolution of Transformer architectures, with broad relevance to long-context processing, structured reasoning, and representation learning.