Trajectory Geometry of Transformers

Updated 9 June 2026

The paper reveals that transformer representations form dynamic trajectories through high-dimensional geometry, reflecting evolving semantic and computational processes.
It employs spherical geometry, curvature metrics, and low-dimensional analysis to quantify non-linear paths and disambiguation across layers.
Empirical findings demonstrate distinct layer-wise phase transitions and semantic convergence, offering actionable insights for model interpretability.

Trajectory geometry of transformer representations refers to the study of how the sequence of hidden-state vectors produced by a transformer model, as a function of depth (layer index) or sequence position, evolves through high-dimensional embedding space. This framework draws on both formal geometry and dynamical-systems theory to analyze the internal computations underlying model behavior. It reveals that trajectories in representation space encode not only static semantic features, but also the unfolding computational processes by which transformers synthesize, refine, and ultimately contract information en route to output predictions. Contemporary research leverages a diverse toolkit—spanning Euclidean and spherical geometry, population dynamics, intrinsic-dimension analysis, and alignment with human and predictive baselines—to rigorously characterize these trajectories across architectures, tasks, and modalities.

1. Geometric Frameworks for Layerwise Transformer Trajectories

A transformer model with $L$ layers and hidden dimension $d$ represents a sequence (or prompt) via a series of hidden-state matrices $H^{(l)} \in \mathbb{R}^{n \times d}$, where $l = 0, \dots, L$ and $n$ is the sequence length. By extracting either token-wise or mean-pooled representations at each $l$, one obtains a discrete trajectory

$\tau = (h^{(0)}, h^{(1)}, \dots, h^{(L)}) \in (\mathbb{R}^d )^{L+1}$

which traces a path through representation space as the forward pass unfolds. Several geometric approaches offer complementary perspectives:

Spherical Geometry: Layer normalization projects activations onto a fixed-radius hypersphere, so token vectors evolve along the sphere's surface; attention and MLP updates correspond to "rotations" and translations on this manifold (Molina, 2023).
Effective Metric and Curvature: Queries and keys induce a model-specific metric tensor, and attention acts as a discrete connection implementing parallel transport. Trajectories thus reflect nontrivial curvature in the induced geometric structure (Sipio et al., 4 Nov 2025).
Low-dimensional Predictive Geometry: In tasks such as grid-world prediction, transformer activations occupy manifolds aligned with task-specific "sufficient statistics," often in low-dimensional subspaces (Brenner et al., 17 Mar 2026).
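The trajectory construction described at the start of this section can be sketched as follows. This is a minimal illustration, not any cited paper's pipeline: the function names `mean_pooled_trajectory` and `token_trajectory` are hypothetical, and random arrays stand in for real model activations.

```python
import numpy as np

def mean_pooled_trajectory(hidden_states):
    """Average each H^(l) (shape n x d) over tokens, giving an (L+1) x d trajectory."""
    return np.stack([np.asarray(H).mean(axis=0) for H in hidden_states])

def token_trajectory(hidden_states, position):
    """Follow a single token position through every layer, giving an (L+1) x d trajectory."""
    return np.stack([np.asarray(H)[position] for H in hidden_states])

# Synthetic stand-in for activations: L = 4 layers plus the input layer, n = 5 tokens, d = 8.
rng = np.random.default_rng(0)
hidden_states = [rng.normal(size=(5, 8)) for _ in range(5)]

tau_pooled = mean_pooled_trajectory(hidden_states)
tau_token = token_trajectory(hidden_states, position=2)
print(tau_pooled.shape, tau_token.shape)
```

Both functions return one row per layer, so the same downstream metrics can be applied to either the population-level or the token-level path.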

2. Metrics for Trajectory Geometry

Trajectory geometry is quantitatively characterized by several metrics, computed on the paths traced by representations:

Metric	Definition/Computation (per Pandey et al., 8 Jun 2026)	Interpretation
Trajectory Length	$L(\tau) = \sum_{l=0}^{L-1} \|h^{(l+1)} - h^{(l)}\|_2$	Total path traversed in representation space
Trajectory Curvature	For $v^{(l)} = h^{(l)} - h^{(l-1)}$, $\kappa^{(l)} = \arccos[(v^{(l)} \cdot v^{(l+1)}) / (\|v^{(l)}\|_2 \|v^{(l+1)}\|_2)]$	Local or mean turning angle; captures path bends
Length-to-Chord Ratio	$L(\tau) / \|h^{(L)} - h^{(0)}\|_2$	Deviation from geodesic; indicator of global curvature (Sipio et al., 4 Nov 2025)
Semantic Convergence Index (CI)	[formula missing in source]	How category clusters contract/expand with depth
Layerwise Cosine Similarity	$(h^{(l)} \cdot h^{(l+1)}) / (\|h^{(l)}\|_2 \|h^{(l+1)}\|_2)$	Monitors abrupt directional changes
Representational Stability	[formula missing in source]	Sensitivity to surface-form/lexical variation

These metrics, systematically analyzed for all layers, enable probe-free mechanistic insight into the model's internal sequence of computations (Pandey et al., 8 Jun 2026, Sipio et al., 4 Nov 2025). For token-level analysis, trajectory geometry is measured per input word; for population-level analysis, mean-pooling or other aggregation is employed.
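The path-based metrics above can be computed directly from a trajectory array. The sketch below is illustrative only: the function names are hypothetical, the length-to-chord ratio and layerwise cosine similarity follow the usual reading of those names (path length over endpoint distance, and cosine between consecutive hidden states) rather than a confirmed definition from the cited work, and a small `eps` guards against division by zero.

```python
import numpy as np

def trajectory_length(tau):
    """Sum of Euclidean step sizes between consecutive layers."""
    return float(np.linalg.norm(np.diff(tau, axis=0), axis=1).sum())

def turning_angles(tau, eps=1e-12):
    """Angle (radians) between consecutive step vectors v^(l) and v^(l+1)."""
    v = np.diff(tau, axis=0)
    a, b = v[:-1], v[1:]
    cos = (a * b).sum(axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps)
    return np.arccos(np.clip(cos, -1.0, 1.0))

def length_to_chord_ratio(tau, eps=1e-12):
    """Path length divided by the straight-line distance from first to last layer."""
    chord = np.linalg.norm(tau[-1] - tau[0])
    return trajectory_length(tau) / (chord + eps)

def layerwise_cosine(tau, eps=1e-12):
    """Cosine similarity between hidden states at consecutive layers."""
    a, b = tau[:-1], tau[1:]
    return (a * b).sum(axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps)

# Toy trajectory with a single right-angle bend.
tau = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
print(trajectory_length(tau))      # two unit steps
print(turning_angles(tau))         # one 90-degree turn
print(length_to_chord_ratio(tau))  # path length 2 over chord sqrt(2)
```

A ratio of exactly 1 corresponds to a straight path in Euclidean space; larger values indicate a more circuitous route between the first and last layers.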

3. Empirical Findings Across Domains and Models

Trajectory geometry exhibits robust, domain-general signatures across architectures and tasks:

Three-phase computation: In causal transformers, trajectory geometry exposes a three-phase processing structure: early encoding (low curvature, low category convergence), mid-layer elaboration (peak curvature, semantic convergence, and bifurcation for ambiguity resolution), and late-stage output preparation (stabilization ahead of readout) (Pandey et al., 8 Jun 2026, Valeriani et al., 2023).
Curvature encodes task complexity: Reasoning and analogy prompts traverse trajectories of higher mean curvature (turning angle, in radians) than simple lexical variations. The effect holds across tasks and architectures, reflecting greater inferential or integration demands (Pandey et al., 8 Jun 2026).
Semantic "attractors" and coherence: Semantically related prompts converge sharply in mid-to-late layers, where CI peaks, forming attractor-like geometries. This contraction is absent under random labels or ablated controls, indicating that semantic consistency is a learned attribute (Pandey et al., 8 Jun 2026).
Ambiguity bifurcation: Ambiguous words in context (e.g., “bank” in “river bank” vs. “financial bank”) initially follow similar trajectories, then diverge rapidly and are clearly separated in representation space by the final layer (Pandey et al., 8 Jun 2026, Sipio et al., 4 Nov 2025).
Human-perceptual domain emergence: In LLMs, representational geometry transiently aligns with human perceptual spaces (color, pitch, emotion) at intermediate layers; this alignment (reported as GPA) rises, peaks, then decays or plateaus, notably later for abstract domains such as emotion (Singh et al., 27 May 2026).
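The ambiguity-bifurcation pattern above can be visualized with a layerwise divergence profile between two trajectories of the same word in different contexts. The sketch below is a toy illustration under stated assumptions: cosine distance is one plausible separation measure (the source does not specify which is used), `cosine_distance_profile` is a hypothetical helper, and the two trajectories are synthetic rather than model activations.

```python
import numpy as np

def cosine_distance_profile(tau_a, tau_b, eps=1e-12):
    """Per-layer cosine distance (1 - cosine similarity) between two trajectories."""
    num = (tau_a * tau_b).sum(axis=1)
    den = np.linalg.norm(tau_a, axis=1) * np.linalg.norm(tau_b, axis=1) + eps
    return 1.0 - num / den

# Synthetic stand-ins: both contexts start from the same vector, then drift
# along orthogonal directions as depth increases.
layers = np.arange(6)[:, None]
base = np.array([1.0, 0.0, 0.0])
tau_river = base + 0.3 * layers * np.array([0.0, 1.0, 0.0])
tau_money = base + 0.3 * layers * np.array([0.0, 0.0, 1.0])

profile = cosine_distance_profile(tau_river, tau_money)
print(np.round(profile, 3))
```

On real activations, the layer at which this profile begins to rise sharply would be a candidate disambiguation locus.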

4. Theoretical Interpretations and Mechanistic Insights

Current research interprets trajectory geometry as exposing both the substrate and dynamics of transformer computation:

Curved-path computation: Representational trajectories do not follow straight lines; instead, they exhibit systematic bends orchestrated by attention’s induced metric and connection. Proxies such as turning angles and length-to-chord ratios confirm the presence of nontrivial curvature (Sipio et al., 4 Nov 2025, Molina, 2023).
Attention as parallel transport: The process of multi-head attention is mathematically analogous to parallel transport under a learned connection, with semantic features moved along discrete paths in representation space (Sipio et al., 4 Nov 2025).
Sphere-constrained dynamics: Layer normalization confines activations to a hypersphere, and each residual-update (from attention/MLP) implements a small “move” on this manifold; thus layerwise update sequences are geodesic walks shaped by Q/K directions and V/O values (Molina, 2023).
Transient emergence and dimensionality bottlenecks: Semantic and perceptual geometries emerge most robustly at layers corresponding to a minimum in intrinsic dimensionality (ID); here, representations are maximally compressed and semantically organized, before abstraction and contraction in later layers (Valeriani et al., 2023, Singh et al., 27 May 2026).
World-model alignment: In structured prediction tasks, transformer trajectories mirror analytic sufficient-statistic trajectories, reflecting faithful internalization of process geometry (e.g., in constrained random walks, activations track the minimal predictive vector) (Brenner et al., 17 Mar 2026).
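The sphere-constrained picture above can be made concrete with a short derivation. It assumes layer normalization without a learned gain or bias and ignores the small numerical-stability constant, both of which are simplifications not stated in the source.

```latex
\mu = \frac{1}{d}\sum_{i=1}^{d} x_i, \qquad
\sigma^2 = \frac{1}{d}\sum_{i=1}^{d} (x_i - \mu)^2, \qquad
\mathrm{LN}(x) = \frac{x - \mu\,\mathbf{1}}{\sigma}
\quad\Longrightarrow\quad
\|\mathrm{LN}(x)\|_2^2 = \frac{1}{\sigma^2}\sum_{i=1}^{d} (x_i - \mu)^2 = d .
```

Under these assumptions every normalized vector has norm $\sqrt{d}$ and is orthogonal to the all-ones vector $\mathbf{1}$, so it lies on a sphere of radius $\sqrt{d}$ inside a $(d-1)$-dimensional subspace. A learned gain and bias would rescale and shift this sphere into an offset ellipsoid.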

5. Methodological Advances and Control Analyses

Modern studies employ both unsupervised and probe-free analyses:

Mean-pooling and local token-tracking: Both global and token-specific trajectories can be analyzed by pooling or individual extraction.
Multi-domain and multi-architecture protocols: Empirical studies span GPT-2, LLaMA, Gemma, Qwen, BERT, DistilBERT, RoBERTa, and vision transformers, using standardized metrics and prompt sets (Pandey et al., 8 Jun 2026, Singh et al., 27 May 2026, Valeriani et al., 2023).
Statistical and null-model controls: To rule out spurious curvature arising from high-dimensional randomness, null ensembles preserve step lengths but randomize step directions; semantic clustering is tested against shuffled-category and shuffled-layer baselines. All observed geometric effects exceed null expectations significantly (p < 0.001) (Sipio et al., 4 Nov 2025, Pandey et al., 8 Jun 2026).
Interventional experiments: “Gravitational lensing” analogues show that context words can bend token trajectories, confirming causal contextualization via geometry (Sipio et al., 4 Nov 2025).
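The step-length-preserving null ensemble described above can be sketched as follows. This is an illustrative implementation, not the cited authors' code: the helper names are hypothetical, directions are drawn uniformly on the sphere, the empirical p-value is two-sided with the common add-one correction, and all steps are assumed to be nonzero.

```python
import numpy as np

def mean_turning_angle(tau):
    """Mean angle (radians) between consecutive step vectors; assumes nonzero steps."""
    v = np.diff(tau, axis=0)
    a, b = v[:-1], v[1:]
    cos = (a * b).sum(axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)).mean())

def null_trajectory(tau, rng):
    """Keep each step's length but replace its direction with a uniformly random one."""
    v = np.diff(tau, axis=0)
    lengths = np.linalg.norm(v, axis=1, keepdims=True)
    directions = rng.normal(size=v.shape)
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    return np.vstack([tau[:1], tau[0] + np.cumsum(lengths * directions, axis=0)])

def curvature_p_value(tau, n_null=500, seed=0):
    """Two-sided empirical p-value of the observed mean turning angle against the null."""
    rng = np.random.default_rng(seed)
    observed = mean_turning_angle(tau)
    null = np.array([mean_turning_angle(null_trajectory(tau, rng)) for _ in range(n_null)])
    center = null.mean()
    extreme = np.sum(np.abs(null - center) >= np.abs(observed - center))
    return observed, (extreme + 1) / (n_null + 1)

# Toy trajectory: a smooth circular arc embedded in 64 dimensions.
t = np.linspace(0.0, 1.0, 12)
tau = np.zeros((12, 64))
tau[:, 0], tau[:, 1] = np.cos(t), np.sin(t)

observed, p = curvature_p_value(tau)
print(observed, p)
```

Random directions in high dimensions are nearly orthogonal, so the null mean turning angle sits close to π/2; a smooth trajectory like the arc here deviates strongly from that baseline.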

6. Implications, Applications, and Open Questions

The trajectory geometry perspective yields several implications:

Probe-free interpretability: Trajectory-based metrics enable unsupervised, architecture-agnostic mapping of computational phases, semantic integration zones, and disambiguation loci (Pandey et al., 8 Jun 2026).
Layer selection for semantic readout: Layers at or just past the minimum of the intrinsic-dimension profile maximize semantic structure; these are recommended for downstream tasks (Valeriani et al., 2023).
Mechanisms of abstraction: Intermediate trajectory geometry captures the transient construction of semantic, perceptual, or world-model manifolds that are then abstracted away in late layers (Singh et al., 27 May 2026).
Dynamical-systems and topology: Future directions include leveraging coordinate-free tools (persistent homology, causal intervention) and extending analyses to encoder–decoder, multilingual, and long-context models (Pandey et al., 8 Jun 2026).
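The layer-selection recommendation above presupposes an intrinsic-dimension profile across layers. The sketch below uses the maximum-likelihood form of the TwoNN estimator (ratio of second to first nearest-neighbor distance); this choice is an assumption, since the text does not name the estimator, and the data are synthetic point clouds rather than model activations. The function `twonn_id` is a hypothetical helper and assumes no duplicate points.

```python
import numpy as np

def twonn_id(X):
    """TwoNN maximum-likelihood intrinsic-dimension estimate: N / sum(log(r2 / r1))."""
    X = np.asarray(X, dtype=float)
    sq = (X ** 2).sum(axis=1)
    # Squared pairwise distances via the Gram matrix, clipped against round-off.
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    np.fill_diagonal(d2, np.inf)  # exclude self-distances
    nearest = np.partition(d2, 1, axis=1)[:, :2]  # two smallest squared distances per row
    r1, r2 = np.sqrt(nearest[:, 0]), np.sqrt(nearest[:, 1])
    return len(X) / np.log(r2 / r1).sum()

# Synthetic "layers": a 2-D plane embedded in 10-D sits between two full 10-D clouds.
rng = np.random.default_rng(0)
plane = rng.uniform(size=(1000, 2)) @ rng.normal(size=(2, 10))
cloud = rng.normal(size=(1000, 10))
layer_activations = [cloud, plane, cloud]

ids = [twonn_id(X) for X in layer_activations]
best_layer = int(np.argmin(ids))  # layer with minimal estimated intrinsic dimension
print(np.round(ids, 2), best_layer)
```

With real activations, each entry of `layer_activations` would hold one representation per input at a given layer, and the minimum of the resulting profile would mark the candidate layer for semantic readout.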

Understanding transformer behavior from the trajectory geometry perspective reveals that the “path” through representational space—its shape, curvature, and context-dependence—encodes not only what the model knows, but precisely how it arrives at each computational transformation. This approach links geometric, mechanistic, and functional analyses across the rapidly evolving landscape of transformer interpretability.