The Geometric Alignment Tax: Tokenization vs. Continuous Geometry in Scientific Foundation Models

Published 5 Apr 2026 in cs.LG, cs.IT, q-bio.QM, and stat.ML | (2604.04155v1)

Abstract: Foundation models for biology and physics optimize predictive accuracy, but their internal representations systematically fail to preserve the continuous geometry of the systems they model. We identify the root cause: the Geometric Alignment Tax, an intrinsic cost of forcing continuous manifolds through discrete categorical bottlenecks. Controlled ablations on synthetic dynamical systems demonstrate that replacing cross-entropy with a continuous head on an identical encoder reduces geometric distortion by up to 8.5x, while learned codebooks exhibit a non-monotonic double bind where finer quantization worsens geometry despite improving reconstruction. Under continuous objectives, three architectures differ by 1.3x; under discrete tokenization, they diverge by 3,000x. Evaluating 14 biological foundation models with rate-distortion theory and MINE, we identify three failure regimes: Local-Global Decoupling, Representational Compression, and Geometric Vacuity. A controlled experiment confirms that Evo 2's reverse-complement robustness on real DNA reflects conserved sequence composition, not learned symmetry. No model achieves simultaneously low distortion, high mutual information, and global coherence.


Authors (1)

Prashant C. Raju

Summary

The paper demonstrates that discrete tokenization introduces a significant geometric alignment tax, with up to an 8.5× distortion gap in Procrustes distance.
The paper employs controlled synthetic experiments and model evaluations across continuous physics and discrete biology tracks to isolate tokenization as the root cause of manifold fracture.
The paper shows that scaling models or increasing codebook granularity fails to restore global geometric fidelity, highlighting the need for continuous objectives in foundation models.

The Geometric Alignment Tax in Scientific Foundation Models: Mechanistic and Empirical Analysis

Introduction

This paper interrogates the representational geometry of current foundation models in biology and physics, focusing on the intrinsic geometric distortion incurred by discrete tokenization of fundamentally continuous physical manifolds—a phenomenon termed the Geometric Alignment Tax. Through a rigorous program of controlled synthetic experiments, information-theoretic formalization, and model zoo-scale empirical evaluation, the study establishes that discretization at the token level (e.g., via cross-entropy objectives over categorical vocabularies) is the core bottleneck for faithfully aligning learned representations with their continuous scientific ground truths. The analysis exposes the limitations of architectural modifications (Transformer, SSM, hybrids), regularization schemes, and tokenization granularity, and illustrates how existing evaluation metrics fail to surface these critical geometric deficiencies.

The Causal Structure of the Geometric Tax

The study first isolates tokenization as the sole causal mechanism underlying manifold fracture. In controlled synthetic dynamical systems (sine waves, damped oscillators, Lorenz attractors), parameter-matched architectures are ablated by replacing categorical cross-entropy (CE) heads with continuous regression (MSE) heads on an identical encoder. The result is a discrete-to-continuous distortion gap of up to $8.5\times$ in Procrustes distance, with all continuous variants converging to low distortion regardless of backbone (Transformer, SSM, hybrid). Under continuous objectives the backbones differ by only $1.3\times$, compared with the $3{,}000\times$ divergence observed under discrete tokenization in the biological track. Vector quantization (VQ) ablations further demonstrate a non-monotonic double bind: finer codebooks improve reconstruction error but worsen geometric continuity (as measured by boundary-crossing probability and Procrustes $D$), with empirical distortion following a very slow $1/\log(K)$ law in the codebook size $K$.
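To make the Procrustes measurement concrete, the following is a minimal sketch, not the paper's evaluation code. It assumes the standard definition of Procrustes disparity (center both point sets, scale to unit Frobenius norm, align with the best rotation and scale) and uses a toy circle manifold; the function name `procrustes_disparity` and the choice of 8 phase bins are illustrative assumptions.

```python
import numpy as np

def procrustes_disparity(X, Y):
    """Procrustes disparity between two point sets whose rows correspond.

    Both sets are centered and scaled to unit Frobenius norm, then Y is
    aligned to X with the optimal rotation/reflection and scale.
    Returns a value in [0, 1]; 0 means identical shape.
    """
    X0 = np.asarray(X, dtype=float) - np.mean(X, axis=0)
    Y0 = np.asarray(Y, dtype=float) - np.mean(Y, axis=0)
    X0 = X0 / np.linalg.norm(X0)
    Y0 = Y0 / np.linalg.norm(Y0)
    # Singular values of the cross-covariance give the optimal alignment.
    s = np.linalg.svd(X0.T @ Y0, compute_uv=False)
    return 1.0 - s.sum() ** 2

# Toy ground-truth manifold: points on a circle (assumption for illustration).
theta = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
truth = np.column_stack([np.cos(theta), np.sin(theta)])

# "Continuous" representation: a rotated, rescaled copy of the manifold.
angle = 0.7
R = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle),  np.cos(angle)]])
continuous = 3.0 * truth @ R.T

# "Discrete" representation: the phase is snapped to K bin centers first.
K = 8
width = 2.0 * np.pi / K
theta_q = (np.floor(theta / width) + 0.5) * width
quantized = np.column_stack([np.cos(theta_q), np.sin(theta_q)])

d_cont = procrustes_disparity(truth, continuous)
d_disc = procrustes_disparity(truth, quantized)
print(f"continuous: {d_cont:.2e}  quantized: {d_disc:.2e}")
```

Because the metric is invariant to rotation and scale, the rotated copy scores essentially zero, while the binned copy retains a nonzero disparity that no rigid alignment can remove.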

Figure 1: (A) Smooth geometric profiles for continuous-track models contrast with fractured, highly variable profiles for discrete biology; (B) Continuous objectives cluster near zero Procrustes error, while discrete-tokenized models span orders of magnitude; (C) Increasing codebook size improves reconstruction but degrades global geometry.
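The codebook trade-off in panel (C) can be illustrated with a toy example. This sketch uses a fixed uniform quantizer on a sine trajectory rather than a learned VQ codebook, and it assumes that "boundary-crossing probability" means the fraction of consecutive time steps whose code changes; both are simplifying assumptions, and the function names are hypothetical.

```python
import numpy as np

def uniform_quantize(x, K):
    """Assign values in [-1, 1] to K equal-width bins.

    Returns the integer codes and the bin-center reconstruction.
    """
    edges = np.linspace(-1.0, 1.0, K + 1)
    codes = np.clip(np.digitize(x, edges) - 1, 0, K - 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return codes, centers[codes]

def codebook_tradeoff(x, K):
    """Reconstruction MSE and fraction of steps that cross a code boundary."""
    codes, recon = uniform_quantize(x, K)
    mse = float(np.mean((x - recon) ** 2))
    crossing = float(np.mean(codes[1:] != codes[:-1]))
    return mse, crossing

# A smooth trajectory sampled finely in time (two periods of a sine wave).
t = np.linspace(0.0, 4.0 * np.pi, 2000)
x = np.sin(t)

sizes = (4, 16, 64, 256)
results = [codebook_tradeoff(x, K) for K in sizes]
for K, (mse, crossing) in zip(sizes, results):
    print(f"K={K:4d}  mse={mse:.2e}  crossing={crossing:.3f}")
```

In this toy setting the reconstruction error falls as `K` grows while the share of neighboring time steps split across different codes rises, which is the qualitative tension the double bind describes.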

Empirical Anatomy of Manifold Fracture

Extending beyond synthetic domains, an incisive evaluation protocol measures geometric stability via Representational Dissimilarity Matrix (RDM) similarity, perturbation mapping, and Procrustes alignment across perturbations. In a dual-track protocol:

Track A (Continuous Physics): Smooth transitions and low Lipschitz constants characterize all models; differences across architectures are minimal.
Track B (Discrete Biology): Foundation models operating on tokenized biological sequences (DNA, proteins) exhibit divergent, often catastrophic fracture—Transformer-based models show orders-of-magnitude higher Lipschitz variation per single mutation compared to SSM-based architectures.

Crucially, what distinguishes the two tracks is not routing or contextual attention but the introduction of a discrete, fixed-vocabulary bottleneck.
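As an illustration of the RDM-similarity component of such a protocol, here is a minimal sketch. It assumes Euclidean distances for the RDM and a Spearman-style rank correlation over the upper triangle; the paper's exact distance and correlation choices are not stated here, and the function names are hypothetical.

```python
import numpy as np

def rdm(E):
    """Representational dissimilarity matrix: pairwise Euclidean distances."""
    diff = E[:, None, :] - E[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def ranks(v):
    """Ordinal ranks of a 1-D array (adequate when values have no ties)."""
    r = np.empty(len(v))
    r[np.argsort(v, kind="stable")] = np.arange(len(v))
    return r

def rdm_similarity(E1, E2):
    """Rank correlation between the upper triangles of two RDMs."""
    iu = np.triu_indices(len(E1), k=1)
    a = ranks(rdm(E1)[iu])
    b = ranks(rdm(E2)[iu])
    return float(np.corrcoef(a, b)[0, 1])

rng = np.random.default_rng(0)
E = rng.standard_normal((40, 5))             # embeddings of 40 inputs
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))  # random orthogonal map

sim_rotated = rdm_similarity(E, 3.0 * E @ Q)  # same geometry, new basis
sim_random = rdm_similarity(E, rng.standard_normal((40, 5)))
print(f"rotated: {sim_rotated:.3f}  unrelated: {sim_random:.3f}")
```

A rotated and rescaled copy of the embeddings preserves the rank order of all pairwise distances, so its similarity stays near 1, whereas an unrelated embedding scores much lower.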

Tax Scaling Laws and Model Families

Large-scale evaluations of model suites (ESM-2, Nucleotide Transformer, SaProt, Caduceus, Evo 2) establish several robust scaling laws for the Geometric Alignment Tax:

Monotonic Degradation with Scale: For all Transformer/k-mer architecture families, composite geometric stability declines progressively from 8M to multi-billion parameter scales.
Phantom Recovery: Apparent geometric “recovery” in ultra-large models (ESM-2-15B, NT-2.5B) is unmasked as global manifold drift (Untethered Gel) via Procrustes reduction analysis rather than genuine low-distortion structure.

Figure 2: (A) Stability declines monotonically on ESM-2 with scale, with a misleading recovery at 15B that is accompanied by a spike in Procrustes reduction, indicating global drift; (B) Conceptual illustration of 'Brittle Glass' (internal fracture) and 'Untethered Gel' (coherent manifold drift) regimes.

Ablations, including Jacobian norm penalties and attention ratio sweeps, confirm that neither explicit smoothness penalties nor reduction in attention layers sufficiently rescues the tax without paying a cost in predictive accuracy.
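For readers unfamiliar with Jacobian norm penalties, the sketch below shows one common way to estimate such a penalty without automatic differentiation. It is an assumption about the general technique, not the paper's formulation: the squared Frobenius norm of the Jacobian is approximated by averaging finite-difference responses to random Gaussian probe directions, and `jacobian_penalty` is a hypothetical name.

```python
import numpy as np

def jacobian_penalty(f, x, eps=1e-3, n_probes=8, seed=0):
    """Monte Carlo estimate of the squared Frobenius norm of df/dx at x.

    For v ~ N(0, I), the expectation of ||J v||^2 equals ||J||_F^2, and
    J v is approximated by the finite difference (f(x + eps*v) - f(x)) / eps.
    """
    rng = np.random.default_rng(seed)
    fx = f(x)
    total = 0.0
    for _ in range(n_probes):
        v = rng.standard_normal(x.shape)
        jv = (f(x + eps * v) - fx) / eps
        total += float(np.sum(jv ** 2))
    return total / n_probes

x = np.array([0.5, -1.0, 2.0])

gentle = jacobian_penalty(lambda z: z, x)        # identity map
steep = jacobian_penalty(lambda z: 2.0 * z, x)   # twice as sensitive
print(f"gentle: {gentle:.3f}  steep: {steep:.3f}")
```

Adding a term like this to a training loss discourages sharp local changes in the representation; the ablation result above is that doing so does not remove the tax without a cost in predictive accuracy.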

Information-Theoretic Characterization and Pathological Regimes

The alignment tax is formalized via rate–distortion theory: the discrete representation channel’s capacity limit dictates the achievable lower bound on geometric distortion, and empirical scaling with codebook size $K$ validates a $1/\log(K)$ (rather than $1/K$) decay of geometric error with increased vocabulary. Mutual information estimation (via MINE) locates foundation models in one of three pathological regimes:

Local–Global Decoupling: High local biological signal (e.g., $k$-mer histograms in Evo 2), but no integration of global structure, corresponding to Untethered Gel.
Representational Compression: High mutual information (e.g., OpenFold’s Evoformer), but severe warping/fracture of the geometry, matching Brittle Glass.
Geometric Vacuity: Apparently smooth, stable representations that are information-empty, carrying less ground truth association than random baselines (e.g., ProtMamba).

Figure 3: (A) Excess MI across pathologies; ProtMamba shows vacuity, Evo 2 is shallow/decoupled, OpenFold/ESM-1b compress but warp geometry; (B) Evo 2: MI is local, unresponsive to context expansion; (C) OpenFold: compression increases MI but also Procrustes disparity.

Notably, no evaluated model achieves low distortion, high mutual information, and global coherence simultaneously at any parameter or context expansion setting.
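A short worked example shows why a $1/\log(K)$ law makes larger vocabularies such a weak remedy. It takes the reported scaling at face value and assumes an unspecified proportionality constant $c$; the specific codebook sizes are chosen only for illustration.

$$
D(K) \approx \frac{c}{\log K}
\quad\Longrightarrow\quad
\frac{D(K^2)}{D(K)} = \frac{\log K}{\log K^2} = \frac{1}{2}.
$$

Squaring the codebook size therefore only halves the distortion. Growing a vocabulary from $K = 256$ to $K = 65{,}536$ would cut geometric error by a factor of 2, whereas a $1/K$ law would have cut it by a factor of 256 for the same increase:

$$
\frac{c/256}{c/65{,}536} = 256.
$$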

Symmetry Dissociation and Regularization Inefficacy

DNA modeling exposes further implications of the tax: no state-of-the-art model achieves robust reverse-complement (RC) equivariance unless it is architecturally hardwired (e.g., Caduceus). Experimental RC regularization (RCCR) on DNABERT-2 achieves perfect per-sequence RC consistency but worsens population-level Procrustes alignment: capacity spent on enforcing the symmetry is taken from manifold preservation rather than added to it.

Figure 4: (A) RCCR loss collapses, achieving perfect pointwise RC alignment; (B) RC cosine gap drops to zero; (C) However, global geometry as measured by Procrustes disparity degrades by 91%; (D) Overall composite stability appears higher, but the improvement is illusory—structure is lost.
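The per-sequence metric in panel (B) can be sketched as follows. This assumes the "RC cosine gap" is one minus the cosine similarity between a sequence's embedding and the embedding of its reverse complement, averaged over sequences; that definition and the name `rc_cosine_gap` are assumptions, and the arrays are toy values rather than model outputs.

```python
import numpy as np

def rc_cosine_gap(emb_fwd, emb_rc):
    """Mean of (1 - cosine similarity) between paired rows.

    Row i of emb_fwd is the embedding of sequence i; row i of emb_rc is
    the embedding of its reverse complement.
    """
    num = np.sum(emb_fwd * emb_rc, axis=1)
    den = np.linalg.norm(emb_fwd, axis=1) * np.linalg.norm(emb_rc, axis=1)
    return float(np.mean(1.0 - num / den))

fwd = np.array([[1.0, 0.0], [0.0, 2.0]])
orthogonal = np.array([[0.0, 1.0], [3.0, 0.0]])
collapsed = np.ones_like(fwd)  # every sequence mapped to the same vector

gap_same = rc_cosine_gap(fwd, fwd)               # perfectly RC-consistent
gap_orth = rc_cosine_gap(fwd, orthogonal)        # maximally inconsistent here
gap_collapsed = rc_cosine_gap(collapsed, collapsed)
print(gap_same, gap_orth, gap_collapsed)
```

The collapsed case is included because the gap is a purely pairwise quantity: it is also zero when every sequence shares one embedding, so a zero gap alone says nothing about whether population-level geometry has been preserved.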

Targeted experiments establish that apparent RC “robustness” in models like Evo 2 is not functional equivariance but a statistical artifact, explainable in full by per-sequence $k$-mer histograms (the Texture Hypothesis test), not structural understanding.

Figure 5: (A) Dinuc-shuffled real DNA perfectly recapitulates RC RDM similarity, supporting the histogram explanation; Markov-matched textures do not recover the effect; (B) Schematic: on synthetic DNA, RC mapping is destroyed; on real DNA, texture masks the failure.
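The histogram explanation can be made tangible with a small sketch. It is not the paper's Texture Hypothesis test; it only shows, under the assumption that a featurizer sees nothing but $k$-mer counts, how RC similarity can depend entirely on sequence composition. All function names are hypothetical.

```python
from collections import Counter
from math import sqrt

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]

def kmer_histogram(seq, k):
    """Counts of all overlapping k-mers in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def cosine(h1, h2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * h2.get(kmer, 0) for kmer, count in h1.items())
    n1 = sqrt(sum(c * c for c in h1.values()))
    n2 = sqrt(sum(c * c for c in h2.values()))
    return dot / (n1 * n2)

def rc_histogram_similarity(seq, k=2):
    """Similarity between the k-mer histograms of seq and its reverse complement."""
    rc = reverse_complement(seq)
    return cosine(kmer_histogram(seq, k), kmer_histogram(rc, k))

skewed = "AAAAAAAC"                      # strongly strand-asymmetric
base = "ACGGTA"
symmetric = base + reverse_complement(base)  # equals its own reverse complement

print(rc_histogram_similarity(skewed))     # no shared dinucleotides
print(rc_histogram_similarity(symmetric))  # identical histograms
```

A strand-symmetric sequence looks perfectly RC-robust to a count-based featurizer even though no symmetry has been learned, while a strand-asymmetric one does not, which is the kind of composition effect the shuffling controls are designed to expose.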

Implications for Foundation Modeling

The central assertion of this work is that discrete tokenization with cross-entropy is a sufficient condition for the emergence of severe geometric distortion in embedding manifolds of scientific models, irrespective of architecture or scale. This tax is:

Progressive with Model Size: Scaling up capacity worsens the tax, introducing sharper, denser regions of decision boundaries and, ultimately, manifold drift.
Invariant to Context Length: Increasing context does not recover information or geometry when the tax is already imposed; untethered or vacuous behaviors persist.
Unmitigated by Common Regularization: Explicit smoothness or symmetry regularization does not remove the tax; enforcing one constraint increases distortion elsewhere, as predicted by rate–distortion bounds.

Practically, the findings demonstrate that reliance on predictive metrics (perplexity, AUC, benchmarks) is insufficient: models may deliver high accuracy while their representations are fractured or, even in the presence of smooth structure, biologically uninformative. For real scientific applications—therapeutic design, materials discovery, biological sequence search—physical alignment becomes a necessary evaluation criterion.

Future Directions

The findings constitute a negative result for current architectural dogma. Reaching physical or scientific AGI will require hybrid or fully continuous objectives, architectural equivariance, or methods that optimize both predictive accuracy and manifold stability. Promising directions include:

Continuous-valued embedding heads beyond MSE
SSM-based architectures (with architectural equivariance, e.g., RC handling)
Geometric stability auditing as a first-class evaluation during training
Differentiable rate–distortion regularization for scientific manifold preservation
Mechanisms enabling discrete models to “borrow” geometric priors from continuous ODEs without information-vacuity collapse

Conclusion

The Geometric Alignment Tax constitutes a fundamental, regime-crossing limitation for discrete-token scientific foundation models. Empirical, ablation, and information-theoretic evidence establish that scaling, more granular tokenization, or post-hoc patching cannot rescue geometric faithfulness to continuous physical symmetries. This tax creates model pathologies—fracture, decoupled drift, vacuity—that remain invisible to standard benchmarks, forming a concrete barrier to the scientific applicability and reliability of Transformer-based or tokenized deep learning models in the natural sciences. Progress will require architectural principles native to the continuous, invariant structure of biology and physics, rather than further scaling of fundamentally discrete frameworks.
