Why deep sequence models memorize geometrically
Determine the mechanisms by which Transformer and Mamba sequence models, trained with next-token prediction and local edge supervision, develop global geometric parametric memory that encodes multi-hop relationships among entities, despite the absence of explicit architectural bottlenecks, optimization regularization, or multi-hop supervision.
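A minimal sketch of the kind of setup this problem refers to (not the authors' code, and assuming a toy path graph, a tiny Transformer, and tied input/output embeddings as illustrative choices): train with next-token-style prediction on one-hop edges only, then probe whether the learned entity embeddings nonetheless track multi-hop graph distance. The paper also studies Mamba models, which this sketch does not cover.

```python
# Hypothetical illustration: local (one-hop) edge supervision only, then a probe
# for global geometric structure in the learned entity embeddings.
import torch
import torch.nn as nn

torch.manual_seed(0)

N = 20                                            # entities on a path graph 0-1-...-19
edges = [(i, i + 1) for i in range(N - 1)] + \
        [(i + 1, i) for i in range(N - 1)]        # one-hop edges, both directions
heads = torch.tensor([h for h, _ in edges])
tails = torch.tensor([t for _, t in edges])       # target: predict tail given head


class TinyTransformer(nn.Module):
    """Small Transformer; output logits use the tied entity embedding table."""

    def __init__(self, vocab, d_model=32):
        super().__init__()
        self.emb = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, dim_feedforward=64,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, tokens):
        h = self.encoder(self.emb(tokens))
        return h[:, -1] @ self.emb.weight.t()     # next-token logits over entities


model = TinyTransformer(N)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Training sees only single edges (local supervision), never multi-hop paths.
for step in range(2000):
    logits = model(heads.unsqueeze(1))
    loss = nn.functional.cross_entropy(logits, tails)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Probe: does pairwise embedding distance correlate with multi-hop graph distance?
with torch.no_grad():
    E = model.emb.weight                          # (N, d_model) entity embeddings
    emb_dist = torch.cdist(E, E)
    graph_dist = torch.tensor([[abs(i - j) for j in range(N)] for i in range(N)],
                              dtype=torch.float)
    iu = torch.triu_indices(N, N, offset=1)
    corr = torch.corrcoef(torch.stack([emb_dist[iu[0], iu[1]],
                                       graph_dist[iu[0], iu[1]]]))[0, 1]
    print(f"correlation(embedding distance, multi-hop distance) = {corr.item():.3f}")
```

Tying the output layer to the embedding table is an illustrative choice here, made so that any geometric structure in the embeddings directly drives the next-token predictions; the open question is why such global structure emerges at all from purely local supervision.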
References
Noroozizadeh et al., "Deep sequence models tend to memorize geometrically; it is unclear why," arXiv:2510.26745, 30 Oct 2025 (quoted from the title).