Why deep sequence models memorize geometrically
Determine the mechanisms by which Transformer and Mamba sequence models, trained with next-token prediction and local edge supervision, develop global geometric parametric memory that encodes multi-hop relationships among entities, despite the absence of explicit architectural bottlenecks, optimization regularization, or multi-hop supervision.
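A minimal sketch of the kind of setup this problem refers to (not the authors' code, and assuming a toy path graph, a tiny Transformer, and tied input/output embeddings as illustrative choices): train with next-token-style prediction on one-hop edges only, then probe whether the learned entity embeddings nonetheless track multi-hop graph distance. The paper also studies Mamba models, which this sketch does not cover.

```python
# Hypothetical illustration: local (one-hop) edge supervision only, then a probe
# for global geometric structure in the learned entity embeddings.
import torch
import torch.nn as nn

torch.manual_seed(0)

N = 20                                            # entities on a path graph 0-1-...-19
edges = [(i, i + 1) for i in range(N - 1)] + \
        [(i + 1, i) for i in range(N - 1)]        # one-hop edges, both directions
heads = torch.tensor([h for h, _ in edges])
tails = torch.tensor([t for _, t in edges])       # target: predict tail given head


class TinyTransformer(nn.Module):
    """Small Transformer; output logits use the tied entity embedding table."""

    def __init__(self, vocab, d_model=32):
        super().__init__()
        self.emb = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, dim_feedforward=64,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, tokens):
        h = self.encoder(self.emb(tokens))
        return h[:, -1] @ self.emb.weight.t()     # next-token logits over entities


model = TinyTransformer(N)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Training sees only single edges (local supervision), never multi-hop paths.
for step in range(2000):
    logits = model(heads.unsqueeze(1))
    loss = nn.functional.cross_entropy(logits, tails)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Probe: does pairwise embedding distance correlate with multi-hop graph distance?
with torch.no_grad():
    E = model.emb.weight                          # (N, d_model) entity embeddings
    emb_dist = torch.cdist(E, E)
    graph_dist = torch.tensor([[abs(i - j) for j in range(N)] for i in range(N)],
                              dtype=torch.float)
    iu = torch.triu_indices(N, N, offset=1)
    corr = torch.corrcoef(torch.stack([emb_dist[iu[0], iu[1]],
                                       graph_dist[iu[0], iu[1]]]))[0, 1]
    print(f"correlation(embedding distance, multi-hop distance) = {corr.item():.3f}")
```

Tying the output layer to the embedding table is an illustrative choice here, made so that any geometric structure in the embeddings directly drives the next-token predictions; the open question is why such global structure emerges at all from purely local supervision.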
References
Noroozizadeh et al., "Deep sequence models tend to memorize geometrically; it is unclear why," arXiv:2510.26745, 30 Oct 2025 (quoted from the title).