Titans and TTT Models: Scalable Memory & Adaptation

Updated 12 November 2025
  • Titans and TTT models are advanced neural architectures that combine Transformer-style attention, adaptive test-time training, and hierarchical memory systems.
  • They integrate short-term attention, long-term adaptive memory modules, and fixed persistent tokens to efficiently process very long sequences.
  • These systems achieve linear or near-linear scaling and state-of-the-art performance on diverse tasks by enabling on-the-fly specialization and online parameter adaptation.

Titans and TTT models denote a family of neural architectures and test-time training paradigms that jointly pursue scalable sequence modeling, deep online memorization, and adaptive specialization. Titans integrate a Transformer-style attention core (short-term memory), a neural long-term memory (a recurrent module with online parameter adaptation even at inference), and persistent memory tokens; TTT models capture the broader methodology of updating parts of a model at test time for improved specialization beyond the frozen, globally-trained weights. This unified family achieves linear or near-linear scaling, state-of-the-art performance on both standard and extreme long-context tasks, and provable improvements on underparameterized foundation models by leveraging both architectural and optimization-level mechanisms.

1. Architectural Foundations: Short-Term, Long-Term, and Persistent Memory

Titan architectures (Behrouz et al., 31 Dec 2024) are designed as multi-branch models comprising three memory pathways:

  • Short-Term Memory: Realized via Transformer-style causal or sliding-window attention over the most recent $L$ tokens, providing accurate local dependency modeling with $O(nLd)$ cost (for sequence length $n$, window size $L$, hidden size $d$).
  • Long-Term Memory (LMM): Implemented as a deep neural network $M_t(\cdot)$ with online-updated weights, storing compressive, associative summaries of arbitrarily long context through a gradient-based meta-learning rule. At each token $x_t$, key and value projections $k_t = x_t W_K$, $v_t = x_t W_V$ shape memory updates by gradient descent on the "surprise" loss $\ell(M_{t-1}; x_t) = \| M_{t-1}(k_t) - v_t \|_2^2$, with momentum and data-dependent gating for selective writing and forgetting (a minimal sketch of one update step follows this list):

$$M_t = (1-\alpha_t)\, M_{t-1} + \big[\eta_t S_{t-1} - \theta_t \nabla \ell(M_{t-1}; x_t)\big],$$

where the bracketed term is the new momentum state $S_t$ accumulating past surprise, $\alpha_t$ is a data-dependent forgetting gate, and $\eta_t, \theta_t$ control momentum decay and write strength.

  • Persistent Memory: The first $N_p$ sequence tokens per segment, with data-independent parameters trained offline and held fixed at test time, encoding priors.
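
To make the update rule concrete, here is a minimal NumPy sketch of a single long-term-memory step, under the simplifying assumption that the memory is a single linear map (so $M(k) = kM$) and that $\alpha_t, \eta_t, \theta_t$ are scalars supplied by the caller; all names are illustrative.

```python
import numpy as np

def lmm_step(M, S, x_t, W_K, W_V, alpha_t, eta_t, theta_t):
    """One token of the long-term-memory update for a linear memory M.

    M: (d_k, d_v) memory weights, S: (d_k, d_v) momentum ("surprise") buffer.
    """
    k_t, v_t = x_t @ W_K, x_t @ W_V            # key / value projections of x_t
    residual = k_t @ M - v_t                   # M(k_t) - v_t, the "surprise"
    grad = 2.0 * np.outer(k_t, residual)       # gradient of ||M(k_t) - v_t||^2 w.r.t. M
    S_new = eta_t * S - theta_t * grad         # momentum over past surprise
    M_new = (1.0 - alpha_t) * M + S_new        # gated write with forgetting factor alpha_t
    return M_new, S_new
```

In the full architecture $M$ is a deep network rather than one matrix and the coefficients are produced by the model as functions of the data, but the control flow of write, decay, and momentum is the same.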

Three architectural variants—Memory as Context (MAC), Memory as Gate (MAG), and Memory as Layer (MAL)—differ in their integration of LMM and attention:

| Variant | LMM Placement | Attention Integration |
|---------|---------------|------------------------|
| MAC | Fused as additional context | LMM output concatenated with persistent + input tokens; full attention over the chunk |
| MAG | Parallel | LMM and attention as separate branches, merged by a trainable gate |
| MAL | Sequential | Input passes through the LMM, then sliding-window attention |

A further "pure LMM" configuration removes attention entirely (Behrouz et al., 31 Dec 2024).
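
The following schematic sketch (Python, with placeholder callables `attn`, `lmm`, and a scalar `gate`) illustrates how the three variants wire the same two components together; the convex-combination gate in MAG and the plain concatenation in MAC are simplifying assumptions, and segmentation, positional handling, and persistent-token training are omitted.

```python
import numpy as np

def mac_forward(persistent, memory_readout, chunk, attn):
    """Memory as Context: the LMM readout is prepended as extra context tokens."""
    context = np.concatenate([persistent, memory_readout, chunk], axis=0)
    return attn(context)

def mag_forward(x, attn, lmm, gate):
    """Memory as Gate: attention and LMM run in parallel and are blended by a gate."""
    return gate * attn(x) + (1.0 - gate) * lmm(x)

def mal_forward(x, attn, lmm):
    """Memory as Layer: the LMM output feeds a sliding-window attention layer."""
    return attn(lmm(x))
```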

2. Mathematical Principles of Test-Time Training and Specialization

Test-time training (TTT) leverages the insight that globally-trained models, especially large foundation models, remain underparameterized relative to the size of the conceptual space they must model (Hübotter et al., 29 Sep 2025). The Linear Representation Hypothesis (LRH) posits:

  • There exists a high-dimensional concept map $\Phi:\mathcal{X}\to\mathbb{R}^{d_1}$ with input sparsity $s\ll d_1$.
  • Learned features $\Psi:\mathcal{X}\to\mathbb{R}^{d_2}$ (with $d_2\ll d_1$) approximate $\Phi$ locally linearly: $\Psi(x)\approx P_x\Phi(x)$.
  • The task predictor is a linear map in concept space: $f(x) = \langle \Phi(x), w^*\rangle$.

At inference, TTT adapts a local head $v_{x^*}$ for a test input $x^*$ by re-fitting on its $k$ nearest neighbors:

$$v_{x^*} = \arg\min_v \frac{1}{k} \sum_{(x,y)\in\mathcal{B}^{\Psi}_{x^*}} \ell\big(\langle \Psi(x), v\rangle, y\big),$$

where $\mathcal{B}^{\Psi}_{x^*}$ is the $k$-nearest-neighbor batch of $x^*$ in feature space.

This yields a local, sparse specialization: whereas the global predictor's test error behaves as $\mathbb{E}\big[(f(x)-\langle \Psi(x), v_\text{global}\rangle)^2\big] \approx 1-d_2/d_1 \to 1$ for $d_2\ll d_1$, TTT's error is $O(\sigma^2 s \log(d_1/s)/k)$, depending only on the active concepts in the local neighborhood. Thus TTT allows per-test-sample adaptation, reallocating capacity to the most relevant conceptual subspace and potentially outperforming frozen global heads, particularly in underparameterized regimes.
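
As a concrete, simplified instance of this fit, the sketch below computes a local linear head for one test point by closed-form least squares over its $k$ nearest neighbors in feature space; the ridge term is an added assumption for numerical stability, and all names are illustrative.

```python
import numpy as np

def ttt_local_head(psi_test, psi_pool, y_pool, k=32, ridge=1e-3):
    """Fit a local head v on the k nearest neighbours of psi_test, then predict.

    psi_pool: (N, d2) features of a labelled pool, y_pool: (N,) scalar targets.
    """
    dists = np.linalg.norm(psi_pool - psi_test, axis=1)   # distances in feature space
    nearest = np.argsort(dists)[:k]                       # the k-NN batch around x*
    Psi, y = psi_pool[nearest], y_pool[nearest]
    # Closed form of argmin_v (1/k) * sum_i (Psi_i . v - y_i)^2 + ridge * ||v||^2
    A = Psi.T @ Psi / k + ridge * np.eye(Psi.shape[1])
    v_local = np.linalg.solve(A, Psi.T @ y / k)
    return float(psi_test @ v_local)                      # specialized prediction
```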

3. Practical Algorithms and Implementation Workflows

Titans and TTT architectures can be directly instantiated as RNNs with deep test-time memorization modules. Several core algorithms and workflow elements are shared or have been systematically improved (Behrouz et al., 31 Dec 2024, Li et al., 10 Nov 2025):

  • Memory Updates: For every token (or chunk), gradient-based memory compression alternates with fast retrieval. Memory compression follows:

$$W_t = W_{t-1} - \eta_t \nabla_W \mathcal{L}\big(f(W_{t-1}, k_t), v_t\big),$$

while retrieval is a forward pass: $o_t = f(W_t, q_t)$. A chunkwise sketch combining both steps appears after this list.

  • Chunkwise Training and the Throughput-Performance Tradeoff: Efficient training splits the sequence into chunks of size $C$ to permit intra-chunk parallelization, but limits memory updates within each chunk. Larger $C$ accelerates training but impairs test-time granularity, making the choice of $C$ a central hyperparameter for chunked models (Li et al., 10 Nov 2025).
  • TNT Paradigm: To decouple efficiency and performance, TNT uses a two-stage approach:

    1. Stage 1: Pre-train with hierarchical memories (a global memory with large chunk size $C_G$; local memories with small chunk size $C_L$, sharded and run in parallel), applying periodic memory resets to the local modules.
    2. Stage 2: Fine-tune only the local memory modules at a smaller chunk size $C_L'$ while freezing the global memory, recovering or surpassing small-chunk model accuracy with minimal compute.
  • Titanization of Transformers (TPTT): TPTT (Furfaro, 21 Jun 2025) converts any pretrained causal-decoder Transformer into a Titan model by:

    • Wrapping self-attention with a LiZAttention module, which computes both the original softmax attention ($O_\text{base}$) and a mixed linearized attention ($O_\text{lin}$) via a kernel-feature approximation.
    • Fusing the two outputs with a Memory as Gate (MaG) scalar gate $\alpha$: $O = \alpha\,O_\text{lin} + (1-\alpha)\,O_\text{base}$.
    • Parameter-efficient fine-tuning (LoRA) of the projection layers and $\alpha$, keeping the vast majority of weights frozen.
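
To tie the memory-update, retrieval, and chunking pieces together, the sketch below processes a sequence chunk by chunk with the simplest possible memory, a single linear map $W$ with $f(W, k) = kW$ and squared loss; the chunk handling, learning rate, and names are illustrative assumptions rather than a reference implementation.

```python
import numpy as np

def chunked_memorize(keys, values, queries, W, lr=0.1, chunk_size=16):
    """Alternate chunk-level reads o = f(W, q) with one gradient write per chunk.

    keys/queries: (n, d_k), values: (n, d_v), W: (d_k, d_v) memory weights.
    """
    outputs = []
    for start in range(0, len(keys), chunk_size):
        K = keys[start:start + chunk_size]
        V = values[start:start + chunk_size]
        Q = queries[start:start + chunk_size]
        # Read with W frozen for the whole chunk: larger chunks parallelize
        # better but make the memory coarser, which is the chunk-size tradeoff above.
        outputs.append(Q @ W)
        # Write: one gradient step on the chunk's mean reconstruction loss.
        grad = 2.0 * K.T @ (K @ W - V) / len(K)
        W = W - lr * grad
    return np.concatenate(outputs, axis=0), W
```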

4. Computational Complexity, Scaling, and Hardware Considerations

Titans and TTT models are explicitly designed for linear or near-linear scaling with sequence length (Behrouz et al., 31 Dec 2024, Li et al., 10 Nov 2025, Furfaro, 21 Jun 2025):

| Model | Time per Step | Memory Usage | Notes |
|-------|---------------|--------------|-------|
| Transformer (softmax) | $O(n^2 d)$ | $O(n^2)$ | Quadratic in sequence length |
| Linear RNN (e.g. Mamba) | $O(n d^2)$ | $O(d^2)$ | Linear in $n$, large $d$ |
| Titan (attention + LMM) | $O(nLd) + O(np)$ | $O(Ln) + O(p)$ | $L \ll n$; $p$ is the memory-network size |
| TPTT/LiZA | $O(n d^2)$ | $O(nd)$ | Linear/tunable via LiZA |
| TNT (chunked) | $O(n/C)$ parallel | $O(pC)$ per chunk | Hardware saturation |

TNT achieves up to $17\times$ throughput gains compared to unchunked Titans, matching or exceeding the efficiency of FlashAttention at long sequence lengths. Its two-stage approach decouples the chunk sizes used for training and inference, optimizing for both speed and final accuracy (Li et al., 10 Nov 2025).

5. Empirical Performance and Specialization Regimes

Experimental results verify the effectiveness and efficiency of these approaches across diverse tasks and scaling regimes (Behrouz et al., 31 Dec 2024, Hübotter et al., 29 Sep 2025, Furfaro, 21 Jun 2025, Li et al., 10 Nov 2025):

  • Language Modeling (Wikitext, 340M parameters, 4K context): Perplexity falls from Transformer++ 31.52 to MAG 25.07; average reasoning accuracy improves from 42.9% (Transformer++) to 47.5% (MAG).
  • Needle-in-a-Haystack (S-NIAH): MAG achieves 97.4% on S-NIAH-W 16K, surpassing TTT and DeltaNet baselines.
  • Long-Context Reasoning (BABILong): The MAC variant outperforms GPT-4 and Mamba-2.8B at 10K–100K tokens.
  • Time Series and Genomics: LMM module attains best published MSE/MAE on ETT, ECL, Traffic, and Weather datasets, and superior accuracy on genomics benchmarks.
  • Scaling to Very Long Contexts: Titans process $>2$M tokens in linear time, beyond Transformer hardware constraints ($\sim$65K).
  • TTT Scaling Analysis (Hübotter et al., 29 Sep 2025):
    • TTT yields the largest gains where model head capacity is limited ($d_2\ll d_1$), especially in mid-sized models (0.5B–10B).
    • Accuracy improvements of 1–3% in vision and 0.05–0.15 bpb in autoregressive language modeling.
    • TTT benefits decrease above $\sim$30–100B parameters, where global solutions can disentangle more concepts.
  • TPTT (Transforming Pretrained Transformer into Titans) (Furfaro, 21 Jun 2025):
    • Titans-Llama-3.2-1B achieves an Exact Match (EM) of 0.2456 vs. base Llama-3.2-1B at 0.0070 on MMLU ($\sim$20x gain).
    • The linearized attention path enables $O(n d^2)$ inference, with LoRA-based finetuning preserving most pretrained weights.
    • All modifications are compatible with Hugging Face APIs and require no retraining from scratch.

6. Advantages, Limitations, and Practical Recommendations

Titans and TTT models present a modular, extensible alternative to pure-attention Transformers and canonical RNNs:

Advantages:

  • Online Adaptation: LMM branch allows genuine test-time learning; models can adapt to OOD data or distributional shifts sequentially.
  • Memory Hierarchy: Persistent–contextual memory separation supports both fixed priors and rapidly-adaptive context.
  • Linear Scaling: Achievable via chunkwise parallelism and kernelized attention; practical for extremely long sequences.
  • Compatibility: Frameworks like TPTT enable rapid "Titanization" of pretrained Transformers, with PEFT methods (e.g., LoRA) minimizing tuning overhead.

Limitations:

  • TTT Overhead: Test-time adaptation requires on-the-fly gradient computation at inference, increasing latency unless specialized hardware is used or fast local heads (LoRA, adapters) are employed.
  • Hyperparameter Sensitivity: Memory-network size $p$, gating coefficients $(\alpha_t, \eta_t, \theta_t)$, and chunk size $C$ require careful tuning to avoid catastrophic forgetting or memorization collapse.
  • Use Case Specialization: MAC versus MAG versus MAL have distinct tradeoffs for long-range capture versus throughput.

Best Practices:

  • Employ TNT-style staged training for efficiency with Titans or deep-memory RNNs, separating chunk size for speed and final accuracy.
  • Use adaptive gating and momentum in LMM to balance stability and fast adaptation.
  • For Titanizing existing models, install TPTT, apply LiZA/MaG modules, fine-tune LoRA adapters on target data, and deploy with standard APIs (Furfaro, 21 Jun 2025); a minimal sketch of the MaG fusion follows this list.
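
For reference, here is a minimal sketch of the MaG output fusion from Section 3: a learned scalar gate mixes the linearized-attention output with the original softmax-attention output. The sigmoid parameterization of $\alpha$ and the function names are assumptions for illustration, not the TPTT package's actual API.

```python
import numpy as np

def mag_fuse(o_lin, o_base, gate_logit):
    """O = alpha * O_lin + (1 - alpha) * O_base, with alpha kept in (0, 1)."""
    alpha = 1.0 / (1.0 + np.exp(-gate_logit))  # sigmoid of a learnable scalar
    return alpha * o_lin + (1.0 - alpha) * o_base
```

Under a LoRA-style fine-tuning regime, the gate logit and the low-rank adapters on the projection layers would typically be the only trainable parameters.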

When to use TTT: TTT should be considered when model head performance is not saturated and local fine-tuning still yields gains on seen (or similar) data—a regime most common in mid-sized models with non-trivial underparameterization (Hübotter et al., 29 Sep 2025).

7. Extensions, Controversies, and Future Directions

Potential future developments and open issues include:

  • Specialized Memory Modules: Replacing MLP-based memory with graph, Transformer, or SSM-based modules may further enhance compressive capacity.
  • Conditional/learned gating: Learning gating mechanisms entirely from data (rather than hand-crafting $\alpha_t, \eta_t, \theta_t$) to improve robustness and flexibility.
  • Hierarchical and multi-resolution chunking: Extending the TNT hierarchy to $N>1$ local modules and exploring variable chunking strategies for multi-scale adaptation.
  • Test-Time Specialization vs. Mixture-of-Experts: MoE architectures may asymptotically match TTT improvements by encoding multiple local heads ahead of time, obviating online optimization cost if expert routing is accurate.
  • Kernel-level optimization: Closing the gap with FlashAttention and BERT-style optimized kernels for recurrent/parallel scan and memory operations.
  • Sparsity and Concept Disentanglement Analysis: Elucidating the connection between explicit and implicit sparsity in concept space and TTT’s success in local adaptation, particularly in the context of the LRH (Hübotter et al., 29 Sep 2025).

Controversies include the computational cost of TTT at scale, the diminishing returns above certain parameter thresholds, and the optimality of memory-net architectures for different modalities and domains.


Titans and TTT models collectively offer a principled, empirically validated approach to test-time specialization, online memory-augmented inference, and scalable long-context modeling, bridging the gap between theoretical motivation and practical applicability across domains.
