Titans and TTT Models: Scalable Memory & Adaptation

Updated 12 November 2025
  • Titans and TTT models are advanced neural architectures that combine Transformer-style attention, adaptive test-time training, and hierarchical memory systems.
  • They integrate short-term attention, long-term adaptive memory modules, and fixed persistent tokens to efficiently process very long sequences.
  • These systems achieve linear or near-linear scaling and state-of-the-art performance on diverse tasks by enabling on-the-fly specialization and online parameter adaptation.

Titans and TTT models denote a family of neural architectures and test-time training paradigms that jointly pursue scalable sequence modeling, deep online memorization, and adaptive specialization. Titans integrate a Transformer-style attention core (short-term memory), a neural long-term memory (a recurrent module with online parameter adaptation even at inference), and persistent memory tokens; TTT models capture the broader methodology of updating parts of a model at test time for improved specialization beyond the frozen, globally-trained weights. This unified family achieves linear or near-linear scaling, state-of-the-art performance on both standard and extreme long-context tasks, and provable improvements on underparameterized foundation models by leveraging both architectural and optimization-level mechanisms.

1. Architectural Foundations: Short-Term, Long-Term, and Persistent Memory

Titan architectures (Behrouz et al., 31 Dec 2024) are designed as multi-branch models comprising three memory pathways:

  • Short-Term Memory: Realized via Transformer-style causal or sliding-window attention over the most recent $L$ tokens, providing accurate local dependency modeling with $O(nLd)$ cost (for sequence length $n$, window size $L$, hidden size $d$).
  • Long-Term Memory (LMM): Implemented as a deep neural network $M_t(\cdot)$ with online-updated weights, storing compressive, associative summaries of arbitrarily long context through a gradient-based meta-learning rule. At each token $x_t$, key and value projections $k_t = x_t W_K$, $v_t = x_t W_V$ shape memory updates by gradient descent on the "surprise" loss $\ell(M_{t-1}; x_t) = \| M_{t-1}(k_t) - v_t \|_2^2$, with momentum and data-dependent gating for selective writing and forgetting (a minimal sketch of one update step follows this list):

$$M_t = (1-\alpha_t)\, M_{t-1} + \big[\eta_t S_{t-1} - \theta_t \nabla \ell(M_{t-1}; x_t)\big],$$

where the bracketed term is the new momentum state $S_t$ accumulating past surprise, $\alpha_t$ is a data-dependent forgetting gate, and $\eta_t, \theta_t$ control momentum decay and write strength.

  • Persistent Memory: The first $N_p$ sequence tokens per segment, with data-independent parameters trained offline and held fixed at test time, encoding priors.
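
To make the update rule concrete, here is a minimal NumPy sketch of a single long-term-memory step, under the simplifying assumption that the memory is a single linear map (so $M(k) = kM$) and that $\alpha_t, \eta_t, \theta_t$ are scalars supplied by the caller; all names are illustrative.

```python
import numpy as np

def lmm_step(M, S, x_t, W_K, W_V, alpha_t, eta_t, theta_t):
    """One token of the long-term-memory update for a linear memory M.

    M: (d_k, d_v) memory weights, S: (d_k, d_v) momentum ("surprise") buffer.
    """
    k_t, v_t = x_t @ W_K, x_t @ W_V            # key / value projections of x_t
    residual = k_t @ M - v_t                   # M(k_t) - v_t, the "surprise"
    grad = 2.0 * np.outer(k_t, residual)       # gradient of ||M(k_t) - v_t||^2 w.r.t. M
    S_new = eta_t * S - theta_t * grad         # momentum over past surprise
    M_new = (1.0 - alpha_t) * M + S_new        # gated write with forgetting factor alpha_t
    return M_new, S_new
```

In the full architecture $M$ is a deep network rather than one matrix and the coefficients are produced by the model as functions of the data, but the control flow of write, decay, and momentum is the same.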

Three architectural variants—Memory as Context (MAC), Memory as Gate (MAG), and Memory as Layer (MAL)—differ in their integration of LMM and attention:

| Variant | LMM Placement | Attention Integration |
|---------|---------------|------------------------|
| MAC | Fused as additional context | LMM output concatenated with persistent + input tokens; full attention over the chunk |
| MAG | Parallel | LMM and attention as separate branches, merged by a trainable gate |
| MAL | Sequential | Input passes through the LMM, then sliding-window attention |

A further "pure LMM" configuration removes attention entirely (Behrouz et al., 31 Dec 2024).
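
The following schematic sketch (Python, with placeholder callables `attn`, `lmm`, and a scalar `gate`) illustrates how the three variants wire the same two components together; the convex-combination gate in MAG and the plain concatenation in MAC are simplifying assumptions, and segmentation, positional handling, and persistent-token training are omitted.

```python
import numpy as np

def mac_forward(persistent, memory_readout, chunk, attn):
    """Memory as Context: the LMM readout is prepended as extra context tokens."""
    context = np.concatenate([persistent, memory_readout, chunk], axis=0)
    return attn(context)

def mag_forward(x, attn, lmm, gate):
    """Memory as Gate: attention and LMM run in parallel and are blended by a gate."""
    return gate * attn(x) + (1.0 - gate) * lmm(x)

def mal_forward(x, attn, lmm):
    """Memory as Layer: the LMM output feeds a sliding-window attention layer."""
    return attn(lmm(x))
```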

2. Mathematical Principles of Test-Time Training and Specialization

Test-time training (TTT) leverages the insight that globally-trained models, especially large foundation models, remain underparameterized relative to the size of the conceptual space they must model (Hübotter et al., 29 Sep 2025). The Linear Representation Hypothesis (LRH) posits:

  • There exists a high-dimensional concept map $\Phi:\mathcal{X}\to\mathbb{R}^{d_1}$ with input sparsity $s\ll d_1$.
  • Learned features $\Psi:\mathcal{X}\to\mathbb{R}^{d_2}$ (with $d_2\ll d_1$) approximate $\Phi$ locally linearly: $\Psi(x)\approx P_x\Phi(x)$.
  • The task predictor is a linear map in concept space: $f(x) = \langle \Phi(x), w^*\rangle$.

At inference, TTT adapts a local head $v_{x^*}$ for a test input $x^*$ by re-fitting on its $k$ nearest neighbors:

$$v_{x^*} = \arg\min_v \frac{1}{k} \sum_{(x,y)\in\mathcal{B}^{\Psi}_{x^*}} \ell\big(\langle \Psi(x), v\rangle, y\big),$$

where $\mathcal{B}^{\Psi}_{x^*}$ is the $k$-nearest-neighbor batch of $x^*$ in feature space.

This yields a local, sparse specialization: whereas the global predictor's test error behaves as $\mathbb{E}\big[(f(x)-\langle \Psi(x), v_\text{global}\rangle)^2\big] \approx 1-d_2/d_1 \to 1$ for $d_2\ll d_1$, TTT's error is $O(\sigma^2 s \log(d_1/s)/k)$, depending only on the active concepts in the local neighborhood. Thus TTT allows per-test-sample adaptation, reallocating capacity to the most relevant conceptual subspace and potentially outperforming frozen global heads, particularly in underparameterized regimes.
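
As a concrete, simplified instance of this fit, the sketch below computes a local linear head for one test point by closed-form least squares over its $k$ nearest neighbors in feature space; the ridge term is an added assumption for numerical stability, and all names are illustrative.

```python
import numpy as np

def ttt_local_head(psi_test, psi_pool, y_pool, k=32, ridge=1e-3):
    """Fit a local head v on the k nearest neighbours of psi_test, then predict.

    psi_pool: (N, d2) features of a labelled pool, y_pool: (N,) scalar targets.
    """
    dists = np.linalg.norm(psi_pool - psi_test, axis=1)   # distances in feature space
    nearest = np.argsort(dists)[:k]                       # the k-NN batch around x*
    Psi, y = psi_pool[nearest], y_pool[nearest]
    # Closed form of argmin_v (1/k) * sum_i (Psi_i . v - y_i)^2 + ridge * ||v||^2
    A = Psi.T @ Psi / k + ridge * np.eye(Psi.shape[1])
    v_local = np.linalg.solve(A, Psi.T @ y / k)
    return float(psi_test @ v_local)                      # specialized prediction
```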

3. Practical Algorithms and Implementation Workflows

Titans and TTT architectures can be directly instantiated as RNNs with deep test-time memorization modules. Several core algorithms and workflow elements are shared or have been systematically improved (Behrouz et al., 31 Dec 2024, Li et al., 10 Nov 2025):

  • Memory Updates: For every token (or chunk), gradient-based memory compression alternates with fast retrieval. Memory compression follows:

$$W_t = W_{t-1} - \eta_t \nabla_W \mathcal{L}\big(f(W_{t-1}, k_t), v_t\big),$$

while retrieval is a forward pass: $o_t = f(W_t, q_t)$. A chunkwise sketch combining both steps appears after this list.

  • Chunkwise Training and the Throughput-Performance Tradeoff: Efficient training splits the sequence into chunks of size $C$ to permit intra-chunk parallelization, but limits memory updates within each chunk. Larger $C$ accelerates training but impairs test-time granularity, making the choice of $C$ a central hyperparameter for chunked models (Li et al., 10 Nov 2025).
  • TNT Paradigm: To decouple efficiency and performance, TNT uses a two-stage approach:

    1. Stage 1: Pre-train with hierarchical memories (a global memory with large chunk size $C_G$; local memories with small chunk size $C_L$, sharded and run in parallel), applying periodic memory resets to the local modules.
    2. Stage 2: Fine-tune only the local memory modules at a smaller chunk size $C_L'$ while freezing the global memory, recovering or surpassing small-chunk model accuracy with minimal compute.
  • Titanization of Transformers (TPTT): TPTT (Furfaro, 21 Jun 2025) converts any pretrained causal-decoder Transformer into a Titan model by:

    • Wrapping self-attention with a LiZAttention module, which computes both the original softmax attention ($O_\text{base}$) and a mixed linearized attention ($O_\text{lin}$) via a kernel-feature approximation.
    • Fusing the two outputs with a Memory as Gate (MaG) scalar gate $\alpha$: $O = \alpha\,O_\text{lin} + (1-\alpha)\,O_\text{base}$.
    • Parameter-efficient fine-tuning (LoRA) of the projection layers and $\alpha$, keeping the vast majority of weights frozen.
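
To tie the memory-update, retrieval, and chunking pieces together, the sketch below processes a sequence chunk by chunk with the simplest possible memory, a single linear map $W$ with $f(W, k) = kW$ and squared loss; the chunk handling, learning rate, and names are illustrative assumptions rather than a reference implementation.

```python
import numpy as np

def chunked_memorize(keys, values, queries, W, lr=0.1, chunk_size=16):
    """Alternate chunk-level reads o = f(W, q) with one gradient write per chunk.

    keys/queries: (n, d_k), values: (n, d_v), W: (d_k, d_v) memory weights.
    """
    outputs = []
    for start in range(0, len(keys), chunk_size):
        K = keys[start:start + chunk_size]
        V = values[start:start + chunk_size]
        Q = queries[start:start + chunk_size]
        # Read with W frozen for the whole chunk: larger chunks parallelize
        # better but make the memory coarser, which is the chunk-size tradeoff above.
        outputs.append(Q @ W)
        # Write: one gradient step on the chunk's mean reconstruction loss.
        grad = 2.0 * K.T @ (K @ W - V) / len(K)
        W = W - lr * grad
    return np.concatenate(outputs, axis=0), W
```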

4. Computational Complexity, Scaling, and Hardware Considerations

Titans and TTT models are explicitly designed for linear or near-linear scaling with sequence length (Behrouz et al., 31 Dec 2024, Li et al., 10 Nov 2025, Furfaro, 21 Jun 2025):

| Model | Time per Step | Memory Usage | Notes |
|-------|---------------|--------------|-------|
| Transformer (softmax) | $O(n^2 d)$ | $O(n^2)$ | Quadratic in sequence length |
| Linear RNN (e.g. Mamba) | $O(n d^2)$ | $O(d^2)$ | Linear in $n$, large $d$ |
| Titan (attention + LMM) | $O(nLd) + O(np)$ | $O(Ln) + O(p)$ | $L \ll n$; $p$ is the memory-network size |
| TPTT/LiZA | $O(n d^2)$ | $O(nd)$ | Linear/tunable via LiZA |
| TNT (chunked) | $O(n/C)$ parallel | $O(pC)$ per chunk | Hardware saturation |

TNT achieves up to $17\times$ throughput gains compared to unchunked Titans, matching or exceeding the efficiency of FlashAttention at long sequence lengths. Its two-stage approach decouples the chunk sizes used for training and inference, optimizing for both speed and final accuracy (Li et al., 10 Nov 2025).

5. Empirical Performance and Specialization Regimes

Experimental results verify the effectiveness and efficiency of these approaches across diverse tasks and scaling regimes (Behrouz et al., 31 Dec 2024, Hübotter et al., 29 Sep 2025, Furfaro, 21 Jun 2025, Li et al., 10 Nov 2025):

  • Language Modeling (Wikitext, 340M parameters, 4K context): Perplexity falls from Transformer++ 31.52 to MAG 25.07; average reasoning accuracy improves from 42.9% (Transformer++) to 47.5% (MAG).
  • Needle-in-a-Haystack (S-NIAH): MAG achieves 97.4% on S-NIAH-W 16K, surpassing TTT and DeltaNet baselines.
  • Long-Context Reasoning (BABILong): The MAC variant outperforms GPT-4 and Mamba-2.8B at 10K–100K tokens.
  • Time Series and Genomics: LMM module attains best published MSE/MAE on ETT, ECL, Traffic, and Weather datasets, and superior accuracy on genomics benchmarks.
  • Scaling to Very Long Contexts: Titans process $>2$M tokens in linear time, beyond Transformer hardware constraints ($\sim$65K).
  • TTT Scaling Analysis (Hübotter et al., 29 Sep 2025):
    • TTT yields the largest gains where model head capacity is limited ($d_2\ll d_1$), especially in mid-sized models (0.5B–10B).
    • Accuracy improvements of 1–3% in vision and 0.05–0.15 bpb in autoregressive language modeling.
    • TTT benefits decrease above $\sim$30–100B parameters, where global solutions can disentangle more concepts.
  • TPTT (Transforming Pretrained Transformer into Titans) (Furfaro, 21 Jun 2025):
    • Titans-Llama-3.2-1B achieves an Exact Match (EM) of 0.2456 vs. base Llama-3.2-1B at 0.0070 on MMLU ($\sim$20x gain).
    • The linearized attention path enables $O(n d^2)$ inference, with LoRA-based finetuning preserving most pretrained weights.
    • All modifications are compatible with Hugging Face APIs and require no retraining from scratch.

6. Advantages, Limitations, and Practical Recommendations

Titans and TTT models present a modular, extensible alternative to pure-attention Transformers and canonical RNNs:

Advantages:

  • Online Adaptation: LMM branch allows genuine test-time learning; models can adapt to OOD data or distributional shifts sequentially.
  • Memory Hierarchy: Persistent–contextual memory separation supports both fixed priors and rapidly-adaptive context.
  • Linear Scaling: Achievable via chunkwise parallelism and kernelized attention; practical for extremely long sequences.
  • Compatibility: Frameworks like TPTT enable rapid "Titanization" of pretrained Transformers, with PEFT methods (e.g., LoRA) minimizing tuning overhead.

Limitations:

  • TTT Overhead: Test-time adaptation requires on-the-fly gradient computation at inference, increasing latency unless specialized hardware is used or fast local heads (LoRA, adapters) are employed.
  • Hyperparameter Sensitivity: Memory-network size $p$, gating coefficients $(\alpha_t, \eta_t, \theta_t)$, and chunk size $C$ require careful tuning to avoid catastrophic forgetting or memorization collapse.
  • Use Case Specialization: MAC versus MAG versus MAL have distinct tradeoffs for long-range capture versus throughput.

Best Practices:

  • Employ TNT-style staged training for efficiency with Titans or deep-memory RNNs, separating chunk size for speed and final accuracy.
  • Use adaptive gating and momentum in LMM to balance stability and fast adaptation.
  • For Titanizing existing models, install TPTT, apply LiZA/MaG modules, fine-tune LoRA adapters on target data, and deploy with standard APIs (Furfaro, 21 Jun 2025); a minimal sketch of the MaG fusion follows this list.
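
For reference, here is a minimal sketch of the MaG output fusion from Section 3: a learned scalar gate mixes the linearized-attention output with the original softmax-attention output. The sigmoid parameterization of $\alpha$ and the function names are assumptions for illustration, not the TPTT package's actual API.

```python
import numpy as np

def mag_fuse(o_lin, o_base, gate_logit):
    """O = alpha * O_lin + (1 - alpha) * O_base, with alpha kept in (0, 1)."""
    alpha = 1.0 / (1.0 + np.exp(-gate_logit))  # sigmoid of a learnable scalar
    return alpha * o_lin + (1.0 - alpha) * o_base
```

Under a LoRA-style fine-tuning regime, the gate logit and the low-rank adapters on the projection layers would typically be the only trainable parameters.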

When to use TTT: TTT should be considered when model head performance is not saturated and local fine-tuning still yields gains on seen (or similar) data—a regime most common in mid-sized models with non-trivial underparameterization (Hübotter et al., 29 Sep 2025).

7. Extensions, Controversies, and Future Directions

Potential future developments and open issues include:

  • Specialized Memory Modules: Replacing MLP-based memory with graph, Transformer, or SSM-based modules may further enhance compressive capacity.
  • Conditional/learned gating: Learning gating mechanisms entirely from data (rather than hand-crafting $\alpha_t, \eta_t, \theta_t$) to improve robustness and flexibility.
  • Hierarchical and multi-resolution chunking: Extending the TNT hierarchy to $N>1$ local modules and exploring variable chunking strategies for multi-scale adaptation.
  • Test-Time Specialization vs. Mixture-of-Experts: MoE architectures may asymptotically match TTT improvements by encoding multiple local heads ahead of time, obviating online optimization cost if expert routing is accurate.
  • Kernel-level optimization: Closing the gap with FlashAttention and BERT-style optimized kernels for recurrent/parallel scan and memory operations.
  • Sparsity and Concept Disentanglement Analysis: Elucidating the connection between explicit and implicit sparsity in concept space and TTT’s success in local adaptation, particularly in the context of the LRH (Hübotter et al., 29 Sep 2025).

Controversies include the computational cost of TTT at scale, the diminishing returns above certain parameter thresholds, and the optimality of memory-net architectures for different modalities and domains.


Titans and TTT models collectively offer a principled, empirically validated approach to test-time specialization, online memory-augmented inference, and scalable long-context modeling, bridging the gap between theoretical motivation and practical applicability across domains.
