
TTT-MLP: Test-Time Training & Tensor-Train MLP

Updated 25 February 2026
  • TTT-MLP is a dual framework combining Tensor-Train MLP for 95% parameter compression with a test-time trainable MLP hidden state for dynamic sequence modeling.
  • The Tensor-Train variant employs Alternating Least Squares to update low-order tensor cores, achieving rapid convergence and robust performance.
  • The test-time training MLP adapts per token using self-supervised gradient updates, providing linear computational complexity ideal for long-context tasks.

TTT-MLP refers to two distinct but technically significant architectures that leverage multilayer perceptrons (MLPs) in unconventional ways: (1) as a highly parameter-efficient variant of classical MLPs via Tensor-Train decomposition (Costa et al., 2021), and (2) as a test-time trainable RNN layer whose hidden state itself consists of a two-layer MLP, optimized by self-supervised learning during inference (Sun et al., 2024). Both interpretations address key deficiencies in mainstream deep learning frameworks: the former targets storage and robustness, the latter, expressiveness and scalability for long-context sequence modeling.

1. Tensor-Train MLP: Foundations and Methodology

The Tensor-Train MLP (TT-MLP) is based on the Tensor-Train (TT) decomposition, which represents a high-dimensional tensor as a sequential product of low-order tensor "cores." For a $d$-th-order tensor $W \in \mathbb{R}^{n_1 \times n_2 \times \cdots \times n_d}$, each entry is expressed as

W[i_1, i_2, \dots, i_d] = \sum_{\alpha_0=1}^{r_0} \sum_{\alpha_1=1}^{r_1} \cdots \sum_{\alpha_d=1}^{r_d} G_1[\alpha_0, i_1, \alpha_1] \, G_2[\alpha_1, i_2, \alpha_2] \cdots G_d[\alpha_{d-1}, i_d, \alpha_d]

where each core $G_k$ is a third-order tensor of shape $r_{k-1} \times n_k \times r_k$, with $r_0 = r_d = 1$ and $\{r_k\}$ the TT-ranks (Costa et al., 2021).
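As an illustration, the entry formula amounts to contracting the cores from left to right; the sketch below (NumPy, with toy shapes chosen for the example) does exactly this:

```python
import numpy as np

def tt_entry(cores, idx):
    """Evaluate one entry W[i1,...,id] of a tensor stored in TT format.

    cores: list of 3-D arrays G_k of shape (r_{k-1}, n_k, r_k), r_0 = r_d = 1.
    idx:   index tuple (i1, ..., id).
    """
    v = np.ones((1, 1))                  # trivial left boundary (r_0 = 1)
    for G, i in zip(cores, idx):
        v = v @ G[:, i, :]               # (1, r_{k-1}) @ (r_{k-1}, r_k)
    return float(v[0, 0])                # r_d = 1, so a scalar remains

# Toy check: a rank-1 TT of an outer product of a and b gives a[i] * b[j].
a, b = np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0])
cores = [a.reshape(1, 3, 1), b.reshape(1, 2, 1)]
print(tt_entry(cores, (2, 1)))           # 3 * 5 = 15.0
```

Because each contraction only involves one slice of one core, evaluating an entry costs $O(d r^2)$ rather than materializing the full tensor.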

To compress an MLP's weights, the $M \times N$ fully connected weight matrix is reshaped into a $2d$-th-order tensor, with $M = \prod_k m_k$ and $N = \prod_k n_k$, and approximated in TT form. The resulting parameter count is

\sum_{k=1}^{d} r_{k-1} \, m_k n_k \, r_k = O(d n r^2)

whereas direct parameterization requires $O(MN)$ parameters. Compression of up to 95% is typical, with negligible loss in accuracy.
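A quick parameter count makes the compression concrete; the layer size, factorizations, and ranks below are assumptions chosen for illustration, not values from the paper:

```python
# Parameter count for a TT-factored 1024 x 1024 fully connected layer.
# The factorizations and ranks here are assumptions chosen for this example.
m = [4, 4, 4, 4, 4]            # input factorization:  M = 4**5 = 1024
n = [4, 4, 4, 4, 4]            # output factorization: N = 4**5 = 1024
ranks = [1, 4, 4, 4, 4, 1]     # TT-ranks r_0, ..., r_d with r_0 = r_d = 1

# sum_k r_{k-1} * m_k * n_k * r_k, versus the dense M * N count
tt_params = sum(ranks[k] * m[k] * n[k] * ranks[k + 1] for k in range(len(m)))
dense_params = 1024 * 1024

print(tt_params, dense_params)           # 896 vs 1048576
print(f"compression: {1 - tt_params / dense_params:.2%}")
```

With these (deliberately aggressive) ranks the compression exceeds 99%; milder ranks land in the ~95% regime cited above.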

2. Alternating Least Squares Optimization in TT-MLP

Training TT-MLP layers employs an Alternating Least Squares (ALS) scheme, updating each TT-core in turn while holding the others fixed. For a batch of input features $\Phi(x)$ and targets $y$, the model constructs an output by contracting cores and features, then optimizes each core via regularized least squares:

\hat{\theta}_k = (P_k^\top P_k + \lambda M L_k^\top L_k)^{-1} P_k^\top y

where $P_k$ encodes the product of "environment" matrices and local features, and $L_k$ enforces an $\ell_2$ (Tikhonov) penalty. Each update is efficient, typically requiring only a small number of sweeps for convergence, and is notably robust to random initialization (Costa et al., 2021).
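Once $P_k$ is formed, the core update is a ridge-regression solve. A minimal sketch, assuming $P_k$ has already been constructed (the environment-contraction step is omitted) and defaulting $L_k$ to the identity:

```python
import numpy as np

def als_core_update(P_k, y, lam=1e-3, L_k=None):
    """Regularized least-squares update for one (flattened) TT-core.

    P_k : (M, p) design matrix contracting the fixed "environment" cores
          with the local features of core k (its construction is omitted here).
    y   : (M,) batch targets; lam : Tikhonov strength;
    L_k : penalty matrix (identity by default, i.e. a plain ridge update).
    """
    M, p = P_k.shape
    if L_k is None:
        L_k = np.eye(p)
    A = P_k.T @ P_k + lam * M * (L_k.T @ L_k)
    return np.linalg.solve(A, P_k.T @ y)

# Sanity check: with tiny regularization the update recovers a known core.
rng = np.random.default_rng(0)
P = rng.normal(size=(200, 8))
theta_true = rng.normal(size=8)
theta_hat = als_core_update(P, P @ theta_true, lam=1e-10)
print(np.allclose(theta_hat, theta_true, atol=1e-4))   # True
```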

3. TTT-MLP: Test-Time Training with MLP Hidden State

In the "Test-Time Training" (TTT-MLP) variant, the hidden state at each sequence step is the set of weights of a two-layer MLP:

W_t = \{\, W_t^1 \in \mathbb{R}^{h \times d},\; b_t^1 \in \mathbb{R}^h,\; W_t^2 \in \mathbb{R}^{d \times h},\; b_t^2 \in \mathbb{R}^d \,\}

with $h = 4d$ (the hidden width is typically four times the input dimension). The output is

f(x; W_t) = x + \mathrm{LN}\big( W_t^2 \, \sigma(W_t^1 x + b_t^1) + b_t^2 \big)

where $\mathrm{LN}$ is LayerNorm, $\sigma$ a nonlinearity, and the outer term a residual connection (Sun et al., 2024). At each token, the hidden MLP is fine-tuned via self-supervised gradient descent on a per-token (or mini-batch) reconstruction loss.
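A minimal sketch of the forward map $f(x; W_t)$ in NumPy; the GELU choice for $\sigma$, the parameter-free LayerNorm, and the random initialization are illustrative details assumed for the example:

```python
import numpy as np

def layer_norm(v, eps=1e-5):
    # Parameter-free LayerNorm over the feature dimension.
    return (v - v.mean()) / np.sqrt(v.var() + eps)

def gelu(h):
    # tanh approximation of GELU, used here as the nonlinearity sigma
    return 0.5 * h * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (h + 0.044715 * h**3)))

def ttt_mlp_forward(x, W1, b1, W2, b2):
    """f(x; W_t) = x + LN(W2 @ sigma(W1 @ x + b1) + b2)."""
    return x + layer_norm(W2 @ gelu(W1 @ x + b1) + b2)

d = 8                                          # toy input dimension
rng = np.random.default_rng(1)
W1, b1 = 0.1 * rng.normal(size=(4 * d, d)), np.zeros(4 * d)   # h = 4d
W2, b2 = 0.1 * rng.normal(size=(d, 4 * d)), np.zeros(d)
out = ttt_mlp_forward(rng.normal(size=d), W1, b1, W2, b2)
print(out.shape)                               # (8,)
```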

The self-supervised loss is constructed by projecting the current token $x_t$ into distinct "training," "label," and "test" views via learned linear maps $\theta_K, \theta_V, \theta_Q$:

  • Training view: $\tilde{x}_t = \theta_K x_t$
  • Label view: $y_t = \theta_V x_t$
  • Test view: $z_t = f(\theta_Q x_t; W_t)$

The loss is $\ell(W; x_t) = \| f(\tilde{x}_t; W) - y_t \|_2^2$.

A single step of gradient descent updates the hidden MLP at time $t$:

W_t = W_{t-1} - \eta \, \nabla_W \, \ell(W_{t-1}; x_t)

with the step size $\eta$ possibly learned (Sun et al., 2024).
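Putting the views, loss, and update together, one TTT step might be sketched as follows. Gradients are taken by finite differences purely to keep the example dependency-free (real implementations use autodiff), ReLU stands in for $\sigma$, and the test view $\theta_Q$ is applied separately after the update to produce the layer output, so it does not appear here:

```python
import numpy as np

def ttt_step(x_t, W, theta_K, theta_V, eta=0.01, eps=1e-5):
    """One test-time training step on token x_t for W = (W1, b1, W2, b2)."""
    x_train = theta_K @ x_t                    # "training" view
    y = theta_V @ x_t                          # "label" view

    def f(params, x):
        W1, b1, W2, b2 = params
        h = np.maximum(W1 @ x + b1, 0.0)       # ReLU stand-in for sigma
        z = W2 @ h + b2
        z = (z - z.mean()) / np.sqrt(z.var() + eps)   # LayerNorm
        return x + z                           # residual connection

    def loss(params):                          # ||f(x~_t; W) - y_t||^2
        return float(np.sum((f(params, x_train) - y) ** 2))

    new_W = []
    for i, p in enumerate(W):                  # finite-difference gradient
        g = np.zeros_like(p)
        for mi in np.ndindex(p.shape):
            dp = np.zeros_like(p); dp[mi] = 1e-5
            up = list(W); up[i] = p + dp
            dn = list(W); dn[i] = p - dp
            g[mi] = (loss(up) - loss(dn)) / 2e-5
        new_W.append(p - eta * g)              # W_t = W_{t-1} - eta * grad
    return tuple(new_W)

d = 4
rng = np.random.default_rng(2)
W = (0.1 * rng.normal(size=(4 * d, d)), np.zeros(4 * d),
     0.1 * rng.normal(size=(d, 4 * d)), np.zeros(d))
W_new = ttt_step(rng.normal(size=d), W,
                 theta_K=rng.normal(size=(d, d)), theta_V=rng.normal(size=(d, d)))
print(W_new[0].shape)                          # (16, 4)
```

Each token produces a new weight set $W_t$, which is then used both for that token's output and as the starting point for the next update.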

4. Computational Properties and Practical Considerations

The per-token cost for both forward and backward passes in TTT-MLP is $O(d^2)$ (since $h = 4d$). For a context of length $N$, total computation is $O(N d^2)$, achieving linear complexity in $N$, which is key for long-sequence modeling. Unlike Transformers, which require $O(N^2 d)$ due to self-attention, TTT-MLP stores only a fixed-size set of MLP weights rather than the entire token history.
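A back-of-envelope comparison of the two scalings; the width $d$ and context lengths $N$ below are assumptions chosen for illustration, and constant factors are ignored:

```python
# Asymptotic cost comparison: TTT-MLP is O(N d^2), attention is O(N^2 d),
# so their ratio simplifies to d / N and favors TTT once N exceeds d.
d = 2048
ratios = {}
for N in (1_024, 32_768, 262_144):
    ttt_cost = N * d * d          # O(N d^2): linear in context length
    attn_cost = N * N * d         # O(N^2 d): full self-attention
    ratios[N] = ttt_cost / attn_cost
    print(f"N={N:>7}: TTT / attention cost ratio = {ratios[N]:.4f}")
```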

For TT-MLP (the compressed MLP), each ALS parameter update has cost $O(M (S_k r_{k-1} r_k)^2 + d S r^2)$ per core, with only a few sweeps required for convergence, in contrast to the hundreds of epochs needed for standard MLPs (Costa et al., 2021).

Memory I/O is a limiting factor in TTT-MLP due to the $d \times d$-scale weight and gradient matrix updates per token. Techniques such as dual-form mini-batch computation reduce explicit gradient materialization, but bandwidth remains a constraint for large $d$ (Sun et al., 2024).

5. Empirical Evaluation and Performance Benchmarking

TT-MLP (Tensor-Train) has been tested on nonlinear regression (Mackey-Glass time series) and NASDAQ stock forecasting. Models with as few as 24–90 TT parameters match or outperform standard MLPs with comparable parameter counts. For example, in short-term time-series forecasting, TT-MLP matches or exceeds the best MLP test scores while converging in 2–10 sweeps (vs. 150–250 epochs) and exhibiting stability across random initializations, with standard deviation $< 10^{-3}$ (compared to $\sim 10^{-2}$ for MLPs) (Costa et al., 2021).

TTT-MLP (Test-Time Training) has demonstrated strong performance on language modeling at scale (125M–1.3B parameters). In long contexts (up to 32K tokens), TTT-MLP outperforms Mamba by 0.5–1.5 perplexity points and shows comparable or better scaling than Transformers in this regime. The parameter overhead is modest, concentrated in the additional view-projection and learning-rate modules (Sun et al., 2024).

6. Comparative Analysis and Limitations

Architecture | Context Complexity | Hidden State | Adaptivity
Transformer | $O(N^2 d)$ | Memoryless (full token history) | None
Mamba-RNN | $O(N d)$ | Fixed-size vector | None (fixed recurrence)
TTT-MLP | $O(N d^2)$ | MLP weights (learned online) | Test-time GD updates
TT-MLP (ALS) | N/A (feedforward) | N/A | ALS per dataset

A principal advantage of TTT-MLP is its ability to update the hidden-state MLP online, thereby retaining expressive capacity over long sequences, where fixed-state RNNs such as Mamba lose information. Versus Transformers, TTT-MLP is significantly more efficient for long contexts but incurs extra per-token computation and is limited by I/O. TT-MLP (ALS) provides extreme parameter compression with robust and rapid convergence, particularly useful in over-parameterized and resource-constrained settings.

Open questions include addressing the I/O bottleneck in test-time training and extending the model to richer objectives or architectures (deeper MLPs, convolutional inner models). There is ongoing work to explore greater systems-level optimization and scaling to even longer contexts or more complex modalities (Sun et al., 2024).

7. Future Prospects and Research Directions

Both TT-MLP frameworks, while targeting distinct problem spaces (parameter efficiency and long-context adaptation), highlight the utility of integrating tensor-based and meta-learning-inspired approaches with classical deep networks. Anticipated directions include:

  • Fused kernel and pipeline-parallel system co-design to mitigate memory bandwidth constraints in TTT-MLP.
  • Generalization of TTT-MLP with alternative self-supervised objectives, nested or hierarchical models, and extensions to modalities such as video (e.g., convolutional inner models).
  • Broader adoption of TT-decompositions (as in TT-MLP) for compressing and regularizing other network architectures beyond standard MLPs, especially in settings with strict resource budgets. This suggests a trend toward architectural hybrids that combine low-rank representations and "learning-to-learn" dynamics for scalable, robust deep learning (Costa et al., 2021, Sun et al., 2024).