
TTT-MLP: Test-Time Training & Tensor-Train MLP

Updated 25 February 2026
  • TTT-MLP is a dual framework combining Tensor-Train MLP for 95% parameter compression with a test-time trainable MLP hidden state for dynamic sequence modeling.
  • The Tensor-Train variant employs Alternating Least Squares to update low-order tensor cores, achieving rapid convergence and robust performance.
  • The test-time training MLP adapts per token using self-supervised gradient updates, providing linear computational complexity ideal for long-context tasks.

TTT-MLP refers to two distinct but technically significant architectures that leverage multilayer perceptrons (MLPs) in unconventional ways: (1) as a highly parameter-efficient variant of classical MLPs via Tensor-Train decomposition (Costa et al., 2021), and (2) as a test-time trainable RNN layer whose hidden state itself consists of a two-layer MLP, optimized by self-supervised learning during inference (Sun et al., 2024). Both interpretations address key deficiencies in mainstream deep learning frameworks: the former targets storage and robustness, the latter, expressiveness and scalability for long-context sequence modeling.

1. Tensor-Train MLP: Foundations and Methodology

The Tensor-Train MLP (TT-MLP) is based on the Tensor-Train (TT) decomposition, which represents a high-dimensional tensor as a sequential product of low-order tensor "cores." For a $d$-th-order tensor $W \in \mathbb{R}^{n_1 \times n_2 \times \cdots \times n_d}$, each entry is expressed as

W[i_1, i_2, \dots, i_d] = \sum_{\alpha_0=1}^{r_0} \sum_{\alpha_1=1}^{r_1} \cdots \sum_{\alpha_d=1}^{r_d} G_1[\alpha_0, i_1, \alpha_1] \, G_2[\alpha_1, i_2, \alpha_2] \cdots G_d[\alpha_{d-1}, i_d, \alpha_d]

where each core $G_k$ is a third-order tensor of shape $r_{k-1} \times n_k \times r_k$, with $r_0 = r_d = 1$ and $\{r_k\}$ the TT-ranks (Costa et al., 2021).
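As an illustration, the entry formula amounts to contracting the cores from left to right; the sketch below (NumPy, with toy shapes chosen for the example) does exactly this:

```python
import numpy as np

def tt_entry(cores, idx):
    """Evaluate one entry W[i1,...,id] of a tensor stored in TT format.

    cores: list of 3-D arrays G_k of shape (r_{k-1}, n_k, r_k), r_0 = r_d = 1.
    idx:   index tuple (i1, ..., id).
    """
    v = np.ones((1, 1))                  # trivial left boundary (r_0 = 1)
    for G, i in zip(cores, idx):
        v = v @ G[:, i, :]               # (1, r_{k-1}) @ (r_{k-1}, r_k)
    return float(v[0, 0])                # r_d = 1, so a scalar remains

# Toy check: a rank-1 TT of an outer product of a and b gives a[i] * b[j].
a, b = np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0])
cores = [a.reshape(1, 3, 1), b.reshape(1, 2, 1)]
print(tt_entry(cores, (2, 1)))           # 3 * 5 = 15.0
```

Because each contraction only involves one slice of one core, evaluating an entry costs $O(d r^2)$ rather than materializing the full tensor.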

To compress an MLP's weights, the $M \times N$ fully connected weight matrix is reshaped into a $2d$-th-order tensor, with $M = \prod_k m_k$ and $N = \prod_k n_k$, and approximated in TT form. The resulting parameter count is

\sum_{k=1}^{d} r_{k-1} \, m_k n_k \, r_k = O(d n r^2)

whereas direct parameterization requires $O(MN)$ parameters. Compression of up to 95% is typical, with negligible loss in accuracy.
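A quick parameter count makes the compression concrete; the layer size, factorizations, and ranks below are assumptions chosen for illustration, not values from the paper:

```python
# Parameter count for a TT-factored 1024 x 1024 fully connected layer.
# The factorizations and ranks here are assumptions chosen for this example.
m = [4, 4, 4, 4, 4]            # input factorization:  M = 4**5 = 1024
n = [4, 4, 4, 4, 4]            # output factorization: N = 4**5 = 1024
ranks = [1, 4, 4, 4, 4, 1]     # TT-ranks r_0, ..., r_d with r_0 = r_d = 1

# sum_k r_{k-1} * m_k * n_k * r_k, versus the dense M * N count
tt_params = sum(ranks[k] * m[k] * n[k] * ranks[k + 1] for k in range(len(m)))
dense_params = 1024 * 1024

print(tt_params, dense_params)           # 896 vs 1048576
print(f"compression: {1 - tt_params / dense_params:.2%}")
```

With these (deliberately aggressive) ranks the compression exceeds 99%; milder ranks land in the ~95% regime cited above.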

2. Alternating Least Squares Optimization in TT-MLP

Training TT-MLP layers employs an Alternating Least Squares (ALS) scheme, updating each TT-core in turn while holding the others fixed. For a batch of input features $\Phi(x)$ and targets $y$, the model constructs an output by contracting cores and features, then optimizes each core via regularized least squares:

\hat{\theta}_k = (P_k^\top P_k + \lambda M L_k^\top L_k)^{-1} P_k^\top y

where $P_k$ encodes the product of "environment" matrices and local features, and $L_k$ enforces an $\ell_2$ (Tikhonov) penalty. Each update is efficient, typically requiring only a small number of sweeps for convergence, and is notably robust to random initialization (Costa et al., 2021).
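Once $P_k$ is formed, the core update is a ridge-regression solve. A minimal sketch, assuming $P_k$ has already been constructed (the environment-contraction step is omitted) and defaulting $L_k$ to the identity:

```python
import numpy as np

def als_core_update(P_k, y, lam=1e-3, L_k=None):
    """Regularized least-squares update for one (flattened) TT-core.

    P_k : (M, p) design matrix contracting the fixed "environment" cores
          with the local features of core k (its construction is omitted here).
    y   : (M,) batch targets; lam : Tikhonov strength;
    L_k : penalty matrix (identity by default, i.e. a plain ridge update).
    """
    M, p = P_k.shape
    if L_k is None:
        L_k = np.eye(p)
    A = P_k.T @ P_k + lam * M * (L_k.T @ L_k)
    return np.linalg.solve(A, P_k.T @ y)

# Sanity check: with tiny regularization the update recovers a known core.
rng = np.random.default_rng(0)
P = rng.normal(size=(200, 8))
theta_true = rng.normal(size=8)
theta_hat = als_core_update(P, P @ theta_true, lam=1e-10)
print(np.allclose(theta_hat, theta_true, atol=1e-4))   # True
```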

3. TTT-MLP: Test-Time Training with MLP Hidden State

In the "Test-Time Training" (TTT-MLP) variant, the hidden state at each sequence step is the set of weights of a two-layer MLP:

W_t = \{\, W_t^1 \in \mathbb{R}^{h \times d},\; b_t^1 \in \mathbb{R}^h,\; W_t^2 \in \mathbb{R}^{d \times h},\; b_t^2 \in \mathbb{R}^d \,\}

with $h = 4d$ (the hidden width is typically four times the input dimension). The output is

f(x; W_t) = x + \mathrm{LN}\big( W_t^2 \, \sigma(W_t^1 x + b_t^1) + b_t^2 \big)

where $\mathrm{LN}$ is LayerNorm, $\sigma$ a nonlinearity, and the outer term a residual connection (Sun et al., 2024). At each token, the hidden MLP is fine-tuned via self-supervised gradient descent on a per-token (or mini-batch) reconstruction loss.
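A minimal sketch of the forward map $f(x; W_t)$ in NumPy; the GELU choice for $\sigma$, the parameter-free LayerNorm, and the random initialization are illustrative details assumed for the example:

```python
import numpy as np

def layer_norm(v, eps=1e-5):
    # Parameter-free LayerNorm over the feature dimension.
    return (v - v.mean()) / np.sqrt(v.var() + eps)

def gelu(h):
    # tanh approximation of GELU, used here as the nonlinearity sigma
    return 0.5 * h * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (h + 0.044715 * h**3)))

def ttt_mlp_forward(x, W1, b1, W2, b2):
    """f(x; W_t) = x + LN(W2 @ sigma(W1 @ x + b1) + b2)."""
    return x + layer_norm(W2 @ gelu(W1 @ x + b1) + b2)

d = 8                                          # toy input dimension
rng = np.random.default_rng(1)
W1, b1 = 0.1 * rng.normal(size=(4 * d, d)), np.zeros(4 * d)   # h = 4d
W2, b2 = 0.1 * rng.normal(size=(d, 4 * d)), np.zeros(d)
out = ttt_mlp_forward(rng.normal(size=d), W1, b1, W2, b2)
print(out.shape)                               # (8,)
```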

The self-supervised loss is constructed by projecting the current token $x_t$ into distinct "training," "label," and "test" views via learned linear maps $\theta_K, \theta_V, \theta_Q$:

  • Training view: $\tilde{x}_t = \theta_K x_t$
  • Label view: $y_t = \theta_V x_t$
  • Test view: $z_t = f(\theta_Q x_t; W_t)$

The loss is $\ell(W; x_t) = \| f(\tilde{x}_t; W) - y_t \|_2^2$.

A single step of gradient descent updates the hidden MLP at time $t$:

W_t = W_{t-1} - \eta \, \nabla_W \, \ell(W_{t-1}; x_t)

with the step size $\eta$ possibly learned (Sun et al., 2024).
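Putting the views, loss, and update together, one TTT step might be sketched as follows. Gradients are taken by finite differences purely to keep the example dependency-free (real implementations use autodiff), ReLU stands in for $\sigma$, and the test view $\theta_Q$ is applied separately after the update to produce the layer output, so it does not appear here:

```python
import numpy as np

def ttt_step(x_t, W, theta_K, theta_V, eta=0.01, eps=1e-5):
    """One test-time training step on token x_t for W = (W1, b1, W2, b2)."""
    x_train = theta_K @ x_t                    # "training" view
    y = theta_V @ x_t                          # "label" view

    def f(params, x):
        W1, b1, W2, b2 = params
        h = np.maximum(W1 @ x + b1, 0.0)       # ReLU stand-in for sigma
        z = W2 @ h + b2
        z = (z - z.mean()) / np.sqrt(z.var() + eps)   # LayerNorm
        return x + z                           # residual connection

    def loss(params):                          # ||f(x~_t; W) - y_t||^2
        return float(np.sum((f(params, x_train) - y) ** 2))

    new_W = []
    for i, p in enumerate(W):                  # finite-difference gradient
        g = np.zeros_like(p)
        for mi in np.ndindex(p.shape):
            dp = np.zeros_like(p); dp[mi] = 1e-5
            up = list(W); up[i] = p + dp
            dn = list(W); dn[i] = p - dp
            g[mi] = (loss(up) - loss(dn)) / 2e-5
        new_W.append(p - eta * g)              # W_t = W_{t-1} - eta * grad
    return tuple(new_W)

d = 4
rng = np.random.default_rng(2)
W = (0.1 * rng.normal(size=(4 * d, d)), np.zeros(4 * d),
     0.1 * rng.normal(size=(d, 4 * d)), np.zeros(d))
W_new = ttt_step(rng.normal(size=d), W,
                 theta_K=rng.normal(size=(d, d)), theta_V=rng.normal(size=(d, d)))
print(W_new[0].shape)                          # (16, 4)
```

Each token produces a new weight set $W_t$, which is then used both for that token's output and as the starting point for the next update.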

4. Computational Properties and Practical Considerations

The per-token cost for both forward and backward passes in TTT-MLP is $O(d^2)$ (since $h = 4d$). For a context of length $N$, total computation is $O(N d^2)$, achieving linear complexity in $N$, which is key for long-sequence modeling. Unlike Transformers, which require $O(N^2 d)$ due to self-attention, TTT-MLP stores only a fixed-size set of MLP weights rather than the entire token history.
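A back-of-envelope comparison of the two scalings; the width $d$ and context lengths $N$ below are assumptions chosen for illustration, and constant factors are ignored:

```python
# Asymptotic cost comparison: TTT-MLP is O(N d^2), attention is O(N^2 d),
# so their ratio simplifies to d / N and favors TTT once N exceeds d.
d = 2048
ratios = {}
for N in (1_024, 32_768, 262_144):
    ttt_cost = N * d * d          # O(N d^2): linear in context length
    attn_cost = N * N * d         # O(N^2 d): full self-attention
    ratios[N] = ttt_cost / attn_cost
    print(f"N={N:>7}: TTT / attention cost ratio = {ratios[N]:.4f}")
```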

For TT-MLP (the compressed MLP), each ALS parameter update has cost $O(M (S_k r_{k-1} r_k)^2 + d S r^2)$ per core, with only a few sweeps required for convergence, in contrast to the hundreds of epochs needed for standard MLPs (Costa et al., 2021).

Memory I/O is a limiting factor in TTT-MLP due to the $d \times d$-scale weight and gradient matrix updates per token. Techniques such as dual-form mini-batch computation reduce explicit gradient materialization, but bandwidth remains a constraint for large $d$ (Sun et al., 2024).

5. Empirical Evaluation and Performance Benchmarking

TT-MLP (Tensor-Train) has been tested on nonlinear regression (Mackey-Glass time series) and NASDAQ stock forecasting. Models with as few as 24–90 TT parameters match or outperform standard MLPs with comparable parameter counts. For example, in short-term time-series forecasting, TT-MLP matches or exceeds the best MLP test scores while converging in 2–10 sweeps (vs. 150–250 epochs) and exhibiting stability across random initializations, with standard deviation $< 10^{-3}$ (compared to $\sim 10^{-2}$ for MLPs) (Costa et al., 2021).

TTT-MLP (Test-Time Training) has demonstrated strong performance on language modeling at scale (125M–1.3B parameters). In long contexts (up to 32K tokens), TTT-MLP outperforms Mamba by 0.5–1.5 perplexity points and shows comparable or better scaling than Transformers in this regime. The parameter overhead is modest, concentrated in the additional view-projection and learning-rate modules (Sun et al., 2024).

6. Comparative Analysis and Limitations

Architecture | Context Complexity | Hidden State | Adaptivity
Transformer | $O(N^2 d)$ | Memoryless (full token history) | None
Mamba-RNN | $O(N d)$ | Fixed-size vector | None (fixed recurrence)
TTT-MLP | $O(N d^2)$ | MLP weights (learned online) | Test-time GD updates
TT-MLP (ALS) | N/A (feedforward) | N/A | ALS per dataset

A principal advantage of TTT-MLP is its ability to update the hidden-state MLP online, thereby retaining expressive capacity over long sequences, where fixed-state RNNs such as Mamba lose information. Versus Transformers, TTT-MLP is significantly more efficient for long contexts but incurs extra per-token computation and is limited by I/O. TT-MLP (ALS) provides extreme parameter compression with robust and rapid convergence, particularly useful in over-parameterized and resource-constrained settings.

Open questions include addressing the I/O bottleneck in test-time training and extending the model to richer objectives or architectures (deeper MLPs, convolutional inner models). There is ongoing work to explore greater systems-level optimization and scaling to even longer contexts or more complex modalities (Sun et al., 2024).

7. Future Prospects and Research Directions

Both TT-MLP frameworks, while targeting distinct problem spaces (parameter efficiency and long-context adaptation), highlight the utility of integrating tensor-based and meta-learning-inspired approaches with classical deep networks. Anticipated directions include:

  • Fused kernel and pipeline-parallel system co-design to mitigate memory bandwidth constraints in TTT-MLP.
  • Generalization of TTT-MLP with alternative self-supervised objectives, nested or hierarchical models, and extensions to modalities such as video (e.g., convolutional inner models).
  • Broader adoption of TT-decompositions (as in TT-MLP) for compressing and regularizing other network architectures beyond standard MLPs, especially in settings with strict resource budgets. This suggests a trend toward architectural hybrids that combine low-rank representations and "learning-to-learn" dynamics for scalable, robust deep learning (Costa et al., 2021, Sun et al., 2024).