Time-Mixing MLPs: Efficient Temporal Modeling
- Time-mixing MLPs are deep learning architectures that use pure MLP blocks with explicit temporal mixing, replacing recurrences and self-attention.
- They employ techniques like masked, grouped, and patchwise mixing to efficiently capture both short-term and long-range dependencies in sequential data.
- These models excel in applications such as time series forecasting, sequential recommendation, and video recognition, providing competitive accuracy with reduced computational cost.
Time-mixing MLPs (multi-layer perceptrons) are a class of deep learning architectures designed to model dependencies across the temporal (sequence) dimension by leveraging pure MLP blocks instead of recurrent mechanisms or self-attention. These models reorganize mixing operations along the time axis to capture temporal relationships in sequential, time series, or spatiotemporal data, combining high efficiency with domain-appropriate inductive biases. Time-mixing MLPs have demonstrated efficacy across diverse domains, notably in sequential recommendation, large-scale traffic forecasting, multivariate time series prediction, and video recognition.
1. Core Design Patterns in Time-Mixing MLPs
Time-mixing MLPs are instantiated via a variety of architectural patterns, each characterized by explicit mixing along the temporal axis—typically via linear or nonlinear feedforward layers with weight sharing and masking strategies.
- MLP-mixer principle: The canonical approach, inherited from vision MLP-Mixer networks, decomposes mixing operations into two stages: one MLP operates along the "token" (time) axis, and another along the "channel" (feature) axis. In time-mixing variants, the temporal axis takes the role of "token" (Chen et al., 2023).
- Causal and masked mixing: Auto-regressive or causal modeling is enforced by explicitly masking weights in the temporal MLP to eliminate anti-chronological information flow. The TriMLP’s Triangular Mixer, for example, applies a strictly upper-triangular mask to the token-mixing MLP, followed by row-wise softmax to implement causal "attention-like" mixing (Jiang et al., 2023).
- Grouped or blockwise time mixing: To balance receptive field and computational load, inputs are often decomposed into temporal groups or patches, enabling both local and long-range temporal interactions, as in Grouped Time Mixing (GTM) (Qiu et al., 2022) or window aggregation (Wang et al., 26 Nov 2025, Zhang et al., 2024, Murad et al., 2024). Overlapping and hierarchical grouping further modulate the temporal receptive field.
- Specialized nonlinearities: Non-standard nonlinear transformations increase expressiveness; e.g., KAN-based time-mixing replaces MLPs with Kolmogorov–Arnold layers, learning per-coordinate spline-basis functions (Hong et al., 25 Feb 2025).
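The token-/channel-mixing decomposition underlying these patterns can be sketched in a few lines of NumPy (all shapes, names, and the tanh nonlinearity here are illustrative choices, not taken from any particular paper):

```python
import numpy as np

def mlp(x, w1, b1, w2, b2):
    """Two-layer feedforward; tanh stands in for the usual GELU."""
    return np.tanh(x @ w1 + b1) @ w2 + b2

def time_mixing_block(x, time_w, chan_w):
    """One Mixer-style block: mix along time, then along channels.

    x: (L, C) array -- L time steps, C channels/features.
    time_w / chan_w: (w1, b1, w2, b2) parameter tuples.
    """
    # Token (time) mixing: the MLP sees each channel's length-L history.
    x = x + mlp(x.T, *time_w).T   # residual connection
    # Channel (feature) mixing: the MLP sees each step's C features.
    x = x + mlp(x, *chan_w)       # residual connection
    return x

rng = np.random.default_rng(0)
L, C, H = 8, 4, 16
time_w = (0.1 * rng.normal(size=(L, H)), np.zeros(H),
          0.1 * rng.normal(size=(H, L)), np.zeros(L))
chan_w = (0.1 * rng.normal(size=(C, H)), np.zeros(H),
          0.1 * rng.normal(size=(H, C)), np.zeros(C))
out = time_mixing_block(rng.normal(size=(L, C)), time_w, chan_w)
print(out.shape)  # (8, 4)
```

The transpose before the first MLP is the whole trick: it turns a per-step feature transform into a per-channel transform over the full look-back window.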
Table 1 below summarizes selected representative time-mixing MLP architectures and their key design strategies.
| Model | Temporal Mixing Mechanism | Inductive Bias/Scope |
|---|---|---|
| TriMLP | Triangular mask + softmax | Causal, global + local |
| TSMixer | Linear MLP across time, residual | Global, linear/nonlinear |
| PreMixer | Patchwise MLP, patch-embedding | Patch-global, linear |
| HSTMixer | Windowed aggregation, hierarchical | Hierarchical, local/global |
| WPMixer | Wavelet decomposition + patch mixing | Multi-resolution, hierarchical |
| MLP-3D | Grouped Time Mixing (GTM) | Patch/block, video temporal |
| RPMixer | Blockwise random projections | Diversity, ensemble-like |
| TSKANMixer | KAN splines replacing MLP | Highly nonlinear, universal |
2. Mathematical Formulations and Operations
The essential operation in time-mixing MLPs is a fully connected layer applied along the temporal axis. In its most generic form, given an input $X \in \mathbb{R}^{L \times C}$ ($L$ the look-back length, $C$ the number of features), a single time-mixing MLP applies, per channel $c$:

$$Y_{:,c} = \sigma\left(W X_{:,c} + b\right), \qquad W \in \mathbb{R}^{L' \times L},$$

where $\sigma$ is an elementwise nonlinearity and the parameters $W, b$ are shared across channels.
Enhancements include:
- Masked Linear Mixing: Zeroing every weight $W_{ij}$ that would connect an output step to a later input step (a triangular mask, as in TriMLP) ensures causal processing (Jiang et al., 2023).
- Softmax Normalization: Applying a row-wise softmax to the masked weight matrix $W$ emulates attention-like probability mixing.
- Patchwise Mixing: Temporal windows or patches of fixed length $P$ are vectorized and embedded via learned linear (or two-layer MLP) projections, followed by mixing (Zhang et al., 2024, Murad et al., 2024).
- Hierarchical Aggregation and Upsampling: In HSTMixer, bottom-up windowed aggregation reduces temporal resolution (window size $w$), followed by top-down re-expansion through learned upsampling (Wang et al., 26 Nov 2025).
- Random Projections: RPMixer injects fixed, block-specific random matrices, projecting along the spatial (node) dimension to encourage diversity and ensemble effects (Yeh et al., 2024).
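The masked and softmax-normalized variants above can be combined in a short sketch (a hedged illustration in the spirit of TriMLP's triangular mixer, not its exact implementation; written in the convention Y = WX, under which the causal mask is lower-triangular — TriMLP's upper-triangular mask corresponds to the transposed convention):

```python
import numpy as np

def causal_softmax_mix(x, w):
    """Causally masked, softmax-normalized time mixing.

    x: (L, C) input; w: (L, L) learnable weights.
    Weights connecting step i to future inputs j > i are masked out,
    then each row is softmax-normalized, so every output is a convex
    combination of present-and-past time steps only.
    """
    L = w.shape[0]
    mask = np.tril(np.ones((L, L)))            # keep j <= i only
    w_masked = np.where(mask > 0, w, -np.inf)  # block future positions
    w_soft = np.exp(w_masked - w_masked.max(axis=1, keepdims=True))
    w_soft /= w_soft.sum(axis=1, keepdims=True)
    return w_soft @ x                          # mix along the time axis

rng = np.random.default_rng(1)
x = rng.normal(size=(5, 3))
y = causal_softmax_mix(x, rng.normal(size=(5, 5)))
# Step 0 can only see x[0], so the first output equals the first input.
print(np.allclose(y[0], x[0]))  # True
```

Because the softmax rows sum to one, the mixer behaves like a static, data-independent attention map over past time steps.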
In Table 2, these formulation types are mapped to representative operations and intended modeling effects.
| Mixing Technique | Equation/Process | Modeling Implication |
|---|---|---|
| Masked linear mixing | $Y = (M \odot W)X$, $M$ triangular mask | Enforces chronology (causality) |
| Patchwise embedding | $z_p = \mathrm{MLP}(\mathrm{vec}(X_p))$ per patch $p$ | Local context, efficient computation |
| Hierarchical agg./propagation | windowed pooling, then learned upsampling | Multi-resolution, compression/expansion |
| Random projections | $Y = RX$, $R$ fixed random matrix | Block diversity, ensemble averaging |
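The random-projection row can be illustrated as follows (a hedged sketch of the idea behind RPMixer's fixed per-block projections, not its actual implementation; all names and shapes are illustrative):

```python
import numpy as np

def random_projection_block(x, proj_dim, seed):
    """Fixed (non-learned) random projection along the spatial/node axis.

    x: (N, T) -- N nodes, T time steps. Each block draws its own fixed
    random matrix, so stacked blocks observe decorrelated views of the
    input, giving an ensemble-like diversity effect.
    """
    rng = np.random.default_rng(seed)  # per-block fixed seed
    r = rng.normal(size=(proj_dim, x.shape[0])) / np.sqrt(x.shape[0])
    return r @ x                       # (proj_dim, T)

x = np.ones((100, 24))
y1 = random_projection_block(x, proj_dim=16, seed=0)
y2 = random_projection_block(x, proj_dim=16, seed=1)
# Different blocks (seeds) project the same input differently.
print(y1.shape, np.allclose(y1, y2))
```

Since the projections are fixed at initialization, they add no trainable parameters; only the subsequent MLPs are learned.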
3. Variants and Hybridizations
Several time-mixing MLP models combine multiple mixing levels, adaptive parameterizations, and specialized nonlinearities to improve expressiveness and scalability.
- Alternating global/local mixing: TriMLP alternates full-sequence (global) and local window block mixing, using separate triangular masks to enable both long-range dependencies and short-term pattern capture. A two-stage serial application is empirically superior to any parallel or concatenated mixing (Jiang et al., 2023).
- Grouped Time Mixing (GTM): MLP-3D introduces four core GTM blocks (short-range, long-range, shift-window, shift-token), enabling tunable windows and strided temporal grouping; Toeplitz block parametric sharing is used for further parameter efficiency (Qiu et al., 2022).
- Hierarchical spatiotemporal mixing: HSTMixer employs a sequence of bottom-up (temporal aggregation) and top-down (temporal upsampling) mixers, combined with adaptive region-wise mixers for spatial specificity (Wang et al., 26 Nov 2025).
- Wavelet and multi-resolution: WPMixer decomposes each variable’s time series into wavelet sub-bands, mixes each at its temporal resolution with per-branch MLPs, and reconstructs by inverse wavelet transform (Murad et al., 2024).
- Nonlinear, universal mixing (KANs): TSKANMixer replaces MLP with Kolmogorov–Arnold networks, parameterizing each edge via B-splines; this structure deploys per-coordinate learnable non-linearities to capture intricate multi-lag interactions at scale, at the expense of higher computational cost (Hong et al., 25 Feb 2025).
- Pre-training enhanced mixers: PreMixer incorporates patchwise masked MLP pre-training for robust representation learning, combining reconstruction and contrastive loss in the patch-embedding stage, followed by global time-mixing in downstream forecasting (Zhang et al., 2024).
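Patchwise mixing, shared by several of the variants above (PreMixer, WPMixer), can be sketched as follows (shapes, names, and the non-overlapping patch scheme are illustrative assumptions):

```python
import numpy as np

def patchwise_mix(x, patch_len, embed_w, mix_w):
    """Patchwise time mixing: embed non-overlapping patches, then mix across them.

    x: (L, C) series with L divisible by patch_len.
    embed_w: (patch_len, D) projection from each patch to a D-dim embedding.
    mix_w: (P, P) weights mixing across the P = L // patch_len patches.
    """
    L, C = x.shape
    P = L // patch_len
    # Split the series into non-overlapping patches: (P, patch_len, C).
    patches = x.reshape(P, patch_len, C)
    # Embed each patch per channel: (P, D, C).
    z = np.einsum('ptc,td->pdc', patches, embed_w)
    # Mix information across patches (the "global" time mixing): (P, D, C).
    return np.einsum('pq,qdc->pdc', mix_w, z)

rng = np.random.default_rng(2)
x = rng.normal(size=(12, 2))                    # L=12, C=2
z = patchwise_mix(x, patch_len=4,
                  embed_w=rng.normal(size=(4, 8)),
                  mix_w=rng.normal(size=(3, 3)))
print(z.shape)  # (3, 8, 2)
```

Mixing over P patches instead of L raw steps is what drives the compute savings: the across-time weight matrix shrinks from L x L to P x P.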
4. Comparative Efficiency, Scalability, and Performance
Time-mixing MLP architectures are designed to offer competitive or superior prediction accuracy with substantially reduced computational complexity compared to attention-based (transformer) or convolutional models, particularly for long sequences.
- Parameter and compute complexity: TriMLP’s triangular mixer needs only two weight matrices totalling 0.03M parameters, compared to SASRec’s 0.4M parameters and 26G MACs per batch (Jiang et al., 2023). WPMixer achieves sub-linear GFLOPs relative to TimeMixer or self-attention thanks to patching and downsampling (Murad et al., 2024).
- Linear scaling: HSTMixer and PreMixer scale linearly in the number of nodes, time steps, and embedding dimension, versus the cost of full attention, which is quadratic in sequence length; HSTMixer achieves 4.5h end-to-end training on GBA (2352 nodes) versus 7–59h for transformer/GNN baselines (Wang et al., 26 Nov 2025, Zhang et al., 2024).
- Empirical benchmarks: On sequential recommendation, TriMLP achieves a mean 14.9% improvement in Hit Rate and NDCG over SASRec/NextItNet with 8.7% lower inference time. On large-scale traffic forecasting, HSTMixer outperforms 20 baselines by 4.41% in MAE/RMSE, and RPMixer surpasses graph neural network models on GBA (MAE 19.06 vs 20.71) and CA (MAE 17.50 vs 19.86 for the TSMixer baseline) (Jiang et al., 2023, Wang et al., 26 Nov 2025, Yeh et al., 2024).
- Trade-offs: Pure MLP time-mixing is highly efficient but may lack data-adaptive weighting of remote time steps, although hybridizations (e.g., adaptive region mixing, KANs, random projections) partially address this expressivity limitation.
5. Domain-Specific Adaptations and Applications
Time-mixing MLPs have been extensively deployed in:
- Time series forecasting: TSMixer, PreMixer, HSTMixer, WPMixer, and TSKANMixer operate on short/long univariate or multivariate time-series windows—using patching, global token mixing, hierarchical windows, or universal nonlinear mixing—consistently outperforming or matching transformer baselines at a fraction of the compute (Chen et al., 2023, Zhang et al., 2024, Wang et al., 26 Nov 2025, Murad et al., 2024, Hong et al., 25 Feb 2025).
- Sequential recommendation: TriMLP leverages causal triangular mixing to recover both user’s fine-grained item preferences and long-range behavioral trends under the standard auto-regressive paradigm (Jiang et al., 2023).
- Large-scale spatial-temporal prediction: HSTMixer and RPMixer apply adaptive region/time mixing and blockwise diversity via random projections to accurately predict urban traffic over city-scale sensor graphs (Wang et al., 26 Nov 2025, Yeh et al., 2024).
- Video recognition: Grouped Time Mixing enables 3D-MLP architectures (MLP-3D) to process video clips with reduced parameter count compared to 3D CNNs or transformers, leveraging blockwise/strided mixing to capture both short-term dynamics and inter-frame semantics; accuracy is competitive with SoTA vision models (Qiu et al., 2022).
6. Practical Considerations and Limitations
- Positional encoding: Masked or patchwise time mixing often obviates the need for explicit positional embeddings; TriMLP and some mixer models rely solely on mask-induced chronology (Jiang et al., 2023).
- Initialization: Proper initialization (e.g., "1-0 init" for triangular masks) and normalization (e.g., row-wise softmax, batch/layer norm) are critical for stable training (Jiang et al., 2023, Murad et al., 2024).
- Dropout and regularization: High dropout rates and residual connections help mitigate overfitting, particularly for uniformly shared time-mixing weights (Chen et al., 2023, Hong et al., 25 Feb 2025).
- Expressivity: Fixed weight time-mixing MLPs are inherently less expressive than data-dependent attention, although hybrid architectures (e.g., KANs, adaptive region mixers) increase representational power at increased cost (Hong et al., 25 Feb 2025).
- Efficiency–accuracy trade-off: For long look-back or large node counts, time-mixing MLPs outperform attention mechanisms computationally; however, for modalities demanding dynamic time-step relevance, attention may retain some advantage.
7. Future Directions and Research Frontiers
Research on time-mixing MLPs continues along several axes:
- Adaptive mixing mechanisms: Ongoing work explores model structures enabling dynamically learned attention-like mechanisms atop static MLP mixing, as exemplified by adaptive region mixers and spline-based nonlinearities (Wang et al., 26 Nov 2025, Hong et al., 25 Feb 2025).
- Multi-resolution and wavelet integration: Techniques combining multi-resolution signal processing (wavelet decomposition) with patch mixing are likely to enable even more scalable handling of ultra-long sequences (Murad et al., 2024).
- Hybridization with non-MLP modules: Random projections, ensemble structures, and KANs suggest that time-mixing MLPs can be further hybridized for efficiency–expressivity balance (Yeh et al., 2024, Hong et al., 25 Feb 2025).
- Application diversity: Time-mixing MLPs have demonstrated generality from traffic, industrial, and retail forecasting to video understanding and sequential recommendation, motivating further study into architecture adaptation for emerging temporal domains.
In summary, time-mixing MLPs constitute a highly active research area that extends the MLP-mixer paradigm to temporal and spatiotemporal data, offering attractive trade-offs of scalability, simplicity, and competitive accuracy via explicit, often masked or grouped, mixing along the time axis. Their design continues to evolve in response to domain-specific challenges and the need for modeling increasingly complex, large-scale sequential phenomena (Jiang et al., 2023, Chen et al., 2023, Hong et al., 25 Feb 2025, Wang et al., 26 Nov 2025, Qiu et al., 2022, Zhang et al., 2024, Murad et al., 2024, Yeh et al., 2024).