TriMLP: Efficient MLP for Sequential Recommendation
- TriMLP is an MLP-like architecture for sequential recommendation, employing a triangular mixer with masked global and local kernels to enforce chronological token mixing.
- It organizes token-mixing into global and local branches, effectively capturing both long- and short-term user preferences without relying on self-attention or convolutions.
- TriMLP demonstrates an 83.5% reduction in MACs, 84% fewer parameters, and superior HR@5, improving the accuracy/efficiency trade-off across diverse benchmarks.
TriMLP is an MLP-like architecture designed for sequential recommendation tasks, featuring a novel Triangular Mixer mechanism for efficient cross-token communication and rigorous prevention of information leakage under auto-regressive training. By imposing structured triangular masks on token-mixing weights and organizing mixing operations into global and local branches, TriMLP excavates long- and short-term user preference patterns without the complexity of self-attention or convolution, achieving a superior accuracy/efficiency trade-off across diverse benchmarks (Jiang et al., 2023).
1. Architecture and Layer-Wise Design
TriMLP processes a user's historical interaction sequence of length $n$, padded or truncated as necessary. The layer-wise workflow is:
- Embedding Layer: Each item token is mapped to a $d$-dimensional vector, yielding the embedded sequence $X \in \mathbb{R}^{n \times d}$.
- Triangular Mixer: Two alternating MLP-style token-mixing layers:
- Global Mixing: Captures long-range dependencies across the entire sequence.
- Local Mixing: Extracts short-range patterns within fixed-length sub-sessions.
- Classifier: A linear output layer followed by softmax over the item vocabulary for every time step.
- Training Mode: Auto-regressive protocol, with the input at positions $1,\dots,n-1$ used to predict the items at positions $2,\dots,n$.
The forward pipeline (omitting the batch dimension) is shown below; a minimal module-level sketch follows the diagram.

```text
x (length n) → Embedding → X (n×d) → Triangular Mixer → Y (n×d)
                                                           ↓
                                        Linear + Softmax → P (n×|I|)
```
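As a concrete companion to the pipeline above, here is a minimal PyTorch sketch of the three stages. It is illustrative rather than the authors' implementation: the names `TriMLPRecommender` and `num_items`, and the use of index 0 for padding, are assumptions, and `nn.Identity()` stands in for the Triangular Mixer detailed in Sections 2 and 6.

```python
import torch
import torch.nn as nn

class TriMLPRecommender(nn.Module):
    """Embedding -> token mixer -> per-step classifier (layer-wise design above)."""
    def __init__(self, num_items: int, n: int, d: int, mixer: nn.Module):
        super().__init__()
        self.embedding = nn.Embedding(num_items + 1, d, padding_idx=0)  # index 0 = padding (assumed)
        self.mixer = mixer                       # Triangular Mixer (Sections 2 and 6)
        self.classifier = nn.Linear(d, num_items + 1)

    def forward(self, x):                        # x: [B, n] item indices
        e = self.embedding(x)                    # [B, n, d]
        y = self.mixer(e)                        # [B, n, d]
        return self.classifier(y)                # [B, n, |I|] logits; softmax applied at loss/inference time

# Shape walkthrough with a placeholder mixer standing in for the Triangular Mixer:
model = TriMLPRecommender(num_items=1000, n=64, d=128, mixer=nn.Identity())
logits = model(torch.randint(1, 1001, (4, 64)))  # -> torch.Size([4, 64, 1001])
```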
2. Triangular Mixer Mechanism
The Triangular Mixer introduces two learnable kernels, a masked global and a masked local mixing matrix, whose masks systematically prevent anti-chronological (future-to-past) connections.
2.1 Masked Global Kernel
For sequential information preservation, the global mixing weight $W^{\text{glo}} \in \mathbb{R}^{n \times n}$ is masked to exclude contributions from future positions (entry $(i, j)$ connects input position $i$ to output position $j$):

$$
\tilde{W}^{\text{glo}}_{ij} =
\begin{cases}
W^{\text{glo}}_{ij}, & i \le j \\
-\infty, & i > j
\end{cases}
\qquad
M^{\text{glo}} = \operatorname{softmax}\big(\tilde{W}^{\text{glo}}\big)
$$

The resulting kernel enforces strictly chronological cross-token aggregation, with the column-wise softmax (over $i$ for each $j$) ensuring

$$
\sum_{i=1}^{n} M^{\text{glo}}_{ij} = 1, \qquad M^{\text{glo}}_{ij} = 0 \ \text{for}\ i > j.
$$
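As a quick illustration of the masked global kernel (a toy sketch, with random weights standing in for the learned $W^{\text{glo}}$), the snippet below verifies the two properties above: each column sums to one, and no weight flows from future positions.

```python
import torch

n = 6
W_glo = torch.randn(n, n)                            # toy stand-in for the learnable global weights
mask = torch.triu(torch.ones(n, n))                  # keep connections with i <= j only
M_glo = W_glo.masked_fill(mask == 0, float("-inf")).softmax(dim=0)  # column-wise softmax

print(M_glo.sum(dim=0))                              # each column sums to 1
print(M_glo.tril(diagonal=-1).abs().sum())           # 0: no weight from future positions (i > j)
```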
2.2 Masked Local Kernel
Local mixing focuses on intra-session relationships. Let $s$ be the number of sessions (each of length $l$), such that $n = s \cdot l$; positions $i$ and $j$ belong to the same session when $\lceil i/l \rceil = \lceil j/l \rceil$:

$$
\tilde{W}^{\text{loc}}_{ij} =
\begin{cases}
W^{\text{loc}}_{ij}, & i \le j \ \text{and}\ \lceil i/l \rceil = \lceil j/l \rceil \\
-\infty, & \text{otherwise}
\end{cases}
\qquad
M^{\text{loc}} = \operatorname{softmax}\big(\tilde{W}^{\text{loc}}\big)
$$

By applying the column-wise softmax within each session, local dependencies are captured without cross-session interference.
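A corresponding toy sketch of the local kernel; the session length `l = 3` is an arbitrary illustrative choice, and random weights stand in for the learned $W^{\text{loc}}$.

```python
import torch

n, l = 6, 3                                          # sequence length and session length
W_loc = torch.randn(n, n)                            # toy stand-in for the learnable local weights

session = torch.arange(n) // l                       # session index of each position
same_session = session[:, None] == session[None, :]
chrono = torch.triu(torch.ones(n, n)).bool()         # i <= j
mask = same_session & chrono                         # triangular blocks on the diagonal

M_loc = W_loc.masked_fill(~mask, float("-inf")).softmax(dim=0)
print(M_loc)                                         # nonzero only inside each session's triangle
print((M_loc * (~mask).float()).abs().sum())         # 0: no cross-session or anti-chronological weight
```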
3. Mathematical Formulation
Let $X \in \mathbb{R}^{n \times d}$ be the embedded sequence. The full mixing operations are:

$$
Y^{\text{glo}} = \sigma\big((M^{\text{glo}})^{\top} X\big), \qquad
Y = \sigma\big((M^{\text{loc}})^{\top} Y^{\text{glo}}\big)
$$

where $\sigma$ is a nonlinearity, e.g. GELU or SiLU, and all softmaxes are performed across the token dimension (column-wise over the kernels). This yields the compact formula:

$$
\operatorname{TriMixer}(X) = \sigma\Big((M^{\text{loc}})^{\top}\, \sigma\big((M^{\text{glo}})^{\top} X\big)\Big)
$$
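The following is a small numerical sanity check (a sketch under the formulation above, not the authors' code) that the composed mixer cannot leak future information: perturbing the last token leaves every earlier output unchanged. GELU is used for $\sigma$, and the kernels are built as in Section 2.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d, l = 8, 4, 4
s = n // l

def masked_kernel(W, mask):
    # Column-wise softmax over the chronologically valid entries only.
    return W.masked_fill(mask == 0, float("-inf")).softmax(dim=0)

chrono = torch.triu(torch.ones(n, n))
session = torch.arange(n) // l
blocks = (session[:, None] == session[None, :]).float()
M_glo = masked_kernel(torch.randn(n, n), chrono)
M_loc = masked_kernel(torch.randn(n, n), chrono * blocks)

def tri_mixer(X):
    # TriMixer(X) = sigma(M_loc^T · sigma(M_glo^T · X))
    return F.gelu(M_loc.T @ F.gelu(M_glo.T @ X))

X = torch.randn(n, d)
X_perturbed = X.clone()
X_perturbed[-1] += 10.0                              # change only the last (future-most) token

Y, Y_p = tri_mixer(X), tri_mixer(X_perturbed)
print(torch.allclose(Y[:-1], Y_p[:-1]))              # True: earlier outputs are unaffected
```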
4. Computational Complexity and Efficiency
TriMLP employs two masked token-mixing MLP layers, each requiring roughly $n^2 d$ multiply-accumulate operations (MACs). Relative to prevailing baselines:
| Model | MACs | Params | GPU Mem | Inf. Time | HR@5 |
|---|---|---|---|---|---|
| GRU4Rec | 13.04G | 0.19M | 2.5GB | 1.17 s | 0.1084 |
| NextItNet | 26.03G | 0.40M | 1.8GB | 1.67 s | 0.1277 |
| SASRec | 25.94G | 0.40M | 1.8GB | 2.50 s | 0.1172 |
| FMLP4Rec | 17.18G | 0.26M | 2.1GB | 2.29 s | 0.0632 |
| TriMLP | 2.15G | 0.03M | 1.4GB | 0.89 s | 0.1390 |
Compared with these baselines, TriMLP exhibits an 83.5% reduction in MACs, an 84% reduction in parameters, a 23% smaller GPU memory footprint, and 24% lower inference time, together with an 8.9% gain in HR@5 over the best-performing baseline (Jiang et al., 2023).
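For intuition only, here is a back-of-the-envelope MAC count for the mixer itself versus a single self-attention block at an illustrative size ($n = 64$, $d = 128$). It counts only the token-mixing matmuls (and, for attention, the Q/K/V/output projections), excludes the embedding and classifier, and is not meant to reproduce the end-to-end figures in the table.

```python
def mixer_macs(n: int, d: int) -> int:
    # Two masked token-mixing layers, each an (n x n) kernel applied to d channels.
    return 2 * n * n * d

def attention_macs(n: int, d: int) -> int:
    # One self-attention block: Q/K/V/output projections (4·n·d^2) plus QK^T and A·V (2·n^2·d).
    return 4 * n * d * d + 2 * n * n * d

n, d = 64, 128
print(mixer_macs(n, d))        # 1,048,576 MACs for the triangular mixer
print(attention_macs(n, d))    # 5,242,880 MACs for a single attention block
```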
5. Experimental Methodology and Performance
TriMLP was empirically validated on 12 datasets spanning Amazon, MovieLens, Tenrec, and LBSN, ranging from 50K to 10M interactions. Preprocessing excludes users with fewer than 20 interactions and items with fewer than 10, and all sequences are padded or truncated to a fixed length $n$ (range: 32–256). Evaluation metrics include HR@5/10 and NDCG@5/10, alongside averaged inference-time measurements.
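A minimal sketch of this preprocessing (single pass; the paper's exact protocol, e.g. iterative filtering or padding side, may differ). The thresholds follow the text above, while `n = 64` and `pad_id = 0` are illustrative assumptions.

```python
from collections import Counter

def preprocess(interactions, min_user=20, min_item=10, n=64, pad_id=0):
    """Filter sparse users/items and pad/truncate each user's sequence to length n."""
    user_cnt = Counter(u for u, _ in interactions)
    item_cnt = Counter(i for _, i in interactions)
    kept = [(u, i) for u, i in interactions
            if user_cnt[u] >= min_user and item_cnt[i] >= min_item]

    sequences = {}
    for u, i in kept:                                # interactions assumed chronologically ordered
        sequences.setdefault(u, []).append(i)

    padded = {}
    for u, seq in sequences.items():
        seq = seq[-n:]                               # keep the n most recent interactions
        padded[u] = [pad_id] * (n - len(seq)) + seq  # left-pad to the fixed length n
    return padded
```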
Hyperparameters: embedding dim = 128, mixer hidden dim = , dropout = 0.5, Adam optimizer with learning rate 0.001, and early stopping after 10 epochs without improvement. Experiments were conducted on a single NVIDIA RTX 3090 GPU.
Across datasets, TriMLP yields up to 14.88% higher HR@10/NDCG@10 over state-of-the-art baselines, while reducing average inference time by 8.65%. Example scale-wise results:
| Scale | SOTA HR@10 (model) | TriMLP HR@10 | HR@10 Δ | Inf. Time Δ |
|---|---|---|---|---|
| Tiny | 0.1202 (FMLP4Rec) | 0.1256 | +4.5% | −2.3% |
| Small | 0.0572 (SASRec) | 0.0611 | +6.8% | −0.4% |
| Base | 0.2111 (NextItNet) | 0.2343 | +11.1% | −6.7% |
| Large | 0.1907 (NextItNet) | 0.2027 | +6.3% | −23.7% |
6. Training Algorithm and Pseudocode
The core masking and mixing logic is encapsulated as follows:
```python
import torch
import torch.nn as nn

def generate_masks(n, l, s):
    # Upper-triangular mask: output position j may aggregate input positions i <= j only.
    M_G = torch.triu(torch.ones(n, n))
    # Block-diagonal mask confining connections to the same session.
    session_mask = torch.zeros(n, n)
    for i in range(s):
        session_mask[i * l:(i + 1) * l, i * l:(i + 1) * l] = 1
    M_L = M_G * session_mask                      # block-diagonal triangles
    return M_G, M_L                               # fixed masks (kept non-learnable)

class TriangularMixer(nn.Module):
    def __init__(self, n, l, s, act=nn.GELU):
        super().__init__()
        M_G, M_L = generate_masks(n, l, s)
        self.register_buffer("mask_G", M_G)       # masks stay fixed during training
        self.register_buffer("mask_L", M_L)
        self.W_G = nn.Parameter(torch.randn(n, n) * 0.02)   # learnable global kernel
        self.W_L = nn.Parameter(torch.randn(n, n) * 0.02)   # learnable local kernel
        self.act = act()

    def _kernel(self, W, mask):
        # Column-wise softmax over the chronologically valid entries only (Section 2).
        return W.masked_fill(mask == 0, float("-inf")).softmax(dim=0)

    def forward(self, X):                         # X: [B, n, d]
        X_T = X.permute(0, 2, 1)                  # [B, d, n]
        Y_G = self.act(X_T @ self._kernel(self.W_G, self.mask_G))   # global mixing
        Y_L = self.act(Y_G @ self._kernel(self.W_L, self.mask_L))   # local mixing
        return Y_L.permute(0, 2, 1)               # [B, n, d]
```
High-level training loop:
```text
for each batch of sequences X[1..n]:
    E = Embed(X)                        # [B, n, d]
    Y = TriangularMixer(E)              # [B, n, d]
    P = Softmax(Y · W + b)              # [B, n, |I|]
    loss = cross_entropy(P[:, :-1], targets[:, 1:])   # position t predicts the item at t+1
    backprop and update parameters
```
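For completeness, the pseudocode maps to runnable PyTorch roughly as follows (toy sizes, a placeholder mixer, and a single synthetic batch; all names and sizes here are illustrative, not the authors' code). Note that `nn.CrossEntropyLoss` consumes raw logits, so the explicit softmax above is folded into the loss.

```python
import torch
import torch.nn as nn

# Toy setup (illustrative sizes; the real model is the TriMLP stack sketched earlier).
num_items, n, d, B = 100, 16, 32, 8
embed = nn.Embedding(num_items + 1, d, padding_idx=0)
mixer = nn.Identity()                                # placeholder for the Triangular Mixer
head = nn.Linear(d, num_items + 1)

params = list(embed.parameters()) + list(head.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
criterion = nn.CrossEntropyLoss(ignore_index=0)      # ignore padded positions

x = torch.randint(1, num_items + 1, (B, n))          # one toy batch of item indices
logits = head(mixer(embed(x)))                       # [B, n, |I|]; softmax folded into the loss
# Auto-regressive shift: the prediction at position t is scored against the item at t+1.
loss = criterion(logits[:, :-1].reshape(-1, logits.size(-1)), x[:, 1:].reshape(-1))
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(loss.item())
```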
7. Properties, Limitations, and Prospective Extensions
Strengths:
- Restores chronological inductive bias to MLPs by triangular masking, eliminating information leakage.
- Two-branch global/local mixing structure enables modeling of both long- and short-term user preferences.
- Lightweight, memory-efficient, and highly parallelizable (no Q/K/V matrices or FFN structure).
- Omits explicit positional embeddings, relying entirely on masked mixing.
Limitations:
- Static mixing weights may underfit highly heterogeneous or irregular sequences due to input-agnostic design.
- Requires tuning of the session hyperparameters ($l$ or $s$) for optimal local mixing.
- Fixed input length $n$; adaptation to variable-length streams necessitates additional modification.
Future Directions:
- Incorporate temporal intervals or contextual information into the masking schema.
- Explore data augmentation and pre-training strategies for sparse domains.
- Investigate dynamic (input-adaptive) masking techniques to enhance representation flexibility.
- Develop methods for handling variable-length sequences without mask retraining.
TriMLP revitalizes the simplicity and efficiency of MLP architectures for sequential recommendation, achieving parity or superiority with state-of-the-art methods while delivering substantial improvements in computational efficiency (Jiang et al., 2023).