
TriMLP: Efficient MLP for Sequential Recommendation

Updated 18 December 2025
  • TriMLP is an MLP-like architecture for sequential recommendation, employing a triangular mixer with masked global and local kernels to enforce chronological token mixing.
  • It organizes token-mixing into global and local branches, effectively capturing both long- and short-term user preferences without relying on self-attention or convolutions.
  • TriMLP demonstrates an 83.5% reduction in MACs, 84% fewer parameters, and superior HR@5 performance, optimizing the accuracy-efficiency trade-off in diverse benchmarks.

TriMLP is an MLP-like architecture designed for sequential recommendation tasks, featuring a novel Triangular Mixer mechanism for efficient cross-token communication and rigorous prevention of information leakage under auto-regressive training. By imposing structured triangular masks on token-mixing weights and organizing mixing operations into global and local branches, TriMLP excavates long- and short-term user preference patterns without the complexity of self-attention or convolution, achieving a superior accuracy/efficiency trade-off across diverse benchmarks (Jiang et al., 2023).

1. Architecture and Layer-Wise Design

TriMLP processes a user's historical sequence $X$ of length $n$, padded or truncated as necessary. The layer-wise workflow is:

  • Embedding Layer: Each token $x_i$ is mapped to $e_i \in \mathbb{R}^d$, yielding the embedded sequence $X \in \mathbb{R}^{n \times d}$.
  • Triangular Mixer: Two alternating MLP-style token-mixing layers:
    • Global Mixing: Captures long-range dependencies across the entire sequence.
    • Local Mixing: Extracts short-range patterns within fixed-length sub-sessions.
  • Classifier: A linear output layer followed by softmax over the item vocabulary for every time step.
  • Training Mode: Auto-regressive protocol, with input $X[1..n-1]$ used to predict $X[2..n]$.

The forward pipeline (omitting batch dimension) is:

Input sequence  →  Embedding  →  X (n×d)  →  TriangularMixer  →  Y (n×d)  →  Linear + Softmax  →  P (n×|I|)
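
A minimal PyTorch sketch of this pipeline (not the authors' code): the class name `TriMLPPipeline`, the parameter `num_items`, and the stand-in `nn.Identity()` mixer are illustrative placeholders, with the real mixer given in Section 6.

import torch
import torch.nn as nn

class TriMLPPipeline(nn.Module):
    """Embedding -> token mixer -> per-step classifier, following the diagram above."""
    def __init__(self, num_items, d, mixer):
        super().__init__()
        self.embed = nn.Embedding(num_items, d)        # x_i -> e_i in R^d
        self.mixer = mixer                             # e.g. the TriangularMixer of Section 6
        self.classifier = nn.Linear(d, num_items)      # logits over the item vocabulary

    def forward(self, x):                              # x: [B, n] item ids
        e = self.embed(x)                              # [B, n, d]
        y = self.mixer(e)                              # [B, n, d]
        return self.classifier(y)                      # [B, n, |I|]; softmax applied in the loss

# Shape check with a stand-in mixer (swap in the TriangularMixer for real use):
model = TriMLPPipeline(num_items=1000, d=128, mixer=nn.Identity())
logits = model(torch.randint(0, 1000, (2, 64)))        # [2, 64, 1000]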

2. Triangular Mixer Mechanism

The Triangular Mixer introduces two essential learnable kernels: masked global and local mixing matrices, systematically preventing anti-chronological connections.

2.1 Masked Global Kernel

For sequential information preservation, the global mixer weight $W^G \in \mathbb{R}^{n \times n}$ is masked to exclude tokens from future positions:

$$(M_G)_{j,i} = \begin{cases} W^G_{j,i}, & j \leq i \\ -\infty, & j > i \end{cases}$$

The resulting kernel enforces strictly chronological cross-token aggregation, with column-wise softmax ensuring:

$$A^G_{j,i} = \frac{\exp\big((M_G)_{j,i}\big)}{\sum_{k=1}^{i} \exp\big((M_G)_{k,i}\big)}$$
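
A small numerical sketch (illustrative, with $n = 4$ and random weights) of how the mask and the column-wise softmax produce strictly chronological mixing weights:

import torch

n = 4
W_G = torch.randn(n, n)                                   # learnable global kernel W^G
allowed = torch.triu(torch.ones(n, n, dtype=torch.bool))  # entries (j, i) with j <= i
M_G = W_G.masked_fill(~allowed, float("-inf"))            # anti-chronological entries -> -inf
A_G = M_G.softmax(dim=0)                                  # column-wise softmax over rows k <= i

print(A_G)              # entries with j > i are exactly 0
print(A_G.sum(dim=0))   # each column sums to 1 over its visible prefix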

2.2 Masked Local Kernel

Local mixing focuses on intra-session relationships. Let $s$ be the number of sessions (each of length $l$), such that $n = s \cdot l$, and let $p(j) = \lceil j / l \rceil$ denote the session index:

$$(M_L)_{j,i} = \begin{cases} W^L_{j,i}, & p(j) = p(i) \ \text{and} \ j \leq i \\ -\infty, & \text{otherwise} \end{cases}$$

By applying softmax in each session, local dependencies are captured without cross-session interference.
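
A companion sketch for the local kernel (illustrative values $n = 6$, $l = 3$, $s = 2$), showing the block-diagonal triangular pattern that confines mixing to each session:

import torch

n, l, s = 6, 3, 2                                          # two sessions of length 3
W_L = torch.randn(n, n)                                    # learnable local kernel W^L
causal = torch.triu(torch.ones(n, n, dtype=torch.bool))    # j <= i
session = torch.zeros(n, n, dtype=torch.bool)
for k in range(s):                                         # same-session constraint p(j) = p(i)
    session[k*l:(k+1)*l, k*l:(k+1)*l] = True
M_L = W_L.masked_fill(~(causal & session), float("-inf"))
A_L = M_L.softmax(dim=0)                                   # per-session column-wise softmax

print((A_L > 0).int())   # two triangular blocks along the diagonal, zeros elsewhere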

3. Mathematical Formulation

Let $X_G = X^{\top} \in \mathbb{R}^{d \times n}$. The full mixing operations are:

$$Y_G = \text{Act}\big(X_G \cdot \text{Softmax}(M_G)\big), \qquad Y_L = \text{Act}\big(Y_G \cdot \text{Softmax}(M_L)\big)$$

$$Y = (Y_L)^{\top}$$

where $\text{Act}(\cdot)$ is a nonlinearity, e.g., GELU or SiLU. All softmaxes are performed column-wise across the token dimension.

This yields the compact formula:

$$\text{TriMix}(X) = \left( \text{Act}\big(X^{\top} \text{Softmax}(M_G)\big) \cdot \text{Softmax}(M_L) \right)^{\top}$$
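
A direct PyTorch transcription of the two mixing steps above, assuming `M_G` and `M_L` are the masked kernels of Section 2 (with $-\infty$ already in the blocked positions) and `X` has shape [B, n, d]:

import torch
import torch.nn.functional as F

def tri_mix(X, M_G, M_L, act=F.gelu):
    X_t = X.transpose(1, 2)                 # X^T: [B, d, n]
    Y_G = act(X_t @ M_G.softmax(dim=0))     # Y_G = Act(X^T · Softmax(M_G))
    Y_L = act(Y_G @ M_L.softmax(dim=0))     # Y_L = Act(Y_G · Softmax(M_L))
    return Y_L.transpose(1, 2)              # Y = (Y_L)^T, back to [B, n, d]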

4. Computational Complexity and Efficiency

TriMLP employs two masked MLP token-mixing layers, each costing $O(n^2 d)$ multiply-accumulate operations (MACs). Relative to prevailing baselines:

| Model | Complexity | MACs | Params | GPU Mem | Inf. Time | HR@5 |
|---|---|---|---|---|---|---|
| GRU4Rec | $O(nd^2)$ | 13.04G | 0.19M | 2.5 GB | 1.17 s | 0.1084 |
| NextItNet | $O(knd^2)$ | 26.03G | 0.40M | 1.8 GB | 1.67 s | 0.1277 |
| SASRec | $O(n^2 d)$ | 25.94G | 0.40M | 1.8 GB | 2.50 s | 0.1172 |
| FMLP4Rec | $O(nd\log(nd))$ | 17.18G | 0.26M | 2.1 GB | 2.29 s | 0.0632 |
| TriMLP | $O(n^2 d)$ | 2.15G | 0.03M | 1.4 GB | 0.89 s | 0.1390 |

TriMLP exhibits an 83.5% reduction in MACs, 84% reduction in parameters, 23% reduction in GPU memory footprint, and 24% lower inference time compared to the strongest baseline, with an 8.9% gain in HR@5 (Jiang et al., 2023).
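
To make the $O(n^2 d)$ count concrete, here is a back-of-the-envelope sketch of the mixer's MACs alone; it excludes the embedding and classifier layers, so it does not reproduce the table's end-to-end figures.

def mixer_macs(n, d):
    # Two masked (d x n)·(n x n) matmuls: 2 · n² · d multiply-accumulates.
    return 2 * n * n * d

print(mixer_macs(n=64, d=128))   # 1,048,576 ≈ 1.05M MACs for n = 64, d = 128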

5. Experimental Methodology and Performance

TriMLP was empirically validated on 12 datasets spanning Amazon, MovieLens, Tenrec, and LBSN, representing 50K to 10M interactions. Preprocessing excludes users with fewer than 20 interactions and items with fewer than 10, and all sequences are padded/truncated to a fixed length $n$ (range: 32–256). Evaluation metrics include HR@5/10 and NDCG@5/10, alongside averaged inference-time measurements.

Hyperparameters: embedding dim = 128, mixer hidden dim = $n$, dropout = 0.5, Adam learning rate = 0.001, early stopping after 10 epochs without improvement. Experiments are conducted on a single NVIDIA RTX 3090 GPU.
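
For reference, the stated setup collected as a configuration sketch (key names are illustrative, not the authors' code):

config = {
    "embedding_dim": 128,
    "mixer_hidden_dim": "n (the padded sequence length)",
    "dropout": 0.5,
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "early_stopping_patience": 10,        # epochs without improvement
    "sequence_length_n": "32-256, dataset dependent",
    "hardware": "single NVIDIA RTX 3090",
}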

Across datasets, TriMLP yields up to 14.88% higher HR@10/NDCG@10 over state-of-the-art baselines, while reducing average inference time by 8.65%. Example scale-wise results:

| Scale | SOTA HR@10 | TriMLP HR@10 | Δ | Inf. Time ↓ |
|---|---|---|---|---|
| Tiny | 0.1202 (FMLP4Rec) | 0.1256 | +4.5% | −2.3% |
| Small | 0.0572 (SASRec) | 0.0611 | +6.8% | −0.4% |
| Base | 0.2111 (NextItNet) | 0.2343 | +11.1% | −6.7% |
| Large | 0.1907 (NextItNet) | 0.2027 | +6.3% | −23.7% |

6. Training Algorithm and Pseudocode

The core masking and mixing logic is encapsulated as follows:

import torch
import torch.nn as nn

def generate_masks(n, l, s):
    # Boolean masks over (row j, column i): True where mixing is allowed.
    causal = torch.triu(torch.ones(n, n, dtype=torch.bool))   # j <= i (no future tokens)
    session = torch.zeros(n, n, dtype=torch.bool)
    for k in range(s):
        session[k*l:(k+1)*l, k*l:(k+1)*l] = True               # same-session blocks
    return causal, causal & session                            # global mask, local mask

class TriangularMixer(nn.Module):
    def __init__(self, n, l, s, act=nn.GELU):
        super().__init__()
        mask_g, mask_l = generate_masks(n, l, s)
        self.register_buffer("mask_G", mask_g)                 # fixed masks (not trained)
        self.register_buffer("mask_L", mask_l)
        self.W_G = nn.Parameter(torch.randn(n, n) * 0.02)      # learnable global kernel
        self.W_L = nn.Parameter(torch.randn(n, n) * 0.02)      # learnable local kernel
        self.act = act()

    def _mix(self, X_t, W, mask):
        # Act(X^T · Softmax(M)): mask anti-chronological entries with -inf,
        # apply the column-wise softmax over the token dimension, then mix.
        M = W.masked_fill(~mask, float("-inf"))
        return self.act(X_t @ M.softmax(dim=0))

    def forward(self, X):                                      # X: [B, n, d]
        X_t = X.permute(0, 2, 1)                               # [B, d, n]
        Y_G = self._mix(X_t, self.W_G, self.mask_G)            # global mixing
        Y_L = self._mix(Y_G, self.W_L, self.mask_L)            # local mixing
        return Y_L.permute(0, 2, 1)                            # [B, n, d]
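
A quick shape check of the mixer above (illustrative values; $n = s \cdot l$ must hold):

mixer = TriangularMixer(n=64, l=8, s=8)
X = torch.randn(2, 64, 128)      # [B, n, d]
Y = mixer(X)
print(Y.shape)                   # torch.Size([2, 64, 128])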

High-level training loop:

for each batch of item-id sequences X:            # X: [B, n]
    E = Embed(X)                                  # [B, n, d]
    Y = TriangularMixer(E)                        # [B, n, d]
    P = Softmax(Y @ W + b)                        # [B, n, |I|] next-item distributions
    loss = cross_entropy(P[:, :-1], X[:, 1:])     # position t predicts the item at t+1
    backpropagate and update parameters

7. Properties, Limitations, and Prospective Extensions

Strengths:

  • Restores chronological inductive bias to MLPs by triangular masking, eliminating information leakage.
  • Two-branch global/local mixing structure enables modeling of both long- and short-term user preferences.
  • Lightweight, memory-efficient, and highly parallelizable (no Q/K/V matrices or FFN structure).
  • Omits explicit positional embeddings, relying entirely on masked mixing.

Limitations:

  • Static mixing weights may underfit highly heterogeneous or irregular sequences due to input-agnostic design.
  • Requires tuning of the session-size hyperparameters ($s$ or $l$) for optimal local mixing.
  • Fixed input length $n$; adaptation to variable-length streams necessitates additional modification.

Future Directions:

  • Incorporate temporal intervals or contextual information into the masking schema.
  • Explore data augmentation and pre-training strategies for sparse domains.
  • Investigate dynamic (input-adaptive) masking techniques to enhance representation flexibility.
  • Develop methods for handling variable-length sequences without mask retraining.

TriMLP revitalizes the simplicity and efficiency of MLP architectures for sequential recommendation, achieving parity or superiority with state-of-the-art methods while delivering substantial improvements in computational efficiency (Jiang et al., 2023).
