
SparseCTR: Scalable CTR Prediction

Updated 27 January 2026
  • SparseCTR is a CTR prediction framework that models long-term user behavior using personalized time-aware chunking and sparse self-attention.
  • It employs a three-branch EvoAttention mechanism to efficiently capture global, transition, and local interests in large-scale datasets.
  • The design achieves significant speedup and accuracy improvements while managing computational complexity in industrial recommender systems.

SparseCTR is a CTR prediction framework designed for efficient and effective modeling of long-term user behaviors in large-scale industrial recommender systems. It addresses the prohibitive computational complexity associated with conventional full self-attention architectures when applied to long user behavior sequences, while capturing domain-specific personalization and temporal dynamics. SparseCTR introduces a personalized time-aware chunking algorithm combined with a three-branch sparse self-attention module and head-wise composite relative temporal encoding, achieving improved accuracy and computational efficiency across several industry datasets (Lai et al., 25 Jan 2026).

1. Model Architecture

The architecture of SparseCTR consists of an input stage, multiple stacked SparseBlock layers, and a prediction layer.

Input Representation:

  • For each user, profile features $U = \{u_1, \dots, u_{|U|}\}$, historical behaviors $B = \{b_1, \dots, b_{|B|}\}$ (with item embeddings and timestamps), and candidate items $C = \{c_1, \dots, c_{|C|}\}$ are collected.
  • All are concatenated into a mixed sequence $S = [b_1, \dots, b_{|B|}, c_1, \dots, c_{|C|}] \in \mathbb{R}^{n \times d}$ with $n = |B| + |C|$, and profile features are embedded into $e_U \in \mathbb{R}^{|U| d}$.
  • A composite mask $M$ ensures causality (position $i$ only attends to $j < i$) and candidate isolation (candidates do not attend to each other).
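The causality and candidate-isolation constraints can be sketched as a single additive mask. A minimal NumPy illustration follows; the function name is ours, and letting each position attend to itself (to keep every softmax row well-defined) is an assumption, not a detail from the paper:

```python
import numpy as np

def composite_mask(num_behaviors: int, num_candidates: int) -> np.ndarray:
    """Additive attention mask M for the mixed sequence [b_1..b_|B|, c_1..c_|C|].

    Allowed pairs get 0, disallowed pairs get -inf (added to the attention
    logits before softmax). Causality blocks attention to later positions;
    candidate isolation blocks candidate-to-candidate attention.
    """
    n = num_behaviors + num_candidates
    mask = np.triu(np.full((n, n), -np.inf), k=1)   # j > i blocked (causal)
    cand = slice(num_behaviors, n)
    mask[cand, cand] = -np.inf                      # candidates isolated
    np.fill_diagonal(mask[cand, cand], 0.0)         # each candidate still sees itself
    return mask
```

Adding this mask to the $QK^\top/\sqrt{d}$ logits zeroes out the forbidden pairs after softmax, so one forward pass can score every candidate at once.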

SparseBlock Layer (repeated ll times):

  • Pre-normalization is performed with RMSNorm.
  • EvoAttention with composite relative temporal bias is applied.
  • SwiGLU-Feedforward network:

$$E_S^{\mathrm{FFN}} = \bigl(\phi(E_S^{\mathrm{Evo}} W_1) \odot E_S^{\mathrm{Evo}} W_2\bigr) W_3$$

where $\phi = \mathrm{Swish}$, $W_1, W_2 \in \mathbb{R}^{d \times 3d}$, and $W_3 \in \mathbb{R}^{3d \times d}$.
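A minimal NumPy sketch of this SwiGLU feed-forward; shapes follow the $W_1, W_2, W_3$ definitions above, and the helper names are ours:

```python
import numpy as np

def swish(z: np.ndarray) -> np.ndarray:
    """Swish / SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

def swiglu_ffn(x, w1, w2, w3) -> np.ndarray:
    """(Swish(x W1) ⊙ (x W2)) W3 with W1, W2: (d, 3d) and W3: (3d, d)."""
    return (swish(x @ w1) * (x @ w2)) @ w3

# toy dimensions: sequence length 5, model dimension d = 4
rng = np.random.default_rng(0)
d = 4
x = rng.standard_normal((5, d))
w1 = rng.standard_normal((d, 3 * d))
w2 = rng.standard_normal((d, 3 * d))
w3 = rng.standard_normal((3 * d, d))
out = swiglu_ffn(x, w1, w2, w3)  # shape (5, 4): input shape is preserved
```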

Prediction Layer:

  • Candidate embeddings $E_C^{(l)}$ are fused with $e_U$ and passed through an MLP for CTR prediction:

$$\hat{Y} = \mathrm{MLP}\bigl(\mathrm{ReLU}(E_C^{(l)} W_4) \odot \mathrm{sigmoid}(e_U W_5)\bigr)$$

  • The model is optimized using binary cross-entropy loss on all candidates.
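The fusion-and-score step can be sketched as below. This is a shape-level illustration only: a single linear layer plus sigmoid stands in for the paper's MLP, and the weight shapes are our assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_ctr(e_c, e_u, w4, w5, w_mlp):
    """Score candidates via ReLU(E_C W4) ⊙ sigmoid(e_U W5), then a linear head.

    e_c: (|C|, d) candidate embeddings from the last SparseBlock;
    e_u: (d_u,) user-profile embedding; w5: (d_u, d) so the gate matches e_c.
    """
    fused = np.maximum(e_c @ w4, 0.0) * sigmoid(e_u @ w5)  # gate by user profile
    return sigmoid(fused @ w_mlp)                           # one probability per candidate

rng = np.random.default_rng(1)
d, d_u, n_cand = 4, 3, 5
p = predict_ctr(
    rng.standard_normal((n_cand, d)), rng.standard_normal(d_u),
    rng.standard_normal((d, d)), rng.standard_normal((d_u, d)),
    rng.standard_normal(d),
)
```

Because all $|C|$ candidates sit in one batch dimension, the binary cross-entropy loss over `p` covers every candidate of the exposure in a single pass.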

2. Personalized Time-Aware Chunking

A key innovation in SparseCTR is the personalized time-aware chunking (“TimeChunking”) algorithm, which segments long behavior sequences based on the largest time-gaps to preserve the continuity of user actions and facilitate parallelism.

Procedure:

  • Compute time differences $\Delta t_k = t_k - t_{k-1}$, $k = 1 \dots |B|$.
  • Select the top-$|P|$ largest time gaps as chunk boundaries.
  • Define variable-length chunks $\{p_1, \dots, p_P\}$, each containing the consecutive behaviors between boundaries.

| Step | Operation | Output |
|------|-----------|--------|
| 1 | $\Delta t$ computation on $\lvert B\rvert$ behaviors | Length-$\lvert B\rvert$ array |
| 2 | Select largest $\lvert P\rvert$ indices in $\Delta t$ | Chunk boundary indices |
| 3 | Sort, pad with $0$ and $\lvert B\rvert$, split accordingly | $P$ chunks |

This chunking ensures that sequences with abrupt temporal gaps are split appropriately and continuous interaction patterns are maintained.
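The three-step procedure can be sketched as follows. The function name and the convention of cutting at $P - 1$ internal boundaries to obtain $P$ chunks are our reading of the table above:

```python
import numpy as np

def time_chunking(timestamps, num_chunks):
    """Split a behavior sequence into num_chunks variable-length chunks at its
    largest time gaps; returns lists of behavior indices per chunk."""
    ts = np.asarray(timestamps)
    gaps = np.diff(ts)                                         # Δt_k = t_k - t_{k-1}
    cuts = np.sort(np.argsort(gaps)[::-1][:num_chunks - 1] + 1)  # largest-gap boundaries
    bounds = np.concatenate(([0], cuts, [len(ts)]))            # pad with 0 and |B|
    return [list(range(bounds[i], bounds[i + 1])) for i in range(num_chunks)]

# three sessions separated by long idle gaps (timestamps in seconds)
chunks = time_chunking([0, 60, 120, 7200, 7260, 90000, 90060], num_chunks=3)
# chunks == [[0, 1, 2], [3, 4], [5, 6]]
```

The cuts land on the two largest gaps (2 hours and roughly a day), so each burst of consecutive activity stays inside one chunk.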

3. Three-Branch EvoAttention Mechanism

SparseCTR's EvoAttention module sparsifies self-attention by jointly modeling three aspects of user behavior:

Global Attention (long-term interests):

  • For each chunk pjp_j, aggregate its behaviors via an MLP:

$$k_{p_j} = \mathrm{MLP}(\{k_b : b \in p_j\}), \quad v_{p_j} = \mathrm{MLP}(\{v_b : b \in p_j\})$$

Then attend globally:

$$A_{\mathrm{glob}}(Q, K_P, V_P) = \mathrm{softmax}\bigl(QK_P^\top/\sqrt{d} + M\bigr) V_P$$

Transition Attention (interest shifts):

  • For each chunk, use the last $m$ behaviors (collected into $B'$), aggregate their key-value pairs, and compute attention:

$$A_{\mathrm{trans}}(Q, K_{B'}, V_{B'}) = \mathrm{softmax}\bigl(QK_{B'}^\top/\sqrt{d} + M\bigr) V_{B'}$$

Local Attention (short-term interests and user profile):

  • For each position $i$, the local window is $B''_i = \{u_c\} \cup \{b_{i-w+1}, \dots, b_i\}$.
  • Apply local self-attention:

$$A_{\mathrm{loc}}(Q, K_{B''}, V_{B''}) = \mathrm{softmax}\bigl(QK_{B''}^\top/\sqrt{d} + M\bigr) V_{B''}$$

Gated Fusion:

  • For each head, fuse outputs:

$$[\alpha_1, \alpha_2, \alpha_3] = \mathrm{softmax}\bigl([A_{\mathrm{glob}}, A_{\mathrm{trans}}, A_{\mathrm{loc}}] \cdot W_{\mathrm{gate}}\bigr)$$

$$A = \alpha_1 A_{\mathrm{glob}} + \alpha_2 A_{\mathrm{trans}} + \alpha_3 A_{\mathrm{loc}}$$

Multi-head outputs are concatenated and projected:

$$E_S^{\mathrm{Evo}} = \mathrm{concat}(A^{(1)}, \dots, A^{(H)}) \, W_O$$
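The head-wise gated fusion can be sketched in NumPy. Reading $W_{\mathrm{gate}}$ as a $(3d, 3)$ matrix that maps the concatenated branch outputs to one logit per branch is our interpretation of the formula:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gated_fusion(a_glob, a_trans, a_loc, w_gate):
    """Fuse the three branch outputs (each (n, d)) with position-wise gates.

    w_gate: (3d, 3). The softmax turns the three logits into convex weights
    alpha_1..alpha_3, applied independently at every sequence position.
    """
    alpha = softmax(np.concatenate([a_glob, a_trans, a_loc], axis=-1) @ w_gate)
    return (alpha[:, [0]] * a_glob + alpha[:, [1]] * a_trans
            + alpha[:, [2]] * a_loc)
```

Since the weights are a convex combination, the fused output always stays within the span of the three branch outputs; identical branches pass through unchanged.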

4. Composite Relative Temporal Encoding

The composite relative temporal encoding (“RelTemporal”) enhances time awareness in each attention branch via learnable, head-specific biases, capturing the following three temporal relations:

  1. Relative Time (Bucketing):
    • $\Delta t_{ij} = |t_i - t_j|$, $b = \lfloor \log_2 \Delta t_{ij} \rfloor$
    • $\mathrm{bias1}_{ij}^{(h)} = -b \, s_1^{(h)}$
    • $s_1^{(h)}$ is learnable, initialized as a geometric sequence.
  2. Relative Hour (Periodicity):
    • $H_{ij} = (\text{hour difference}) \bmod 24$
    • $\mathrm{bias2}_{ij}^{(h)} = -\sin(\pi H_{ij}/24) \, s_2^{(h)}$
  3. Relative Weekend (Weekday vs. Weekend):
    • $W(t) = 1$ if weekend, $0$ otherwise
    • $\mathrm{bias3}_{ij}^{(h)} = 0$ if $W(t_i) = W(t_j)$, and $-s_3^{(h)}$ otherwise

Combined Bias:

  • $\mathrm{bias}_{ij}^{(h)} = \mathrm{bias1}_{ij}^{(h)} + \mathrm{bias2}_{ij}^{(h)} + \mathrm{bias3}_{ij}^{(h)}$

The composite bias is incorporated into each attention computation:

$$\mathrm{softmax}\bigl(QK^\top/\sqrt{d} + \mathrm{bias}\bigr)\, V$$

For global attention, $t_i, t_j$ denote average chunk timestamps.
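For concreteness, the three biases for one head might be computed as below. This is a scalar sketch: the seconds-based timestamp convention, the $\max(\Delta t, 1)$ guard for the log bucket, and the Unix-epoch weekday arithmetic are our assumptions, not details from the paper:

```python
import numpy as np

def rel_temporal_bias(t_i: int, t_j: int, s1: float, s2: float, s3: float) -> float:
    """Composite relative temporal bias for one head; t_i, t_j in Unix seconds."""
    dt = abs(t_i - t_j)
    bias1 = -np.floor(np.log2(max(dt, 1))) * s1   # log-bucketed recency
    h = abs(t_i // 3600 - t_j // 3600) % 24       # relative hour H_ij
    bias2 = -np.sin(np.pi * h / 24) * s2          # daily periodicity

    def is_weekend(t: int) -> bool:
        return (t // 86400 + 4) % 7 >= 5          # Unix epoch day 0 was a Thursday

    bias3 = 0.0 if is_weekend(t_i) == is_weekend(t_j) else -s3
    return float(bias1 + bias2 + bias3)
```

Two events at the same instant get zero bias; larger gaps, hour offsets, and weekday/weekend mismatches each push the attention logit further down.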

5. Computational Complexity

SparseCTR achieves significant computational savings relative to full self-attention:

  • SparseCTR per block:

$$O\bigl(B \cdot l \cdot [nPd + nmPd + nwd]\bigr)$$

  • Full self-attention:

$$O(B \cdot l \cdot n^2 d)$$

Here, $B$ is the batch size, $l$ the number of SparseBlocks, $n$ the sequence length, $d$ the model dimension, $P$ the chunk count, $m$ the transition window size, and $w$ the local window size, with $P, m, w \ll n$.
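Plugging in illustrative values makes the gap concrete. The specific $(P, m, w)$ below are hypothetical tuning choices, not the paper's settings:

```python
# Per-query key-value pairs scored in one block, dropping the shared B*l*d factor.
n = 1024             # sequence length (as used for the Industry dataset)
P, m, w = 32, 4, 64  # chunk count, transition window, local window (assumed)

sparse_scores = n * P + n * m * P + n * w  # global + transition + local terms
full_scores = n * n                        # full self-attention
ratio = full_scores / sparse_scores        # roughly 4.6x fewer scores here
```

Shrinking $P$, $m$, or $w$ pushes the ratio higher, which is how the reported speedups of several times over full attention become reachable.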

Parameter Count (excluding embeddings):

  • EvoAttention: $O(4ld^2 + 3ld)$
  • SwiGLU-FFN: $O(9ld^2)$

This sparsification enables handling industrial-scale sequences (e.g., length 1024) with constrained compute and memory budgets.

6. Empirical Evaluation

SparseCTR has been extensively validated on large-scale industrial datasets:

  • Industry (Meituan): 86.7M users, 20.2M items, 1.456B samples (2 years)
  • Alibaba Display Ads: 1.14M users, 461K items, 700M samples (22 days)
  • Ele.me click logs: 14.4M users, 7.45M items, 128M samples (30 days)

Baselines:

  • DIN, CAN (Group I); SoftSIM, HardSIM, ETA, TWIN-V2 (long-seq, Group II); BST, HSTU, LONGER, SUAN (attention-based, Group III).

Metrics: AUC and relative improvement (RelImpr over DIN).

| Model | AUC (Industry) | RelImpr (%) |
|-------|----------------|-------------|
| SparseCTR | 0.7083 ± 0.00004 | 8.49 |
| SUAN (best baseline) | 0.7040 | 6.25 |

Ablation Studies:

  • Removing EvoAttention for full self-attention reduces AUC (0.7083 → 0.7051).
  • Removing the global branch yields 0.7023.
  • Removing all temporal encodings yields 0.7015.
  • Each component contributes to state-of-the-art performance.

Comparative Analysis:

  • EvoAttention vs Dilated/NSAttention/KernelAttention: best AUC-speed trade-off.
  • TimeChunking outperforms SimChunking and Fixed-length chunking.
  • RelTemporal outperforms T5Bias, HSTUBias, Alibi, and RoPE.

Efficiency and Scaling Laws:

  • Varying $(P, m, w)$ delivers up to $8.9\times$ speedup over full self-attention and reduces GPU memory without sacrificing AUC.
  • AUC follows a power law, $\mathrm{AUC}(X) = E - A/X^\alpha$ with $R^2 \approx 1$, showing that higher FLOPs consistently yield better AUC across three orders of magnitude in FLOPs.
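The reported power-law form can be sanity-checked on synthetic points (the constants below are made up for illustration, not the paper's measurements). With the asymptote $E$ treated as known, the fit becomes linear in log space:

```python
import numpy as np

# Synthetic FLOPs-vs-AUC pairs generated from AUC(X) = E - A / X^alpha.
E, A, alpha = 0.71, 0.02, 0.5          # illustrative constants only
flops = np.logspace(0, 3, 20)          # three orders of magnitude in X
auc = E - A / flops ** alpha

# log(E - AUC) = log A - alpha * log X, so a straight-line fit recovers (alpha, A).
slope, intercept = np.polyfit(np.log(flops), np.log(E - auc), 1)
alpha_hat, A_hat = -slope, np.exp(intercept)
```

On noise-free data the fit recovers $\alpha$ and $A$ exactly, which is the same linearization one would use to verify the $R^2 \approx 1$ claim on real FLOPs-AUC measurements.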

7. Deployment and Practical Considerations

Deployment Settings:

  • Embedding size $d = 32$; MLP: $[32 \rightarrow 1]$ with ReLU
  • $l = 2$ SparseBlock layers, $H = 8$ heads
  • Sequence lengths: 1024 (Industry, Alibaba), 50 (Ele.me)
  • Training: 1 epoch, Adam optimizer, NVIDIA A100-80G GPUs

Data Processing:

  • Offline data: listwise samples $(U, B, C, Y)$ per exposure, enabling simultaneous evaluation of all $|C|$ candidates per forward pass.

Inference and Online Performance:

  • The causal encoder and mask yield single-pass inference for all candidates; average latency is approximately 40 ms.
  • Online A/B testing (1% traffic, 7 days): CTR increased by 1.72%, CPM by 1.41%, with no regression in latency or memory profile.

This suggests that the architecture is suitable for deployment in latency-sensitive industrial recommendation settings, providing both accuracy and resource efficiency (Lai et al., 25 Jan 2026).
