
SparseCTR: Scalable CTR Prediction

Updated 27 January 2026
  • SparseCTR is a CTR prediction framework that models long-term user behavior using personalized time-aware chunking and sparse self-attention.
  • It employs a three-branch EvoAttention mechanism to efficiently capture global, transition, and local interests in large-scale datasets.
  • The design achieves significant speedup and accuracy improvements while managing computational complexity in industrial recommender systems.

SparseCTR is a CTR prediction framework designed for efficient and effective modeling of long-term user behaviors in large-scale industrial recommender systems. It addresses the prohibitive computational complexity associated with conventional full self-attention architectures when applied to long user behavior sequences, while capturing domain-specific personalization and temporal dynamics. SparseCTR introduces a personalized time-aware chunking algorithm combined with a three-branch sparse self-attention module and head-wise composite relative temporal encoding, achieving improved accuracy and computational efficiency across several industry datasets (Lai et al., 25 Jan 2026).

1. Model Architecture

The architecture of SparseCTR consists of an input stage, multiple stacked SparseBlock layers, and a prediction layer.

Input Representation:

  • For each user, profile features $U = \{u_1, \dots, u_{|U|}\}$, historical behaviors $B = \{b_1, \dots, b_{|B|}\}$ (with item embeddings and timestamps), and candidate items $C = \{c_1, \dots, c_{|C|}\}$ are collected.
  • All are concatenated into a mixed sequence $S = [b_1, \dots, b_{|B|}, c_1, \dots, c_{|C|}] \in \mathbb{R}^{n \times d}$ with $n = |B| + |C|$, and profile features are embedded into $e_U \in \mathbb{R}^{|U| d}$.
  • A composite mask $M$ ensures causality (position $i$ only attends to $j < i$) and candidate isolation (candidates do not attend to each other).
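The causality and candidate-isolation constraints can be sketched as a single additive mask. A minimal NumPy illustration follows; the function name is ours, and letting each position attend to itself (to keep every softmax row well-defined) is an assumption, not a detail from the paper:

```python
import numpy as np

def composite_mask(num_behaviors: int, num_candidates: int) -> np.ndarray:
    """Additive attention mask M for the mixed sequence [b_1..b_|B|, c_1..c_|C|].

    Allowed pairs get 0, disallowed pairs get -inf (added to the attention
    logits before softmax). Causality blocks attention to later positions;
    candidate isolation blocks candidate-to-candidate attention.
    """
    n = num_behaviors + num_candidates
    mask = np.triu(np.full((n, n), -np.inf), k=1)   # j > i blocked (causal)
    cand = slice(num_behaviors, n)
    mask[cand, cand] = -np.inf                      # candidates isolated
    np.fill_diagonal(mask[cand, cand], 0.0)         # each candidate still sees itself
    return mask
```

Adding this mask to the $QK^\top/\sqrt{d}$ logits zeroes out the forbidden pairs after softmax, so one forward pass can score every candidate at once.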

SparseBlock Layer (repeated ll times):

  • Pre-normalization is performed with RMSNorm.
  • EvoAttention with composite relative temporal bias is applied.
  • SwiGLU-Feedforward network:

$$E_S^{\mathrm{FFN}} = \bigl(\phi(E_S^{\mathrm{Evo}} W_1) \odot E_S^{\mathrm{Evo}} W_2\bigr) W_3$$

where $\phi = \mathrm{Swish}$, $W_1, W_2 \in \mathbb{R}^{d \times 3d}$, and $W_3 \in \mathbb{R}^{3d \times d}$.
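A minimal NumPy sketch of this SwiGLU feed-forward; shapes follow the $W_1, W_2, W_3$ definitions above, and the helper names are ours:

```python
import numpy as np

def swish(z: np.ndarray) -> np.ndarray:
    """Swish / SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

def swiglu_ffn(x, w1, w2, w3) -> np.ndarray:
    """(Swish(x W1) ⊙ (x W2)) W3 with W1, W2: (d, 3d) and W3: (3d, d)."""
    return (swish(x @ w1) * (x @ w2)) @ w3

# toy dimensions: sequence length 5, model dimension d = 4
rng = np.random.default_rng(0)
d = 4
x = rng.standard_normal((5, d))
w1 = rng.standard_normal((d, 3 * d))
w2 = rng.standard_normal((d, 3 * d))
w3 = rng.standard_normal((3 * d, d))
out = swiglu_ffn(x, w1, w2, w3)  # shape (5, 4): input shape is preserved
```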

Prediction Layer:

  • Candidate embeddings $E_C^{(l)}$ are fused with $e_U$ and passed through an MLP for CTR prediction:

$$\hat{Y} = \mathrm{MLP}\bigl(\mathrm{ReLU}(E_C^{(l)} W_4) \odot \mathrm{sigmoid}(e_U W_5)\bigr)$$

  • The model is optimized using binary cross-entropy loss on all candidates.
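The fusion-and-score step can be sketched as below. This is a shape-level illustration only: a single linear layer plus sigmoid stands in for the paper's MLP, and the weight shapes are our assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_ctr(e_c, e_u, w4, w5, w_mlp):
    """Score candidates via ReLU(E_C W4) ⊙ sigmoid(e_U W5), then a linear head.

    e_c: (|C|, d) candidate embeddings from the last SparseBlock;
    e_u: (d_u,) user-profile embedding; w5: (d_u, d) so the gate matches e_c.
    """
    fused = np.maximum(e_c @ w4, 0.0) * sigmoid(e_u @ w5)  # gate by user profile
    return sigmoid(fused @ w_mlp)                           # one probability per candidate

rng = np.random.default_rng(1)
d, d_u, n_cand = 4, 3, 5
p = predict_ctr(
    rng.standard_normal((n_cand, d)), rng.standard_normal(d_u),
    rng.standard_normal((d, d)), rng.standard_normal((d_u, d)),
    rng.standard_normal(d),
)
```

Because all $|C|$ candidates sit in one batch dimension, the binary cross-entropy loss over `p` covers every candidate of the exposure in a single pass.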

2. Personalized Time-Aware Chunking

A key innovation in SparseCTR is the personalized time-aware chunking (“TimeChunking”) algorithm, which segments long behavior sequences based on the largest time-gaps to preserve the continuity of user actions and facilitate parallelism.

Procedure:

  • Compute time differences $\Delta t_k = t_k - t_{k-1}$, $k = 1 \dots |B|$.
  • Select the top-$|P|$ largest time gaps as chunk boundaries.
  • Define variable-length chunks $\{p_1, \dots, p_P\}$, each containing the consecutive behaviors between boundaries.

| Step | Operation | Output |
|------|-----------|--------|
| 1 | $\Delta t$ computation on $\lvert B\rvert$ behaviors | Length-$\lvert B\rvert$ array |
| 2 | Select largest $\lvert P\rvert$ indices in $\Delta t$ | Chunk boundary indices |
| 3 | Sort, pad with $0$ and $\lvert B\rvert$, split accordingly | $P$ chunks |

This chunking ensures that sequences with abrupt temporal gaps are split appropriately and continuous interaction patterns are maintained.
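The three-step procedure can be sketched as follows. The function name and the convention of cutting at $P - 1$ internal boundaries to obtain $P$ chunks are our reading of the table above:

```python
import numpy as np

def time_chunking(timestamps, num_chunks):
    """Split a behavior sequence into num_chunks variable-length chunks at its
    largest time gaps; returns lists of behavior indices per chunk."""
    ts = np.asarray(timestamps)
    gaps = np.diff(ts)                                         # Δt_k = t_k - t_{k-1}
    cuts = np.sort(np.argsort(gaps)[::-1][:num_chunks - 1] + 1)  # largest-gap boundaries
    bounds = np.concatenate(([0], cuts, [len(ts)]))            # pad with 0 and |B|
    return [list(range(bounds[i], bounds[i + 1])) for i in range(num_chunks)]

# three sessions separated by long idle gaps (timestamps in seconds)
chunks = time_chunking([0, 60, 120, 7200, 7260, 90000, 90060], num_chunks=3)
# chunks == [[0, 1, 2], [3, 4], [5, 6]]
```

The cuts land on the two largest gaps (2 hours and roughly a day), so each burst of consecutive activity stays inside one chunk.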

3. Three-Branch EvoAttention Mechanism

SparseCTR's EvoAttention module sparsifies self-attention by jointly modeling three aspects of user behavior:

Global Attention (long-term interests):

  • For each chunk pjp_j, aggregate its behaviors via an MLP:

$$k_{p_j} = \mathrm{MLP}(\{k_b : b \in p_j\}), \quad v_{p_j} = \mathrm{MLP}(\{v_b : b \in p_j\})$$

Then attend globally:

$$A_{\mathrm{glob}}(Q, K_P, V_P) = \mathrm{softmax}\bigl(QK_P^\top/\sqrt{d} + M\bigr) V_P$$

Transition Attention (interest shifts):

  • For each chunk, use the last $m$ behaviors (collected into $B'$), aggregate their key-value pairs, and compute attention:

$$A_{\mathrm{trans}}(Q, K_{B'}, V_{B'}) = \mathrm{softmax}\bigl(QK_{B'}^\top/\sqrt{d} + M\bigr) V_{B'}$$

Local Attention (short-term interests and user profile):

  • For each position $i$, the local window is $B''_i = \{u_c\} \cup \{b_{i-w+1}, \dots, b_i\}$.
  • Apply local self-attention:

$$A_{\mathrm{loc}}(Q, K_{B''}, V_{B''}) = \mathrm{softmax}\bigl(QK_{B''}^\top/\sqrt{d} + M\bigr) V_{B''}$$

Gated Fusion:

  • For each head, fuse outputs:

$$[\alpha_1, \alpha_2, \alpha_3] = \mathrm{softmax}\bigl([A_{\mathrm{glob}}, A_{\mathrm{trans}}, A_{\mathrm{loc}}] \cdot W_{\mathrm{gate}}\bigr)$$

$$A = \alpha_1 A_{\mathrm{glob}} + \alpha_2 A_{\mathrm{trans}} + \alpha_3 A_{\mathrm{loc}}$$

Multi-head outputs are concatenated and projected:

$$E_S^{\mathrm{Evo}} = \mathrm{concat}(A^{(1)}, \dots, A^{(H)}) \, W_O$$
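The head-wise gated fusion can be sketched in NumPy. Reading $W_{\mathrm{gate}}$ as a $(3d, 3)$ matrix that maps the concatenated branch outputs to one logit per branch is our interpretation of the formula:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gated_fusion(a_glob, a_trans, a_loc, w_gate):
    """Fuse the three branch outputs (each (n, d)) with position-wise gates.

    w_gate: (3d, 3). The softmax turns the three logits into convex weights
    alpha_1..alpha_3, applied independently at every sequence position.
    """
    alpha = softmax(np.concatenate([a_glob, a_trans, a_loc], axis=-1) @ w_gate)
    return (alpha[:, [0]] * a_glob + alpha[:, [1]] * a_trans
            + alpha[:, [2]] * a_loc)
```

Since the weights are a convex combination, the fused output always stays within the span of the three branch outputs; identical branches pass through unchanged.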

4. Composite Relative Temporal Encoding

The composite relative temporal encoding (“RelTemporal”) enhances time awareness in each attention branch via learnable, head-specific biases, capturing the following three temporal relations:

  1. Relative Time (Bucketing):
    • $\Delta t_{ij} = |t_i - t_j|$, $b = \lfloor \log_2 \Delta t_{ij} \rfloor$
    • $\mathrm{bias1}_{ij}^{(h)} = -b \, s_1^{(h)}$
    • $s_1^{(h)}$ is learnable, initialized as a geometric sequence.
  2. Relative Hour (Periodicity):
    • $H_{ij} = (\text{hour difference}) \bmod 24$
    • $\mathrm{bias2}_{ij}^{(h)} = -\sin(\pi H_{ij}/24) \, s_2^{(h)}$
  3. Relative Weekend (Weekday vs. Weekend):
    • $W(t) = 1$ if weekend, $0$ otherwise
    • $\mathrm{bias3}_{ij}^{(h)} = 0$ if $W(t_i) = W(t_j)$, and $-s_3^{(h)}$ otherwise

Combined Bias:

  • $\mathrm{bias}_{ij}^{(h)} = \mathrm{bias1}_{ij}^{(h)} + \mathrm{bias2}_{ij}^{(h)} + \mathrm{bias3}_{ij}^{(h)}$

The composite bias is incorporated into each attention computation:

$$\mathrm{softmax}\bigl(QK^\top/\sqrt{d} + \mathrm{bias}\bigr)\, V$$

For global attention, $t_i, t_j$ denote average chunk timestamps.
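For concreteness, the three biases for one head might be computed as below. This is a scalar sketch: the seconds-based timestamp convention, the $\max(\Delta t, 1)$ guard for the log bucket, and the Unix-epoch weekday arithmetic are our assumptions, not details from the paper:

```python
import numpy as np

def rel_temporal_bias(t_i: int, t_j: int, s1: float, s2: float, s3: float) -> float:
    """Composite relative temporal bias for one head; t_i, t_j in Unix seconds."""
    dt = abs(t_i - t_j)
    bias1 = -np.floor(np.log2(max(dt, 1))) * s1   # log-bucketed recency
    h = abs(t_i // 3600 - t_j // 3600) % 24       # relative hour H_ij
    bias2 = -np.sin(np.pi * h / 24) * s2          # daily periodicity

    def is_weekend(t: int) -> bool:
        return (t // 86400 + 4) % 7 >= 5          # Unix epoch day 0 was a Thursday

    bias3 = 0.0 if is_weekend(t_i) == is_weekend(t_j) else -s3
    return float(bias1 + bias2 + bias3)
```

Two events at the same instant get zero bias; larger gaps, hour offsets, and weekday/weekend mismatches each push the attention logit further down.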

5. Computational Complexity

SparseCTR achieves significant computational savings relative to full self-attention:

  • SparseCTR per block:

$$O\bigl(B \cdot l \cdot [nPd + nmPd + nwd]\bigr)$$

  • Full self-attention:

$$O(B \cdot l \cdot n^2 d)$$

Here, $B$ is the batch size, $l$ the number of SparseBlocks, $n$ the sequence length, $d$ the model dimension, $P$ the chunk count, $m$ the transition window size, and $w$ the local window size, with $P, m, w \ll n$.
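Plugging in illustrative values makes the gap concrete. The specific $(P, m, w)$ below are hypothetical tuning choices, not the paper's settings:

```python
# Per-query key-value pairs scored in one block, dropping the shared B*l*d factor.
n = 1024             # sequence length (as used for the Industry dataset)
P, m, w = 32, 4, 64  # chunk count, transition window, local window (assumed)

sparse_scores = n * P + n * m * P + n * w  # global + transition + local terms
full_scores = n * n                        # full self-attention
ratio = full_scores / sparse_scores        # roughly 4.6x fewer scores here
```

Shrinking $P$, $m$, or $w$ pushes the ratio higher, which is how the reported speedups of several times over full attention become reachable.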

Parameter Count (excluding embeddings):

  • EvoAttention: $O(4ld^2 + 3ld)$
  • SwiGLU-FFN: $O(9ld^2)$

This sparsification enables handling industrial-scale sequences (e.g., length 1024) with constrained compute and memory budgets.

6. Empirical Evaluation

SparseCTR has been extensively validated on large-scale industrial datasets:

  • Industry (Meituan): 86.7M users, 20.2M items, 1.456B samples (2 years)
  • Alibaba Display Ads: 1.14M users, 461K items, 700M samples (22 days)
  • Ele.me click logs: 14.4M users, 7.45M items, 128M samples (30 days)

Baselines:

  • DIN, CAN (Group I); SoftSIM, HardSIM, ETA, TWIN-V2 (long-seq, Group II); BST, HSTU, LONGER, SUAN (attention-based, Group III).

Metrics: AUC and relative improvement (RelImpr over DIN).

| Model | AUC (Industry) | RelImpr (%) |
|-------|----------------|-------------|
| SparseCTR | 0.7083 ± 0.00004 | 8.49 |
| SUAN (best baseline) | 0.7040 | 6.25 |

Ablation Studies:

  • Removing EvoAttention for full self-attention reduces AUC (0.7083 → 0.7051).
  • Removing the global branch yields 0.7023.
  • Removing all temporal encodings yields 0.7015.
  • Each component contributes to state-of-the-art performance.

Comparative Analysis:

  • EvoAttention vs Dilated/NSAttention/KernelAttention: best AUC-speed trade-off.
  • TimeChunking outperforms SimChunking and Fixed-length chunking.
  • RelTemporal outperforms T5Bias, HSTUBias, Alibi, and RoPE.

Efficiency and Scaling Laws:

  • Varying $(P, m, w)$ delivers up to $8.9\times$ speedup over full self-attention and reduces GPU memory without sacrificing AUC.
  • AUC follows a power law, $\mathrm{AUC}(X) = E - A/X^\alpha$ with $R^2 \approx 1$, showing that higher FLOPs consistently yield better AUC across three orders of magnitude in FLOPs.
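The reported power-law form can be sanity-checked on synthetic points (the constants below are made up for illustration, not the paper's measurements). With the asymptote $E$ treated as known, the fit becomes linear in log space:

```python
import numpy as np

# Synthetic FLOPs-vs-AUC pairs generated from AUC(X) = E - A / X^alpha.
E, A, alpha = 0.71, 0.02, 0.5          # illustrative constants only
flops = np.logspace(0, 3, 20)          # three orders of magnitude in X
auc = E - A / flops ** alpha

# log(E - AUC) = log A - alpha * log X, so a straight-line fit recovers (alpha, A).
slope, intercept = np.polyfit(np.log(flops), np.log(E - auc), 1)
alpha_hat, A_hat = -slope, np.exp(intercept)
```

On noise-free data the fit recovers $\alpha$ and $A$ exactly, which is the same linearization one would use to verify the $R^2 \approx 1$ claim on real FLOPs-AUC measurements.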

7. Deployment and Practical Considerations

Deployment Settings:

  • Embedding size $d = 32$; MLP: $[32 \rightarrow 1]$ with ReLU
  • $l = 2$ SparseBlock layers, $H = 8$ heads
  • Sequence lengths: 1024 (Industry, Alibaba), 50 (Ele.me)
  • Training: 1 epoch, Adam optimizer, NVIDIA A100-80G GPUs

Data Processing:

  • Offline data: listwise samples $(U, B, C, Y)$ per exposure, enabling simultaneous evaluation of all $|C|$ candidates per forward pass.

Inference and Online Performance:

  • The causal encoder and mask yield single-pass inference for all candidates; average latency is approximately 40 ms.
  • Online A/B testing (1% traffic, 7 days): CTR increased by 1.72%, CPM by 1.41%, with no regression in latency or memory profile.

This suggests that the architecture is suitable for deployment in latency-sensitive industrial recommendation settings, providing both accuracy and resource efficiency (Lai et al., 25 Jan 2026).
