SparseCTR: Scalable CTR Prediction
- SparseCTR is a CTR prediction framework that models long-term user behavior using personalized time-aware chunking and sparse self-attention.
- It employs a three-branch EvoAttention mechanism to efficiently capture global, transition, and local interests in large-scale datasets.
- The design achieves significant speedup and accuracy improvements while managing computational complexity in industrial recommender systems.
SparseCTR is a CTR prediction framework designed for efficient and effective modeling of long-term user behaviors in large-scale industrial recommender systems. It addresses the prohibitive computational complexity associated with conventional full self-attention architectures when applied to long user behavior sequences, while capturing domain-specific personalization and temporal dynamics. SparseCTR introduces a personalized time-aware chunking algorithm combined with a three-branch sparse self-attention module and head-wise composite relative temporal encoding, achieving improved accuracy and computational efficiency across several industry datasets (Lai et al., 25 Jan 2026).
1. Model Architecture
The architecture of SparseCTR consists of an input stage, multiple stacked SparseBlock layers, and a prediction layer.
Input Representation:
- For each user, profile features, historical behaviors (with item embeddings and timestamps), and candidate items are collected.
- All are concatenated into a single mixed input sequence, with profile features embedded into the same space as the behavior and candidate embeddings.
- A composite mask ensures causality (each position attends only to itself and earlier positions) and candidate isolation (candidates do not attend to one another).
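The composite mask can be sketched as a boolean matrix. The helper below is a minimal illustration assuming behaviors precede candidates in the mixed sequence; the paper's exact mask layout may differ:

```python
import numpy as np

def composite_mask(n_behaviors: int, n_candidates: int) -> np.ndarray:
    """Boolean attention mask (True = may attend) for a mixed sequence of
    historical behaviors followed by candidate items. Illustrative sketch."""
    n = n_behaviors + n_candidates
    # Causality: each position attends only to itself and earlier positions.
    mask = np.tril(np.ones((n, n), dtype=bool))
    # Candidate isolation: a candidate may attend to all behaviors and to
    # itself, but not to other candidates.
    for i in range(n_behaviors, n):
        mask[i, n_behaviors:i] = False  # block attention to earlier candidates
    return mask

m = composite_mask(4, 3)  # 4 behaviors, 3 candidates
```

With this layout a single forward pass scores every candidate, since each candidate row sees the full behavior history but no sibling candidates.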
SparseBlock Layer (stacked multiple times):
- Pre-normalization is performed with RMSNorm.
- EvoAttention with composite relative temporal bias is applied.
- SwiGLU feed-forward network: $\mathrm{FFN}(x) = W_2\,(\mathrm{Swish}(W_1 x) \odot W_3 x)$, where $W_1, W_3 \in \mathbb{R}^{d_{\mathrm{ff}} \times d}$, $W_2 \in \mathbb{R}^{d \times d_{\mathrm{ff}}}$, and $\odot$ is the element-wise product (the standard SwiGLU form).
Prediction Layer:
- Candidate embeddings are fused with the encoded sequence representation and passed through an MLP to produce CTR predictions.
- The model is optimized using binary cross-entropy loss on all candidates.
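A minimal NumPy sketch of the pre-normalization and feed-forward steps above, using the standard RMSNorm and SwiGLU forms (the learnable RMSNorm gain and the paper's exact hidden width are omitted as simplifying assumptions; row-vector convention is used):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # RMSNorm: rescale by the reciprocal root-mean-square of each row
    # (learnable per-dimension gain omitted for brevity).
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)

def swiglu_ffn(x, w1, w3, w2):
    # SwiGLU feed-forward: project up twice, gate with Swish, project down.
    def swish(z):
        return z / (1.0 + np.exp(-z))
    return (swish(x @ w1) * (x @ w3)) @ w2

rng = np.random.default_rng(0)
d, d_ff = 8, 16                       # illustrative, not the paper's sizes
x = rng.normal(size=(5, d))           # 5 sequence positions
y = swiglu_ffn(rms_norm(x),
               rng.normal(size=(d, d_ff)),
               rng.normal(size=(d, d_ff)),
               rng.normal(size=(d_ff, d)))
```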
2. Personalized Time-Aware Chunking
A key innovation in SparseCTR is the personalized time-aware chunking (“TimeChunking”) algorithm, which segments long behavior sequences based on the largest time-gaps to preserve the continuity of user actions and facilitate parallelism.
Procedure:
- Compute the time differences between consecutive behaviors.
- Select the positions of the largest time gaps as chunk boundaries.
- Define variable-length chunks, each containing the consecutive behaviors between adjacent boundaries.
| Step | Operation | Output |
|---|---|---|
| 1 | Compute time gaps between consecutive behaviors | Array of gaps (one fewer than the sequence length) |
| 2 | Select the indices of the largest gaps | Chunk boundary indices |
| 3 | Sort boundaries, pad with the sequence endpoints, split | Variable-length chunks |
This chunking ensures that sequences with abrupt temporal gaps are split appropriately and continuous interaction patterns are maintained.
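The three steps can be sketched as follows; the exact boundary-padding details are an assumption, not the paper's verbatim procedure:

```python
import numpy as np

def time_chunking(timestamps: np.ndarray, n_chunks: int):
    """Split a behavior sequence into n_chunks variable-length chunks at the
    (n_chunks - 1) largest time gaps. Returns a list of index arrays."""
    diffs = np.diff(timestamps)           # gap between consecutive behaviors
    k = n_chunks - 1
    # Top-k largest gaps, sorted into sequence order; +1 converts a gap
    # index into the split position after it.
    boundaries = np.sort(np.argpartition(diffs, -k)[-k:]) + 1
    return np.split(np.arange(len(timestamps)), boundaries)

# Three bursts of activity separated by two large gaps (48 and 149 units).
ts = np.array([0, 1, 2, 50, 51, 200, 201, 202])
chunks = time_chunking(ts, 3)
```

Splitting at the largest gaps keeps each burst of temporally continuous activity inside one chunk, which is the property the paper's TimeChunking is designed to preserve.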
3. Three-Branch EvoAttention Mechanism
SparseCTR's EvoAttention module sparsifies self-attention by jointly modeling three aspects of user behavior:
Global Attention (long-term interests):
- For each chunk, aggregate its behaviors into a single summary vector via an MLP; queries then attend globally over the chunk summaries, covering the full history at chunk granularity.
Transition Attention (interest shifts):
- For each chunk, take its last few behaviors, aggregate their key-value pairs, and compute attention over them to capture shifts between consecutive interest phases.
Local Attention (short-term interests and user profile):
- For each position, define a fixed-size local window of preceding positions.
- Apply self-attention restricted to that window, capturing short-term interests alongside the user profile.
Gated Fusion:
- For each head, the three branch outputs are fused via learned gates.
Multi-head outputs are concatenated and linearly projected back to the model dimension.
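A sketch of per-head gated fusion; the gate parameterization (a softmax over a linear projection of the per-position input) is an illustrative assumption, not necessarily the paper's exact form:

```python
import numpy as np

def gated_fusion(o_global, o_trans, o_local, w_gate, x):
    # Input-dependent gates: one logit per branch, softmax-normalized,
    # so the output is a convex combination of the three branch outputs.
    logits = x @ w_gate                              # shape (T, 3)
    g = np.exp(logits - logits.max(-1, keepdims=True))
    g = g / g.sum(-1, keepdims=True)
    return (g[..., 0:1] * o_global
            + g[..., 1:2] * o_trans
            + g[..., 2:3] * o_local)

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 6))       # per-position query inputs
w_gate = rng.normal(size=(6, 3))  # hypothetical gate projection
o = rng.normal(size=(4, 6))       # stand-in branch output
y = gated_fusion(o, o, o, w_gate, x)
```

Because the gates sum to one per position, a degenerate case where all branches agree passes through unchanged, which makes the fusion easy to sanity-check.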
4. Composite Relative Temporal Encoding
The composite relative temporal encoding (“RelTemporal”) enhances time awareness in each attention branch via learnable, head-specific biases, capturing the following three temporal relations:
- Relative Time (Bucketing): the time gap between query and key positions is mapped to a bucket index, and each bucket carries a learnable bias, initialized as a geometric sequence.
- Relative Hour (Periodicity): a learnable bias indexed by the hour-of-day difference between query and key timestamps.
- Relative Weekend (Weekday vs. Weekend): each timestamp is flagged as $1$ if it falls on a weekend and $0$ otherwise, and a learnable bias is applied depending on whether the query and key flags agree.
Combined Bias:
The three biases are summed per head and added to the attention logits in each branch. For global attention, each chunk's timestamp is taken as the average of its behaviors' timestamps.
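One common way to realize relative-time bucketing is a logarithmic bucket map, so that nearby gaps get fine resolution and distant gaps are merged. The sketch below uses an illustrative bucket count and time range, not the paper's values:

```python
import numpy as np

def time_buckets(delta_t, n_buckets=32, max_t=86400 * 365):
    """Map relative time gaps (in seconds) to bucket indices on a log scale.
    Each index would select a learnable bias; values here are illustrative."""
    delta_t = np.maximum(delta_t, 1)                 # avoid log(0)
    idx = np.floor(np.log(delta_t) / np.log(max_t) * (n_buckets - 1))
    return np.clip(idx, 0, n_buckets - 1).astype(int)

# Gaps of 1 s, 1 min, 1 h, 1 day, and 1 year land in increasing buckets.
b = time_buckets(np.array([1, 60, 3600, 86400, 31536000]))
```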
5. Computational Complexity
SparseCTR achieves significant computational savings relative to full self-attention:
- SparseCTR per block: each query attends only to the chunk summaries, the transition keys, and a local window, so the attention cost grows linearly with the sequence length.
- Full self-attention: every position attends to every other position, so the cost grows quadratically with the sequence length.
The governing quantities are the batch size, the number of SparseBlocks, the sequence length, the model dimension, the chunk count, the transition window size, and the local window size; the last three are all much smaller than the sequence length.
Parameter Count (excluding embeddings):
- EvoAttention: dominated by the per-head query/key/value/output projections and the gate parameters.
- SwiGLU-FFN: dominated by its three weight matrices, quadratic in the model dimension.
This sparsification enables handling industrial-scale sequences (e.g., length 1024) with constrained compute and memory budgets.
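A back-of-envelope comparison of attention score computations per query layer, using illustrative values for the sequence length, chunk count, transition size, and local window (all of these constants are assumptions, not the paper's settings):

```python
# Illustrative sizes: T behaviors, C chunks, s transition behaviors per
# chunk, w local-window positions.
T, C, s, w = 1024, 32, 4, 64

full = T * T                           # full self-attention: all pairs
sparse = T * C + T * (C * s) + T * w   # global + transition + local keys
speedup = full / sparse                # ratio of score computations
```

Even with generous window sizes, the sparse variant computes several times fewer attention scores, and the gap widens as the sequence length grows because only the `T * T` term is quadratic.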
6. Empirical Evaluation
SparseCTR has been extensively validated on large-scale industrial datasets:
- Industry (Meituan): 86.7M users, 20.2M items, 1.456B samples (2 years)
- Alibaba Display Ads: 1.14M users, 461K items, 700M samples (22 days)
- Ele.me click logs: 14.4M users, 7.45M items, 128M samples (30 days)
Baselines:
- DIN, CAN (Group I); SoftSIM, HardSIM, ETA, TWIN-V2 (long-seq, Group II); BST, HSTU, LONGER, SUAN (attention-based, Group III).
Metrics: AUC and relative improvement (RelImpr over DIN).
| Model | AUC (Industry) | RelImpr (%) |
|---|---|---|
| SparseCTR | 0.7083 ±0.00004 | 8.49 |
| SUAN (best baseline) | 0.7040 | 6.25 |
Ablation Studies:
- Removing EvoAttention for full self-attention reduces AUC (0.7083 → 0.7051).
- Removing the global branch yields 0.7023.
- Removing all temporal encodings yields 0.7015.
- Each component contributes to state-of-the-art performance.
Comparative Analysis:
- EvoAttention vs Dilated/NSAttention/KernelAttention: best AUC-speed trade-off.
- TimeChunking outperforms SimChunking and Fixed-length chunking.
- RelTemporal outperforms T5Bias, HSTUBias, Alibi, and RoPE.
Efficiency and Scaling Laws:
- Varying the sparsity parameters delivers substantial speedup over full self-attention, reducing GPU memory without sacrificing AUC.
- AUC fits a power law in training FLOPs, demonstrating that higher FLOPs consistently yield better AUC across three orders of magnitude in FLOPs.
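A power law of this kind can be fit by linear regression in log-log space. The (FLOPs, metric) points below are synthetic, purely to demonstrate the fitting procedure:

```python
import numpy as np

# Synthetic points following y = a * x^b exactly (a=0.02, b=0.05);
# real scaling-law data would be noisy AUC-gain measurements.
x = np.array([1e9, 1e10, 1e11, 1e12])   # FLOPs (made up)
y = 0.02 * x ** 0.05                     # metric (made up)

# log y = log a + b * log x, so a degree-1 polyfit recovers (b, log a).
b, log_a = np.polyfit(np.log(x), np.log(y), 1)
a = np.exp(log_a)
```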
7. Deployment and Practical Considerations
Deployment Settings:
- Embedding size and MLP layer widths are fixed per dataset; the prediction MLP uses ReLU activations.
- The number of stacked SparseBlock layers and attention heads is likewise fixed per dataset.
- Sequence lengths: $1024$ (Industry, Alibaba), $50$ (Ele.me)
- Training: 1 epoch, Adam optimizer, NVIDIA A100-80G GPUs
Data Processing:
- Offline data: listwise samples per exposure, enabling simultaneous evaluation of all candidates per forward pass.
Inference and Online Performance:
- The causal encoder and composite mask yield single-pass inference for all candidates, with an average latency of 40 ms.
- Online A/B testing (1% traffic, 7 days): CTR increased by 1.72%, CPM by 1.41%, with no regression in latency or memory profile.
This suggests that the architecture is suitable for deployment in latency-sensitive industrial recommendation settings, providing both accuracy and resource efficiency (Lai et al., 25 Jan 2026).