CTR-Sink: A Framework for CTR Prediction
- CTR-Sink is a framework that enhances CTR prediction by inserting explicit sink tokens to restructure user behavior logs and mitigate semantic fragmentation.
- It introduces a two-stage training process with sink-only and full-sequence stages to refocus attention for improved context aggregation.
- Empirical results show AUC improvements of up to 0.6% across diverse datasets, validating its practical impact in recommendation tasks.
Click-Through Rate Sink (CTR-Sink) is a framework designed to improve click-through rate (CTR) prediction in recommendation systems by addressing a core structural gap that arises when user behavior sequences are modeled as textual input for language models (LMs). Unlike natural language text, user behavior logs consist of discrete, semantically unrelated items separated by minimal context, causing attention dispersion and degraded performance in conventional LMs. CTR-Sink introduces explicit attention sink tokens between user actions, regulated with recommendation-specific signals and a dedicated attention mechanism, to reconstruct coherent context modeling and facilitate information aggregation at behavior boundaries (Li et al., 5 Aug 2025).
1. Problem Motivation and Structural Disparity
CTR prediction tasks typically require estimating the probability that a user clicks on a recommended item, conditioned on prior behavioral data. Recent work has modeled user histories as textual sequences and utilized pre-trained LMs. However, while LMs are trained on coherent natural language—where certain tokens (e.g., conjunctions, punctuation) naturally function as attention sinks collecting context—behavioral sequences (e.g., "Item A, Item B, Item C, ...") lack such semantic structure. This leads to “semantic fragmentation”: LM attention is evenly—or even randomly—distributed among tokens, failing to recognize meaningful boundaries or relations between behaviors.
Visualization of self-attention patterns confirms this disparity. In natural language, attention converges on semantic pivots; in user behavior sequences, attention is diffuse or over-concentrated on a single position. Consequently, the LM’s inherent mechanisms for context aggregation fail to transfer, limiting the quality of learned representations for CTR tasks.
2. Attention Sink Theory and Formalization
The CTR-Sink approach draws on the attention sink theory, first developed in the context of long-sequence LMs, where special anchor tokens (sinks) serve to aggregate attention and control context propagation across long or structured sequences [Xiao et al., 2024; Gu et al., 2024 as cited in (Li et al., 5 Aug 2025)]. A sink token, by design, gathers attention from neighboring items, acting as a local information hub at natural “behavior” boundaries.
Formally, for a user’s top-$K$ retrieved behaviors $b_1, \dots, b_K$, each with feature vector $x_i$, sink tokens $s_i$ are interleaved as:

$$[\, b_1, s_1, b_2, s_2, \dots, b_K, s_K \,]$$

Each $s_i$ is parameterized by an external signal $d_i$, where $d_i$ denotes the temporal distance between $b_i$ and the target item. This signal provides recommendation context, differentiating true behavioral boundaries from random separators.
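The interleaving described above can be illustrated with a minimal sketch. The `Behavior` class, the `[SINK]` token string, the choice to place each sink directly after its behavior, and the use of seconds as the time unit are all assumptions made for illustration, not details confirmed by the source.

```python
from dataclasses import dataclass


@dataclass
class Behavior:
    item_text: str   # textual description of the interacted item
    timestamp: int   # interaction time in seconds (hypothetical unit)


def build_sink_sequence(behaviors, target_time, sink_token="[SINK]"):
    """Interleave each behavior with a sink token carrying its temporal distance.

    Returns (token, temporal_distance) pairs. Behavior tokens carry None,
    because only sinks receive the external signal in this sketch.
    """
    sequence = []
    for b in behaviors:
        distance = target_time - b.timestamp      # distance to the target item
        sequence.append((b.item_text, None))      # behavior token
        sequence.append((sink_token, distance))   # sink token placed after it (assumed order)
    return sequence


history = [Behavior("Item A", 100), Behavior("Item B", 250), Behavior("Item C", 400)]
seq = build_sink_sequence(history, target_time=500)
sink_positions = [i for i, (tok, _) in enumerate(seq) if tok == "[SINK]"]
```

Keeping the sink positions alongside the sequence is convenient because later steps (attention bias, pooling) need to index the sinks directly.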
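At the attention level, the standard Transformer attention matrix,

$$A = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right),$$

is supplemented by a sink-specific self-attention bias, constructed via:
- Extracting sink hidden states $H_s$,
- Computing self-attention among sinks:

$$A_s = \mathrm{softmax}\!\left(\frac{Q_s K_s^\top}{\sqrt{d}}\right),$$

where $Q_s = H_s W_Q^{s}$, $K_s = H_s W_K^{s}$,
- Scattering $A_s$ into an $N \times N$ bias $B$, where $N$ is the length of the full interleaved sequence,
- Fusing $B$ with the standard attention matrix $A$.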
This mechanism amplifies inter-sink dependencies to enhance behavioral correlation modeling.
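The four steps above (extract, sink self-attention, scatter, fuse) can be sketched in plain Python. The exact fusion rule is not recoverable from the text here, so this sketch assumes the bias is added to the attention logits before the softmax, scaled by a hypothetical `bias_weight`. The separate sink projection matrices `sink_w_q` and `sink_w_k` are likewise assumptions.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def scaled_scores(q, k):
    """Compute q k^T / sqrt(d) for row-major matrices q and k."""
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    return [[v / math.sqrt(d) for v in row] for row in matmul(q, k_t)]


def sink_biased_attention(hidden, sink_positions, w_q, w_k,
                          sink_w_q, sink_w_k, bias_weight=1.0):
    n = len(hidden)
    # Standard attention logits over the full sequence.
    logits = scaled_scores(matmul(hidden, w_q), matmul(hidden, w_k))
    # Step 1: extract sink hidden states.
    h_s = [hidden[p] for p in sink_positions]
    # Step 2: self-attention among sinks only.
    a_s = [softmax(r) for r in
           scaled_scores(matmul(h_s, sink_w_q), matmul(h_s, sink_w_k))]
    # Step 3: scatter sink attention into an n x n bias (zero elsewhere).
    bias = [[0.0] * n for _ in range(n)]
    for i, row_pos in enumerate(sink_positions):
        for j, col_pos in enumerate(sink_positions):
            bias[row_pos][col_pos] = a_s[i][j]
    # Step 4: fuse. ASSUMPTION: additive bias on logits before softmax.
    fused = [[logits[i][j] + bias_weight * bias[i][j] for j in range(n)]
             for i in range(n)]
    return [softmax(r) for r in fused], bias


hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
eye = [[1.0, 0.0], [0.0, 1.0]]
attn, bias = sink_biased_attention(hidden, [1, 3], eye, eye, eye, eye)
base, _ = sink_biased_attention(hidden, [1, 3], eye, eye, eye, eye, bias_weight=0.0)
```

Under this additive assumption, only rows belonging to sink positions change, and within those rows the attention mass placed on other sinks can only grow, which matches the stated goal of amplifying inter-sink dependencies.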
3. CTR-Sink Model Architecture and Dataflow
CTR-Sink comprises three principal stages:
- Sequence Construction: User history is retrieved using an external semantic index (SIM), and the behavioral sequence is augmented by interleaving behavior tokens $b_i$ with their corresponding sink tokens $s_i$.
- Attention Regulation: The extended sequence is processed by a frozen LM encoder (e.g., RoBERTa) or decoder (e.g., Qwen). At each layer, the sink-specific bias module aggregates sink interactions and injects this information into the standard attention computation.
- CTR Prediction Head: The final-layer representations of sink tokens, or a weighted combination of sink and non-sink tokens, are pooled and passed to a lightweight MLP to generate the predicted click probability $\hat{y}$.
This design allows the LM to restore contextual aggregation patterns characteristic of natural language, now at structured behavioral boundaries.
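As a rough illustration of the prediction head, the sketch below mean-pools the final-layer sink representations and feeds them to a one-hidden-layer MLP with a sigmoid output. Mean pooling, the ReLU activation, the layer sizes, and the toy weights are assumptions; the source only states that pooled representations go through a lightweight MLP.

```python
import math


def mean_pool(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def linear(x, weights, bias):
    """Each row of `weights` holds the input weights of one output unit."""
    return [sum(xi * wi for xi, wi in zip(x, row)) + b
            for row, b in zip(weights, bias)]


def ctr_head(final_hidden, sink_positions, w1, b1, w2, b2):
    # Pool only the sink positions (sink-only variant of the head).
    pooled = mean_pool([final_hidden[p] for p in sink_positions])
    hidden = [max(0.0, v) for v in linear(pooled, w1, b1)]   # ReLU
    logit = linear(hidden, w2, b2)[0]
    return 1.0 / (1.0 + math.exp(-logit))                   # click probability


final_hidden = [[0.2, 0.4], [1.0, 0.0], [0.1, 0.3], [0.0, 1.0]]
p_click = ctr_head(
    final_hidden, sink_positions=[1, 3],
    w1=[[1.0, 1.0], [1.0, -1.0]], b1=[0.0, 0.0],
    w2=[[2.0, -1.0]], b2=[-2.0],
)
```

The weighted sink and non-sink combination mentioned above would replace `mean_pool` with a weighted sum over all positions; it is omitted here to keep the sketch short.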
4. Training Regimen: Two-Stage Guidance
LM decoders typically over-attend to the initial token in the absence of special CLS-like tokens during pretraining. CTR-Sink introduces a two-stage training process to explicitly guide attention towards sink tokens:
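- Stage 1 (Sink-only Training): Only the sink tokens $s_1, \dots, s_K$ are provided as input. The model minimizes click prediction loss:

$$\mathcal{L}_{\text{stage1}} = \mathcal{L}_{\text{CTR}}\big(y,\ \hat{y}(s_1, \dots, s_K)\big)$$

where $y$ is the click label and $\hat{y}(\cdot)$ is the predicted click probability for the given input. This forces the network to anchor its decision solely at sink positions.
- Stage 2 (Full Sequence Training): The full interleaved sequence is reintroduced, with the objective:

$$\mathcal{L}_{\text{stage2}} = \mathcal{L}_{\text{CTR}}\big(y,\ \hat{y}(b_1, s_1, \dots, b_K, s_K)\big)$$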
Learned patterns from Stage 1 persist, guiding attention distribution during training on complete sequences and correcting the over-focus on the initial token.
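The two stages differ only in what the model sees as input, which the following sketch makes concrete. It assumes the click prediction loss is binary cross-entropy (a common choice for CTR, not confirmed here), uses a hypothetical `[SINK]` token string, and picks an arbitrary 2 + 4 epoch split purely for illustration.

```python
import math

SINK = "[SINK]"  # hypothetical sink token string


def stage_input(sequence, stage):
    """Stage 1 keeps only sink entries; stage 2 keeps the full interleaved sequence."""
    if stage == 1:
        return [entry for entry in sequence if entry[0] == SINK]
    if stage == 2:
        return list(sequence)
    raise ValueError("stage must be 1 or 2")


def binary_cross_entropy(label, prob, eps=1e-12):
    """Click prediction loss for one example (assumed to be BCE in this sketch)."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def two_stage_schedule(stage1_epochs, stage2_epochs):
    """Yield (epoch, stage): all sink-only epochs first, then full-sequence epochs."""
    for epoch in range(stage1_epochs + stage2_epochs):
        yield epoch, 1 if epoch < stage1_epochs else 2


sequence = [("Item A", None), (SINK, 400), ("Item B", None), (SINK, 250)]
stage1 = stage_input(sequence, stage=1)
schedule = list(two_stage_schedule(stage1_epochs=2, stage2_epochs=4))  # split is hypothetical
```

In a real training loop, each `(epoch, stage)` pair would select the input builder, while the model weights carry over from Stage 1 into Stage 2.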
5. Experimental Evaluation, Baselines, and Results
CTR-Sink was validated on three datasets: a large-scale industrial dataset (Ant Group, Chinese, 46M exposures), MovieLens (English, 10M exposures), and KuaiRec (Chinese, 4M exposures). The Area Under the ROC Curve (AUC) was used as the principal metric, with an AUC difference ($\Delta$AUC) above a fixed threshold treated as a significant difference in CTR tasks.
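For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen clicked example is scored above a randomly chosen non-clicked one, with ties counted as half. The sketch below implements that pairwise definition directly. It is quadratic in the number of examples and meant only to illustrate the definition, not to reflect the evaluation code used in the paper.

```python
def auc(labels, scores):
    """Pairwise AUC: P(score of a positive > score of a negative), ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


clicks = [1, 0, 1, 0]
predicted = [0.9, 0.3, 0.4, 0.6]
example_auc = auc(clicks, predicted)  # 3 of 4 positive/negative pairs are ordered correctly
```

Because AUC depends only on the ranking of scores, small absolute changes in it can still correspond to meaningful changes in ranking quality at industrial scale.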
Benchmarks included standard ID-based DNN models (DeepFM, DIN, AutoInt, DCN-V2) and a vanilla LM-CTR baseline. The following table summarizes the AUC achieved across methods:
| Method | Industry | MovieLens | KuaiRec |
|---|---|---|---|
| DeepFM | 0.7678 | 0.7803 | 0.8072 |
| DIN | 0.7712 | 0.7785 | 0.8086 |
| AutoInt | 0.7735 | 0.7812 | 0.8064 |
| DCN-V2 | 0.7719 | 0.7806 | 0.8065 |
| LM-CTR (RoBERTa Baseline) | 0.7764 | 0.7808 | 0.8133 |
| +CTR-Sink | 0.7810★ | 0.7844★ | 0.8192★ |
| LM-CTR (Qwen Baseline) | 0.7885 | 0.8177 | 0.8156 |
| +CTR-Sink | 0.7919★ | 0.8203★ | 0.8198★ |
★ denotes statistical significance.
The introduction of CTR-Sink yields consistent AUC improvements of 0.2–0.6% across all tasks and architectures, measured in absolute AUC (gains of 0.0026 to 0.0059 in the table above). Attention visualization demonstrates a substantial redistribution: over 40–50% of each layer’s attention mass is concentrated on sink tokens in the augmented LM, while the baseline exhibits diffuse or trivial peaks.
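The per-dataset gains can be checked directly from the table. The short script below copies the reported AUC values and computes the absolute difference between each LM-CTR baseline and its +CTR-Sink counterpart; the dictionary layout and variable names are illustrative only.

```python
baseline = {
    "RoBERTa": {"Industry": 0.7764, "MovieLens": 0.7808, "KuaiRec": 0.8133},
    "Qwen": {"Industry": 0.7885, "MovieLens": 0.8177, "KuaiRec": 0.8156},
}
with_sink = {
    "RoBERTa": {"Industry": 0.7810, "MovieLens": 0.7844, "KuaiRec": 0.8192},
    "Qwen": {"Industry": 0.7919, "MovieLens": 0.8203, "KuaiRec": 0.8198},
}

# Absolute AUC gain per (backbone, dataset), rounded to the table's precision.
gains = {
    (model, dataset): round(with_sink[model][dataset] - baseline[model][dataset], 4)
    for model in baseline
    for dataset in baseline[model]
}

for (model, dataset), gain in gains.items():
    print(f"{model:8s} {dataset:10s} +{gain:.4f} AUC ({gain * 100:.2f} points)")

smallest, largest = min(gains.values()), max(gains.values())
```

The smallest gain is on MovieLens with Qwen and the largest on KuaiRec with RoBERTa.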
6. Ablation Studies and Quantitative Insights
Comprehensive ablation and sensitivity studies further elucidate the CTR-Sink design:
- External Signal Ablation: Replacing temporal distance features in sink tokens with random noise diminishes AUC gains from roughly 0.4–0.6% to about 0.02%, establishing the necessity of informative, recommendation-specific signals for effective sink anchoring.
- Sequence Length Sensitivity: On MovieLens, increasing $K$ (the number of behavioral tokens) from 20 to 50 reveals saturation in vanilla LM-CTR, but +CTR-Sink exhibits monotonic AUC improvements (0.7674 → 0.7844). This suggests that sinks robustly combat semantic fragmentation in longer sequences.
- Training Epoch vs. Two-Stage Strategy: Matching two-stage training (6 epochs total) with ordinary longer training (also 6 epochs) shows that only the two-stage approach yields an additional 0.05–0.09% AUC, demonstrating that targeted attention guidance—rather than training duration alone—drives the gains.
- Inter-Sink Attention: Sink bias increases attention flow between sink positions, particularly in deep layers, quantitatively confirming that the model more effectively captures behavior-to-behavior dependencies.
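Two of the quantities discussed in this article, the share of attention mass landing on sink tokens and the attention flowing between sinks, can be computed from a single attention matrix as sketched below. The function names and the toy matrix are hypothetical, and the paper's exact measurement protocol (per head, per layer, averaging scheme) is not specified here.

```python
def sink_attention_share(attn, sink_positions):
    """Fraction of total attention mass that lands on sink key positions."""
    sinks = set(sink_positions)
    total = sum(sum(row) for row in attn)
    on_sinks = sum(v for row in attn for j, v in enumerate(row) if j in sinks)
    return on_sinks / total


def inter_sink_flow(attn, sink_positions):
    """Average attention mass that sink queries place on sink keys."""
    sinks = set(sink_positions)
    rows = [attn[p] for p in sink_positions]
    return sum(v for row in rows for j, v in enumerate(row) if j in sinks) / len(rows)


# Toy row-stochastic attention matrix; position 1 plays the role of a sink.
toy_attn = [
    [0.2, 0.6, 0.2],
    [0.1, 0.8, 0.1],
    [0.3, 0.4, 0.3],
]
share = sink_attention_share(toy_attn, [1])
flow = inter_sink_flow(toy_attn, [1])
```

Tracking both numbers per layer is one way to compare an augmented model against a baseline.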
7. Limitations and Prospects for Future Development
CTR-Sink introduces additional computational overhead by increasing sequence length and adding a per-layer self-attention bias module, potentially impacting latency in real-time systems and requiring careful engineering for deployment. The signal design for sinks is currently limited to temporal distance and semantic similarity, and could potentially be enhanced with features such as item popularity or user–item affinity. While the current framework operates on unimodal behavioral data, multi-modal extension (e.g., incorporating dwell time or visual interactions) represents a promising direction.
Scalability for ultra-long user histories may necessitate integration with sparse attention or retrieval-augmented methods. A deeper theoretical understanding of sink emergence in LM pretraining, and optimal alignment with downstream recommendation objectives, remains an open area of research (Li et al., 5 Aug 2025).