LightSUAN CTR Prediction Model
- LightSUAN is a CTR prediction model that integrates sparse self-attention and parallel inference to optimize online performance.
- It employs a knowledge distillation strategy from the SUAN teacher model to maintain high predictive accuracy while meeting low latency requirements.
- Empirical results show gains in CTR and CPM with only a modest increase in inference time, demonstrating its suitability for practical deployment.
LightSUAN is a click-through rate (CTR) prediction model designed for high-performance, low-latency online inference scenarios. Based on the SUAN (Stacked Unified Attention Network) architecture, LightSUAN leverages sparse attention and parallel inference optimizations combined with a knowledge distillation procedure to achieve the predictive power of large-scale attention-based models while conforming to the stringent latency requirements of online systems. Empirical deployment has demonstrated significant improvements in both CTR and revenue-related metrics when integrated at scale.
1. Architectural Foundations
LightSUAN is derived from SUAN, an attention-based model that adopts a modular structure centered on the Unified Attention Block (UAB). The overall workflow comprises the following components:
- Input Layer: User behavior sequences (augmented with candidate features to form a target-aware sequence), user profile features, and other features are embedded via a uniform embedding table. The target-aware sequence is represented as $S \in \mathbb{R}^{L \times F \times d}$, with $L$ as sequence length, $F$ as feature count, and $d$ as embedding dimension. Profile features ($E_p$) and other features ($E_o$) are also separately embedded.
- Unified Attention Block (UAB):
- RMSNorm Pre-normalization: Computes $\tilde{S} = \mathrm{RMSNorm}(S)$ before the attention layer.
- Self-attention Layer: With an attention bias $B$ computed from relative time and position, the attention output is
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}} + B\right)V,$$
with $Q$, $K$, and $V$ as linear projections of $\tilde{S}$.
- Adaptive Fusion Network (AFNet): Comprises a cross-attention layer (the sequence queries the profile embedding to form a cross-augmented sequence) and a dual alignment attention layer (employing fully-connected gating to fuse the self- and cross-attended embeddings).
- Feedforward Network: Applies a SwiGLU-activated feedforward block to finalize feature mixing.
- Prediction Layer: The final candidate representation $h_{\mathrm{cand}}$ from the last UAB, the flattened profile embedding, and the other-feature embedding are concatenated and passed through an MLP to produce the probability estimate
$$\hat{y} = \sigma\big(\mathrm{MLP}([\,h_{\mathrm{cand}}\,\|\,E_p\,\|\,E_o\,])\big),$$
where $\|$ denotes concatenation and $\sigma$ the sigmoid function.
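A minimal PyTorch sketch of this block structure is given below, assuming flattened per-position embeddings of size `d_model`, single-head attention, and a precomputed additive attention bias; the AFNet branch is omitted, and the class and argument names (`UnifiedAttentionBlock`, `attn_bias`) are illustrative rather than the reference implementation.

```python
# Sketch of a Unified Attention Block: RMSNorm pre-normalization, biased
# self-attention, and a SwiGLU feedforward network, each with a residual path.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x):
        # Normalize by the root-mean-square of the features, then rescale.
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * inv_rms * self.weight


class SwiGLU(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


class UnifiedAttentionBlock(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.norm1 = RMSNorm(d_model)
        self.norm2 = RMSNorm(d_model)
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.ffn = SwiGLU(d_model, d_hidden)

    def forward(self, seq, attn_bias):
        # seq: (batch, L, d_model); attn_bias: (batch, L, L) additive bias
        # derived from relative time and position.
        x = self.norm1(seq)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1)) + attn_bias
        seq = seq + torch.softmax(scores, dim=-1) @ v   # attention + residual
        seq = seq + self.ffn(self.norm2(seq))           # feedforward + residual
        return seq


# Usage with dummy shapes:
block = UnifiedAttentionBlock(d_model=64, d_hidden=128)
seq = torch.randn(2, 50, 64)       # (batch, L, d_model)
bias = torch.zeros(2, 50, 50)      # relative time/position bias
out = block(seq, bias)             # (2, 50, 64)
```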
LightSUAN modifies this backbone to enhance online efficiency. The two critical adaptations are:
- Sparse Self-Attention: Replaces full attention with a combination of local self-attention (window size $w$) and dilated self-attention (stride $s$), reducing runtime complexity (see the mask sketch after this list).
- Parallel Inference: Decouples behavior encoding from candidate-specific computations so that the user representation can be shared across candidates, enabling parallel scoring.
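The snippet below illustrates one way the sparse pattern can be expressed as a boolean attention mask, with each query attending to keys inside a local window of size $w$ plus keys whose distance from the query is a multiple of the stride $s$; the mask-based formulation is an assumption for illustration, and the deployed kernel may differ.

```python
# Illustrative sparse attention mask combining local-window and dilated patterns.
import torch


def sparse_attention_mask(seq_len: int, window: int, stride: int) -> torch.Tensor:
    """Return a (seq_len, seq_len) boolean mask; True marks allowed key positions."""
    idx = torch.arange(seq_len)
    rel = idx.view(-1, 1) - idx.view(1, -1)   # query index minus key index
    local = rel.abs() < window                # keys inside the local window
    dilated = (rel % stride) == 0             # keys at multiples of the stride
    return local | dilated


mask = sparse_attention_mask(seq_len=8, window=2, stride=4)
# The mask is applied by suppressing disallowed attention scores, e.g.:
# scores = scores.masked_fill(~mask, float("-inf"))
```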
2. Scaling Laws for CTR Prediction
SUAN empirically demonstrates scaling laws analogous to those observed in LLMs, characterized by predictable performance improvements as model complexity and data size increase:
- Model Grade / Parameter Count: For fixed data, AUC scales as
$$\mathrm{AUC}(N) = \mathrm{AUC}_{\max} - \alpha\,(N + N_0)^{-\beta},$$
where $N$ is the number of non-embedding parameters, $\mathrm{AUC}_{\max}$ the performance upper bound, and $N_0$ offsets the contribution of non-scalable layers.
- Sequence Length: Performance with respect to sequence length $L$ follows an analogous form,
$$\mathrm{AUC}(L) = \mathrm{AUC}_{\max} - \alpha_L\, L^{-\beta_L}.$$
- Data Size: Scaling to larger datasets ($D$ samples) yields
$$\mathrm{AUC}(D) = \mathrm{AUC}_{\max} - \alpha_D\, D^{-\beta_D}.$$
Graphical analysis (see Figures 3a–3c in (Lai et al., 21 Aug 2025)) shows close fits to these functional forms. This scaling suggests that LightSUAN, benefiting from SUAN via distillation, inherits the property that additional parameters, longer behavioral histories, and greater data volume deliver continual (albeit diminishing) AUC improvements.
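As an illustration of how such a law can be fitted in practice, the sketch below fits the saturating power-law form for parameter count with `scipy.optimize.curve_fit`; the data points are placeholders, not values reported in the paper.

```python
# Fit AUC(N) = AUC_max - a * (N + N0)^(-b) to (parameter count, AUC) pairs.
import numpy as np
from scipy.optimize import curve_fit


def scaling_law(n, auc_max, a, n0, b):
    return auc_max - a * np.power(n + n0, -b)


# Hypothetical (non-embedding parameter count, offline AUC) observations.
params = np.array([1e5, 5e5, 2e6, 1e7, 5e7])
auc = np.array([0.752, 0.761, 0.768, 0.773, 0.776])

popt, _ = curve_fit(
    scaling_law, params, auc,
    p0=[0.78, 1.0, 1e5, 0.3], maxfev=20000,
)
print(dict(zip(["auc_max", "a", "n0", "b"], popt)))
```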
3. Knowledge Distillation Strategy
To reconcile the accuracy-latency tradeoff, an online distillation procedure is employed whereby the high-capacity SUAN model serves as a teacher for the lightweight, deployable LightSUAN student:
- Input Alignment: Both the teacher (a high-grade SUAN) and the student (LightSUAN) process matched inputs; the teacher may use longer behavior sequences than the student for greater expressivity.
- Temperature Scaling and Losses: Teacher and student logits ($z_T$, $z_S$) are softened by a temperature $\tau$ to produce probabilities $p_T$ and $p_S$, respectively. The total loss is
$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}}(y, \hat{y}_T) + \mathcal{L}_{\mathrm{CE}}(y, \hat{y}_S) + \lambda\,\mathcal{L}_{\mathrm{KD}}(p_T, p_S),$$
where the cross-entropy terms guide both the teacher and the student, and the distillation term $\mathcal{L}_{\mathrm{KD}}(p_T, p_S)$ ensures the student mimics the teacher's predictions. The weighting $\lambda$ is set proportional to $\tau^2$.
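A minimal PyTorch sketch of this objective follows, assuming a binary CTR label and sigmoid outputs; the temperature value, the $\lambda = \tau^2$ weighting, and the use of binary cross-entropy for the distillation term follow standard knowledge-distillation practice and stand in for any details not quoted above.

```python
# Joint online-distillation loss: hard-label terms for teacher and student,
# plus a temperature-softened term pulling the student toward the teacher.
import torch
import torch.nn.functional as F


def distillation_loss(z_teacher, z_student, y, tau: float = 2.0):
    # Hard-label cross-entropy for teacher and student (trained jointly online).
    ce_teacher = F.binary_cross_entropy_with_logits(z_teacher, y)
    ce_student = F.binary_cross_entropy_with_logits(z_student, y)

    # Temperature-softened targets; the teacher is detached so the distillation
    # term only updates the student.
    p_teacher = torch.sigmoid(z_teacher.detach() / tau)
    kd = F.binary_cross_entropy_with_logits(z_student / tau, p_teacher)

    lam = tau ** 2  # compensates for the 1/tau scaling of the soft-target gradients
    return ce_teacher + ce_student + lam * kd


# Example with dummy logits and labels:
z_t = torch.randn(4)
z_s = torch.randn(4, requires_grad=True)
y = torch.tensor([1.0, 0.0, 0.0, 1.0])
loss = distillation_loss(z_t, z_s, y)
loss.backward()
```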
This process enables LightSUAN to "absorb" the advanced sequence modeling capacity of SUAN, yielding a model that approaches teacher-level performance while retaining optimized inference pathways.
4. Efficiency Mechanisms for Online Deployment
LightSUAN introduces two core algorithmic innovations ensuring online practicality:
- Sparse Self-Attention: By combining local attention (window size $w$) and dilated attention (stride $s$), each query attends to at most $w + L/s$ keys, reducing the per-layer attention cost from $O(L^2)$ to $O\big(L\,(w + L/s)\big)$ and enabling real-time processing of extended behavior sequences.
- Parallel Inference: In real-world deployment, multiple candidate items must be scored for a user in parallel. LightSUAN's architecture decouples the history encoding from candidate features, allowing the user representation to be computed once and reused for all candidates in a mini-batch, amortizing the computational cost.
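The sketch below illustrates this amortization pattern: the behavior-sequence representation is computed once per user and broadcast across all candidates in a single batched scoring pass. The module and variable names (`CandidateScorer`, `user_repr`) are illustrative, not the deployed interface.

```python
# Score many candidates against one precomputed, candidate-independent user
# representation in a single batched forward pass.
import torch
import torch.nn as nn


class CandidateScorer(nn.Module):
    def __init__(self, d_user: int, d_cand: int, d_hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_user + d_cand, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, 1),
        )

    def forward(self, user_repr, cand_emb):
        # user_repr: (d_user,) computed once; cand_emb: (num_cands, d_cand).
        expanded = user_repr.unsqueeze(0).expand(cand_emb.size(0), -1)
        scores = self.mlp(torch.cat([expanded, cand_emb], dim=-1))
        return torch.sigmoid(scores).squeeze(-1)


# user_repr would come from the (candidate-independent) behavior encoder;
# all candidates are then scored together.
scorer = CandidateScorer(d_user=64, d_cand=32)
user_repr = torch.randn(64)
cand_emb = torch.randn(100, 32)        # 100 candidates scored in parallel
ctr_scores = scorer(user_repr, cand_emb)
```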
These mechanisms, implemented atop the SUAN architecture, enable LightSUAN to deliver high throughput and meet latency budgets crucial for online personalized recommendation and advertising pipelines.
5. Empirical Performance and Business Impact
Systematic evaluation, both offline and in live A/B testing, demonstrates the impact of LightSUAN deployment:
- Offline Metrics: SUAN (and by extension, LightSUAN) substantially outperforms established baselines such as DIN and CAN in AUC, across all tested sequence lengths and model configurations.
- Online Outcomes:
- CTR: A 2.81% increase in click-through rate is recorded following integration of the distilled LightSUAN into an online service.
- CPM: Cost per mille rises by 1.69% under equivalent deployment.
- Latency: Average inference time remains within practical limits, increasing modestly from 33 ms to 43–48 ms.
The deployment leads to enhanced user engagement (as reflected by CTR) and improved efficiency of advertising revenue (as measured by CPM), all achieved without breaching inference latency constraints.
6. Contextual Significance and Practical Implications
LightSUAN operationalizes the scaling law paradigm from LLMs within the context of recommendation and advertising systems. By leveraging the scalable expressivity of SUAN and distilling it into an efficient model, the approach directly addresses the dual challenge of accuracy and speed endemic to online personalized services.
The successful integration of LightSUAN illustrates that attention-based, scaling law–oriented architectures adapted with algorithmic optimizations and distillation can be deployed at scale, yielding measurable improvements in both user engagement and business outcomes in production settings. The source code supporting implementation and further research is available at https://github.com/laiweijiang/SUAN.