LightSUAN CTR Prediction Model

Updated 24 August 2025
  • LightSUAN is a CTR prediction model that integrates sparse self-attention and parallel inference to optimize online performance.
  • It employs a knowledge distillation strategy from the SUAN teacher model to maintain high predictive accuracy while meeting low latency requirements.
  • Empirical results show improvements in CTR and CPM with only a modest increase in inference time, demonstrating its suitability for practical deployment.

LightSUAN is a click-through rate (CTR) prediction model designed for high-performance, low-latency online inference scenarios. Based on the SUAN (Stacked Unified Attention Network) architecture, LightSUAN leverages sparse attention and parallel inference optimizations combined with a knowledge distillation procedure to achieve the predictive power of large-scale attention-based models while conforming to the stringent latency requirements of online systems. Empirical deployment has demonstrated significant improvements in both CTR and revenue-related metrics when integrated at scale.

1. Architectural Foundations

LightSUAN is derived from SUAN, an attention-based model that adopts a modular structure centered on the Unified Attention Block (UAB). The overall workflow comprises the following components:

  • Input Layer: User behavior sequences (augmented with candidate features to form a target-aware sequence), user profile features, and other features are embedded via a uniform embedding table. The target-aware sequence is represented as $E_s \in \mathbb{R}^{L \times n_1 d}$, where $L$ is the sequence length, $n_1$ the number of features, and $d$ the embedding dimension. Profile features ($E_p$) and other features ($e_\text{other}$) are embedded separately.
  • Unified Attention Block (UAB):
    • RMSNorm Pre-normalization: Computes $E_\text{norm} = \mathrm{RMSNorm}(E_s)$.
    • Self-attention Layer: With an attention bias computed from relative time and position, i.e., $\text{bias}_{(ij)} = f_1(\Delta t_{(ij)}) + f_2(\Delta p_{(ij)})$, the attention output is:

    $$\mathrm{Attention}(Q_1, K_1, V_1) = \operatorname{softmax}\left(\frac{Q_1 K_1^\top}{\sqrt{n_1 d}} + \text{bias}\right)V_1$$

    where $Q_1, K_1, V_1$ are linear projections of $E_\text{norm}$.
    • Adaptive Fusion Network (AFNet): Comprises a cross-attention layer (the sequence queries the profile embedding to form a cross-augmented sequence) and a dual alignment attention layer (fully connected gating that fuses the self- and cross-attention embeddings).
    • Feedforward Network: Applies a SwiGLU-activated feedforward block to finalize feature mixing.

  • Prediction Layer: The final candidate representation from the last UAB, flattened profile, and other features are concatenated and passed through an MLP to produce the probability estimate:

$$\hat{y} = \sigma(z), \quad z = \operatorname{MLP}\big(E_\text{block}[-1,:],\ e_p,\ e_\text{other}\big)$$
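
For concreteness, the following PyTorch sketch traces one UAB plus the prediction head as described above. It is a reconstruction under simplifying assumptions, not the authors' implementation: the module names, the single-head attention, and the collapsed AFNet gating are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedAttentionBlock(nn.Module):
    """Simplified UAB: RMSNorm -> biased self-attention -> cross-attention over the
    profile (a reduced stand-in for AFNet) -> SwiGLU feedforward."""

    def __init__(self, dim: int, ffn_dim: int):
        super().__init__()
        self.norm = nn.RMSNorm(dim)          # requires PyTorch >= 2.4; substitute LayerNorm otherwise
        self.qkv = nn.Linear(dim, 3 * dim)
        self.cross_q = nn.Linear(dim, dim)
        self.cross_kv = nn.Linear(dim, 2 * dim)
        self.ffn_gate = nn.Linear(dim, ffn_dim)   # SwiGLU gate branch
        self.ffn_val = nn.Linear(dim, ffn_dim)    # SwiGLU value branch
        self.ffn_out = nn.Linear(ffn_dim, dim)

    def forward(self, e_s, e_p, bias):
        # e_s: (B, L, dim) target-aware behavior sequence, dim = n_1 * d
        # e_p: (B, P, dim) profile feature embeddings
        # bias: (B, L, L) relative time/position bias added to the attention logits
        h = self.norm(e_s)
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / h.size(-1) ** 0.5 + bias, dim=-1)
        self_out = attn @ v
        # Cross-attention: the sequence queries the profile embedding. The paper's
        # dual alignment gating is collapsed into a plain residual sum here.
        cq = self.cross_q(self_out)
        ck, cv = self.cross_kv(e_p).chunk(2, dim=-1)
        cross = torch.softmax(cq @ ck.transpose(-2, -1) / cq.size(-1) ** 0.5, dim=-1) @ cv
        h = e_s + self_out + cross
        # SwiGLU feedforward
        return h + self.ffn_out(F.silu(self.ffn_gate(h)) * self.ffn_val(h))

class PredictionHead(nn.Module):
    """Concatenates the last sequence position with flattened profile and other
    features, then applies an MLP and a sigmoid, matching the formula above."""

    def __init__(self, dim: int, p_dim: int, other_dim: int, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim + p_dim + other_dim, hidden),
                                 nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, e_block, e_p_flat, e_other):
        z = self.mlp(torch.cat([e_block[:, -1, :], e_p_flat, e_other], dim=-1))
        return torch.sigmoid(z).squeeze(-1)
```

LightSUAN stacks such blocks; its adaptations below change only the self-attention pattern and how candidate-specific computation enters the block.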

LightSUAN modifies this backbone to enhance online efficiency. The two critical adaptations are:

  • Sparse Self-Attention: Replaces full attention with a combination of local self-attention (window size $k$) and dilated self-attention (stride $r$), reducing runtime complexity (a mask-construction sketch follows this list).

  • Parallel Inference: Decouples behavior encoding from candidate-specific computations such that the user representation can be shared across candidates, enabling parallel scoring.
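
A minimal sketch of the combined local/dilated pattern, assuming the two patterns are simply OR-ed into one boolean mask (the exact combination rule used by LightSUAN is not spelled out here):

```python
import torch

def sparse_attention_mask(L: int, k: int, r: int) -> torch.Tensor:
    """Boolean (L, L) mask, True where attention is allowed: a local window of
    size k around each position plus a dilated pattern with stride r."""
    idx = torch.arange(L)
    dist = (idx[:, None] - idx[None, :]).abs()
    local = dist <= k // 2          # local attention window
    dilated = (dist % r) == 0       # attend to every r-th position
    return local | dilated

# Each query attends to roughly k + L/r keys instead of L, which is the source of
# the O(L * (k + L/r)) complexity quoted in Section 4.
mask = sparse_attention_mask(L=1024, k=64, r=32)
print(mask.float().mean().item())   # fraction of attention entries retained
```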

2. Scaling Laws for CTR Prediction

SUAN empirically demonstrates scaling laws analogous to those observed in LLMs, characterized by predictable performance improvements as model complexity and data size increase:

  • Model Grade / Parameter Count: For fixed data, AUC scales as

$$\mathrm{AUC}(C) = E_1 - \frac{A_1}{(C - B_1)^{\alpha}}$$

where $C$ is the number of non-embedding parameters, $E_1$ the performance upper bound, and $B_1$ an offset accounting for the contribution of non-scalable layers.

  • Sequence Length: Performance with respect to sequence length $L$ follows

$$\mathrm{AUC}(L) = E_2 - \frac{A_2}{L^{\beta}}$$

  • Data Size: Scaling to larger datasets ($D$ samples) yields

$$\mathrm{AUC}(D) = E_3 - \frac{A_3}{D^{\gamma}}$$

Graphical analysis (see Figures 3a–3c in (Lai et al., 21 Aug 2025)) reveals high $R^2$ fits to these functional forms. This scaling behavior suggests that LightSUAN, which benefits from SUAN via distillation, inherits the property that additional parameters, longer behavioral histories, and greater data volume deliver continual (albeit diminishing) AUC improvements.
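
As an illustration of how such a law can be fitted, the sketch below uses `scipy.optimize.curve_fit` on synthetic (C, AUC) points; the numbers are placeholders, not the paper's measurements.

```python
import numpy as np
from scipy.optimize import curve_fit

def auc_vs_params(C, E, A, B, alpha):
    """Saturating power law AUC(C) = E - A / (C - B)^alpha from Section 2."""
    return E - A / np.power(C - B, alpha)

# Synthetic stand-ins for the (parameter count, AUC) measurements behind Figure 3a.
C = np.array([1e5, 3e5, 1e6, 3e6, 1e7, 3e7])
auc = np.array([0.780, 0.792, 0.801, 0.807, 0.811, 0.813])

popt, _ = curve_fit(
    auc_vs_params, C, auc,
    p0=[0.82, 1.0, 0.0, 0.3],                                # initial (E, A, B, alpha)
    bounds=([0.7, 1e-6, -1e5, 0.01], [1.0, 1e3, 5e4, 2.0]),  # keep C - B > 0
)
E, A, B, alpha = popt
print(f"fitted upper bound E = {E:.4f}, exponent alpha = {alpha:.3f}")
```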

3. Knowledge Distillation Strategy

To reconcile the accuracy-latency tradeoff, an online distillation procedure is employed whereby the high-capacity SUAN model serves as a teacher for the lightweight, deployable LightSUAN student:

  • Input Alignment: Both the teacher (high-grade SUAN) and the student (LightSUAN) process matched inputs; the teacher may use longer behavior sequences ($S'$) than the student ($S$) for greater expressivity.

  • Temperature Scaling and Losses: Teacher and student logits ($z$, $z'$) are softened by a temperature $t$ to produce probabilities $\hat{y}_t = \sigma(z/t)$ and $\hat{y}'_t = \sigma(z'/t)$, respectively. The total loss is:

$$\text{loss} = L_\text{ce}(\hat{y}, y) + L_\text{ce}(\hat{y}', y) + \lambda \cdot L_\text{ce}(\hat{y}_t, \hat{y}'_t)$$

where the cross-entropy terms guide both the teacher and the student against the ground-truth label $y$, and the distillation term $L_\text{ce}(\hat{y}_t, \hat{y}'_t)$ ensures the student mimics the teacher's softened predictions. The weighting $\lambda$ is set proportional to $t$.
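
A minimal PyTorch sketch of this objective, assuming binary cross-entropy for every term, a detached teacher target in the distillation term, and $\lambda = c \cdot t$ with an assumed constant $c$:

```python
import torch
import torch.nn.functional as F

def online_distillation_loss(z_teacher, z_student, y, t: float = 2.0, c: float = 1.0):
    """loss = L_ce(y_hat, y) + L_ce(y_hat', y) + lambda * L_ce(y_hat_t, y_hat'_t).

    z_teacher, z_student: raw logits from SUAN (teacher) and LightSUAN (student).
    y: binary click labels. lambda = c * t follows "proportional to t" above; the
    constant c and the detached teacher target are assumptions of this sketch.
    """
    y = y.float()
    ce_teacher = F.binary_cross_entropy_with_logits(z_teacher, y)   # supervises the teacher
    ce_student = F.binary_cross_entropy_with_logits(z_student, y)   # supervises the student
    p_teacher_t = torch.sigmoid(z_teacher / t).detach()             # softened teacher target
    p_student_t = torch.sigmoid(z_student / t)                      # softened student prediction
    distill = F.binary_cross_entropy(p_student_t, p_teacher_t)
    return ce_teacher + ce_student + c * t * distill

# Example with random tensors:
z_t, z_s = torch.randn(8), torch.randn(8)
labels = torch.randint(0, 2, (8,))
print(online_distillation_loss(z_t, z_s, labels))
```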

This process enables LightSUAN to "absorb" the advanced sequence modeling capacity of SUAN, yielding a model that approaches teacher-level performance while retaining optimized inference pathways.

4. Efficiency Mechanisms for Online Deployment

LightSUAN introduces two core algorithmic innovations ensuring online practicality:

  • Sparse Self-Attention: By combining local attention (window $k$) and dilated attention (stride $r$), computational complexity is reduced from $O(L^2)$ to $O(L \cdot (k + L/r))$, enabling real-time processing of extended behavior sequences.

  • Parallel Inference: In real-world deployment, multiple candidate items must be scored for a user in parallel. LightSUAN's architecture decouples the history encoding from candidate-specific features, allowing the user representation to be computed once and reused for all $m_2$ candidates in a mini-batch, amortizing the computational cost, as sketched below.
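
The split can be illustrated with the following sketch, in which a generic recurrent encoder stands in for LightSUAN's history encoder; only the call pattern (encode once, score all candidates in one batch) reflects the mechanism described above, not the actual layers.

```python
import torch
import torch.nn as nn

class ParallelScorer(nn.Module):
    """Illustrative split of history encoding and per-candidate scoring.

    encode_user is candidate-independent and runs once per request; its output is
    broadcast across all candidates, which are scored in a single batched pass.
    Module choices and shapes are placeholders, not LightSUAN's architecture.
    """
    def __init__(self, dim: int = 64):
        super().__init__()
        self.user_encoder = nn.GRU(dim, dim, batch_first=True)   # stand-in for the UAB stack
        self.scorer = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def encode_user(self, behavior_seq):
        # behavior_seq: (1, L, dim) -> (1, dim), computed once per request
        _, h = self.user_encoder(behavior_seq)
        return h[-1]

    def score(self, user_repr, candidate_embs):
        # candidate_embs: (m2, dim); reuse the single user representation for all of them
        user_rep = user_repr.expand(candidate_embs.size(0), -1)
        return torch.sigmoid(self.scorer(torch.cat([user_rep, candidate_embs], dim=-1))).squeeze(-1)

model = ParallelScorer()
user = model.encode_user(torch.randn(1, 128, 64))   # amortized over all candidates
scores = model.score(user, torch.randn(50, 64))     # 50 candidates scored in one pass
print(scores.shape)  # torch.Size([50])
```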

These mechanisms, implemented atop the SUAN architecture, enable LightSUAN to deliver high throughput and meet latency budgets crucial for online personalized recommendation and advertising pipelines.

5. Empirical Performance and Business Impact

Systematic evaluation, both offline and in live A/B testing, demonstrates the impact of LightSUAN deployment:

  • Offline Metrics: SUAN (and by extension, LightSUAN) substantially outperforms established baselines such as DIN and CAN in AUC, across all tested sequence lengths and model configurations.

  • Online Outcomes:

    • CTR: A 2.81% increase in click-through rate is recorded following integration of the distilled LightSUAN into an online service.
    • CPM: Cost per mille rises by 1.69% under equivalent deployment.
    • Latency: Average inference time remains within practical limits, increasing modestly from 33 ms to 43–48 ms.

The deployment leads to enhanced user engagement (as reflected by CTR) and improved efficiency of advertising revenue (as measured by CPM), all achieved without breaching inference latency constraints.

6. Contextual Significance and Practical Implications

LightSUAN operationalizes the scaling law paradigm from LLMs within the context of recommendation and advertising systems. By leveraging the scalable expressivity of SUAN and distilling it into an efficient model, the approach directly addresses the dual challenge of accuracy and speed endemic to online personalized services.

The successful integration of LightSUAN illustrates that attention-based, scaling law–oriented architectures adapted with algorithmic optimizations and distillation can be deployed at scale, yielding measurable improvements in both user engagement and business outcomes in production settings. The source code supporting implementation and further research is available at https://github.com/laiweijiang/SUAN.

