Stacked Target-to-History Cross Attention
- The paper introduces STCA as an efficient attention mechanism that focuses on target-to-history interactions to achieve linear complexity.
- STCA replaces full self-attention with a stacked single-query multi-head cross attention, significantly lowering computational costs in long-sequence modeling.
- Empirical results show that STCA delivers comparable or superior ranking quality with up to 7.4× lower compute, making it ideal for large-scale recommender systems.
Stacked Target-to-History Cross Attention (STCA) is an attention mechanism introduced for industrial-scale long-sequence modeling, specifically designed to process extremely long user histories—on the order of 10,000 tokens—in real-time recommender systems operating under stringent latency and compute budgets. STCA substitutes conventional full self-attention over targets and history with a stacked strategy of single-query, multi-head cross-attention from target to history, achieving linear complexity with sequence length. This architectural innovation enables end-to-end modeling at a scale previously impractical for production settings, as demonstrated in deployment at full traffic on Douyin (Guan et al., 8 Nov 2025).
1. Motivation and Conceptual Foundations
Traditional recommender systems that rely on self-attention face prohibitive resource consumption when ingesting sequences of length $L$: the per-layer FLOPs and memory of a standard Transformer scale as $O(L^2 d)$, making it infeasible for low-latency, high-throughput applications. Empirically, most of the utility for ranking a candidate target arises from its direct interactions with history items, rather than from the inter-history dependencies modeled by self-attention. STCA therefore omits intra-history attention and focuses solely on target-to-history (T→H) interactions, yielding an attention block whose cost scales as $O(L h d_h) = O(L d)$, where $d$ is the embedding dimension, $h$ the number of heads, and $d_h = d/h$ the per-head dimension. This trade-off substantially reduces compute while retaining the signal most relevant for target ranking.
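To make the scaling gap concrete, the idealized per-layer attention costs implied above can be compared directly; the ratio below is simple arithmetic on the asymptotics stated here, not a figure from the paper:

$$
C_{\text{self}} \approx L^2 d, \qquad C_{\text{T}\to\text{H}} \approx L\,d, \qquad \frac{C_{\text{self}}}{C_{\text{T}\to\text{H}}} \approx L \;=\; 10^4 \ \text{ for } L = 10{,}000 .
$$

The end-to-end advantage reported in Section 3 is far smaller (roughly 7.4×), since both architectures also spend FLOPs outside the attention scores.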
2. Formal Definition and Layer Structure
A single STCA layer operates as follows:
- Notation:
  - History embeddings $X = [x_1, \dots, x_L] \in \mathbb{R}^{L \times d}$
  - Target embedding $x_t \in \mathbb{R}^{d}$
  - Number of heads $h$, head dimension $d_h = d/h$
  - Parameters: per layer $i$ and head $r$, projections $W_Q^{(i,r)}, W_K^{(i,r)}, W_V^{(i,r)} \in \mathbb{R}^{d \times d_h}$; output projection $W_O^{(i)} \in \mathbb{R}^{h d_h \times d}$; query-fusion matrices $W_C^{(i)}$ and final projection $W_Z$
- Step 1: Input Encoding (Eq. 1)
  $$q^{(1)} = \mathrm{LayerNorm}\!\big(\mathrm{SwiGLU\text{-}FFN}^{(1)}(x_t)\big), \qquad X^{(i)} = \mathrm{LayerNorm}\!\big(\mathrm{SwiGLU\text{-}FFN}^{(i)}(X)\big),$$
  with SwiGLU-FFN defined as
  $$\mathrm{SwiGLU\text{-}FFN}(x) = \big(\mathrm{SiLU}(x W_1) \odot x W_2\big)\, W_3 .$$
- Step 2: Multi-Head Target→History Cross Attention (per layer $i$, head $r$)
  - Raw scores: $s^{(i,r)} = \dfrac{\big(q^{(i)} W_Q^{(i,r)}\big)\big(X^{(i)} W_K^{(i,r)}\big)^{\top}}{\sqrt{d_h}} \in \mathbb{R}^{1 \times L}$
  - Weights: $\alpha^{(i,r)} = \mathrm{softmax}\big(s^{(i,r)}\big)$
  - Weighted sum: $o^{(i,r)} = \alpha^{(i,r)}\,\big(X^{(i)} W_V^{(i,r)}\big) \in \mathbb{R}^{1 \times d_h}$
  - Output projection: $o^{(i)} = \big[o^{(i,1)}, \dots, o^{(i,h)}\big]\, W_O^{(i)} \in \mathbb{R}^{1 \times d}$
- Step 3: Layer Stacking and Target-Conditioned Query Fusion. For layer $i+1$, the query is constructed by fusing the previous outputs and the raw target:
  $$q^{(i+1)} = \mathrm{SwiGLU\text{-}FFN}^{(i+1)}\!\big(\big[o^{(1)}, \dots, o^{(i)}, x_t\big]\, W_C^{(i+1)}\big).$$
  After $M$ layers, the history summary is the concatenation $\big[o^{(1)}, \dots, o^{(M)}\big]$. The final target-conditioned representation for prediction is
  $$z = \mathrm{SwiGLU\text{-}FFN}\!\big(\big[o^{(1)}, \dots, o^{(M)}, x_t\big]\, W_Z\big).$$
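The layer structure above maps directly onto a short PyTorch sketch. The module below is an illustrative reconstruction, not the paper's code: the class and argument names (`StackedTargetHistoryAttention`, `ffn_mult`, ...) are invented here, the fusion matrices $W_C^{(i)}$ and $W_Z$ are folded into the input layers of the corresponding SwiGLU-FFNs, and padding masks, dropout, and side features are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUFFN(nn.Module):
    """Standard SwiGLU feed-forward block: (SiLU(x W1) * x W2) W3."""
    def __init__(self, d_in, d_hidden, d_out):
        super().__init__()
        self.w1 = nn.Linear(d_in, d_hidden, bias=False)
        self.w2 = nn.Linear(d_in, d_hidden, bias=False)
        self.w3 = nn.Linear(d_hidden, d_out, bias=False)

    def forward(self, x):
        return self.w3(F.silu(self.w1(x)) * self.w2(x))


class StackedTargetHistoryAttention(nn.Module):
    """Sketch of stacked single-query target-to-history cross attention (STCA)."""
    def __init__(self, d=128, heads=4, layers=4, ffn_mult=2):
        super().__init__()
        assert d % heads == 0
        self.h, self.M, self.d = heads, layers, d
        self.q_enc = SwiGLUFFN(d, ffn_mult * d, d)                       # initial target query (Step 1)
        self.x_enc = nn.ModuleList([SwiGLUFFN(d, ffn_mult * d, d) for _ in range(layers)])
        self.q_norm = nn.LayerNorm(d)
        self.x_norms = nn.ModuleList([nn.LayerNorm(d) for _ in range(layers)])
        # Fused per-layer q/k/v/output projections; equivalent to per-head W_Q, W_K, W_V, W_O.
        self.proj = nn.ModuleList([
            nn.ModuleDict({k: nn.Linear(d, d, bias=False) for k in ("q", "k", "v", "o")})
            for _ in range(layers)
        ])
        # Query fusion for layers 2..M: [o^(1..i), x_t] -> d (Step 3, W_C folded into the FFN).
        self.fuse = nn.ModuleList([
            SwiGLUFFN((i + 1) * d, ffn_mult * d, d) for i in range(1, layers)
        ])
        self.out = SwiGLUFFN((layers + 1) * d, ffn_mult * d, d)          # final summary z (W_Z folded in)

    def forward(self, x_t, X):
        # x_t: [B, d] target embedding; X: [B, L, d] history embeddings.
        B, L, d = X.shape
        dh = d // self.h
        q = self.q_norm(self.q_enc(x_t))                                 # [B, d]
        outs = []
        for i in range(self.M):
            Xi = self.x_norms[i](self.x_enc[i](X))                       # encode history: [B, L, d]
            p = self.proj[i]
            Q = p["q"](q).view(B, self.h, 1, dh)                         # single query per head
            K = p["k"](Xi).view(B, L, self.h, dh).transpose(1, 2)        # [B, h, L, dh]
            V = p["v"](Xi).view(B, L, self.h, dh).transpose(1, 2)        # [B, h, L, dh]
            attn = torch.softmax(Q @ K.transpose(-1, -2) / dh ** 0.5, dim=-1)  # [B, h, 1, L]
            o = (attn @ V).transpose(1, 2).reshape(B, d)                 # concat heads -> [B, d]
            outs.append(p["o"](o))                                       # o^(i)
            if i + 1 < self.M:
                q = self.fuse[i](torch.cat(outs + [x_t], dim=-1))        # target-conditioned query fusion
        return self.out(torch.cat(outs + [x_t], dim=-1))                 # z: [B, d]
```

Calling the module with a target batch `x_t` of shape `[B, d]` and a history tensor `X` of shape `[B, L, d]` returns the target-conditioned summary `z` of shape `[B, d]`.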
3. Computational Complexity and Scalability
STCA reduces per-layer attention complexity from the quadratic $O(L^2 d)$ of standard self-attention to linear $O(L d)$. The paper further describes a reordering of the single-query attention computation ("attn-optimized") that lowers the cost per head beyond the naive formulation. Empirical scaling results (Fig. 5 of the paper) show that, as history length grows from 500 to 10,000, STCA's FLOPs increase roughly linearly, whereas Transformer self-attention grows roughly quadratically. In production, a 4-layer STCA model (21 GFLOPs) matches or exceeds a 4-layer Transformer (156 GFLOPs) in NLL, roughly 7.4× lower compute for matched quality.
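The linear-versus-quadratic growth is easy to reproduce with an idealized FLOP count. The estimator below counts only the attention-score and weighted-sum terms (about $4Ld$ per layer for single-query T→H attention versus about $4L^2 d$ for full self-attention); the constants and the choice of $d$ are illustrative, and this is not the paper's FLOP accounting:

```python
def attention_flops(L, d, layers=4, mode="stca"):
    """Idealized per-example attention FLOPs (scores + weighted sum only).

    mode='stca': one target query attends to L history items -> ~4*L*d per layer.
    mode='self': L tokens attend to each other               -> ~4*L*L*d per layer.
    Projection and FFN FLOPs, which both designs incur, are ignored.
    """
    per_layer = 4 * L * d if mode == "stca" else 4 * L * L * d
    return layers * per_layer


d = 128
for L in (500, 2000, 10000):
    stca = attention_flops(L, d, mode="stca")
    full = attention_flops(L, d, mode="self")
    print(f"L={L:>6}: STCA {stca/1e9:.3f} GFLOPs  self-attn {full/1e9:.2f} GFLOPs  ratio {full/stca:,.0f}x")
```

The attention-only ratio grows with $L$ itself, consistent with the roughly linear-versus-quadratic scaling described above.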
4. Implementation Details and System Integration
Typical deployment hyperparameters for STCA include the embedding dimension $d$, the SwiGLU expansion ratio, the number of heads $h$, and the stack depth $M$ (the production comparison above uses a 4-layer stack). In production, Request-Level Batching (RLB) further amplifies the efficiency gains by sharing the user-history encoding across all candidate targets of a request, which reduces host–device bandwidth, raises end-to-end throughput, and extends the feasible history length at a fixed GPU RAM footprint.
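Request-Level Batching exploits the fact that all candidate targets in a request share the same user history, so the history only needs to be transferred (and, in a full implementation, its target-independent per-layer encodings computed) once per request rather than once per candidate. A minimal sketch of the transfer-sharing part, reusing the hypothetical `StackedTargetHistoryAttention` module sketched in Section 2, is:

```python
import torch

# One request: a single user history shared by N candidate targets.
L, d, N = 10_000, 128, 256
history = torch.randn(1, L, d)     # transferred to the device once per request (RLB)
targets = torch.randn(N, d)        # N candidate targets for the same user

model = StackedTargetHistoryAttention(d=d, heads=4, layers=4)

# Without RLB the history would be replicated per candidate on the host-device link:
#   X = history.repeat(N, 1, 1)    # N * L * d floats transferred
# With RLB it is sent once and expanded on-device (expand creates a view, not a copy):
X = history.expand(N, L, d)
z = model(targets, X)              # [N, d] target-conditioned summaries
```

A production implementation would additionally cache the target-independent history encodings and key/value projections across candidates, which this sketch does not do.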
Training follows a “train sparsely / infer densely” paradigm: training sequence lengths are sampled stochastically per example, so the average training length is far shorter than the full history used at inference time. Lengths are drawn from a U-shaped Beta distribution, and the most recent tokens are always retained (suffix truncation). This sequence-sparsity (SS) scheme recovers most of the full-length AUC gain at roughly one-third of the training compute.
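The train-sparsely schedule can be sketched as follows. The Beta parameters, minimum length, and function names below are placeholders (the paper's exact values are not reproduced in this summary); only the U-shaped sampling and suffix truncation reflect the description above:

```python
import numpy as np

rng = np.random.default_rng(0)


def sample_training_length(L_max, L_min=64, a=0.3, b=0.3):
    """Draw a training sequence length from a U-shaped Beta distribution.

    a, b < 1 gives the U shape (mass near both very short and near-full lengths);
    the exact (a, b) and L_min are illustrative placeholders, not the paper's values.
    """
    frac = rng.beta(a, b)
    return int(L_min + frac * (L_max - L_min))


def suffix_truncate(history_ids, L_train):
    """Keep only the most recent L_train tokens (suffix truncation)."""
    return history_ids[-L_train:]


history_ids = list(range(10_000))                  # stand-in for a 10k-token user history
L_train = sample_training_length(L_max=len(history_ids))
batch_history = suffix_truncate(history_ids, L_train)
print(L_train, len(batch_history))
```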
5. Pseudocode and Workflow
The STCA forward pass is succinctly captured by the following pseudocode:
```
q = LayerNorm(SwiGLUFFN^{(1)}(x_t))                   # initial target query        [1×d]
for i in 1..M:                                        # stacked STCA layers
    X_i = LayerNorm(SwiGLUFFN^{(i)}(X))               # encode history              [L×d]
    for r in 1..h:                                    # attention heads
        qQ = q · W_Q^{(i,r)}                          # query projection            [1×d_h]
        K  = X_i · W_K^{(i,r)}                        # key projection              [L×d_h]
        V  = X_i · W_V^{(i,r)}                        # value projection            [L×d_h]
        scores = qQ · K^T / sqrt(d_h)                 # raw scores                  [1×L]
        α = softmax(scores)                           # attention weights           [1×L]
        o_r = α · V                                   # per-head weighted sum       [1×d_h]
    o^{(i)} = concat(o_1, …, o_h) · W_O^{(i)}         # output projection           [1×d]
    if i < M:                                         # target-conditioned query fusion
        concat_q = [o^{(1)}, …, o^{(i)}, x_t]         #                             [1×(i+1)d]
        q = SwiGLUFFN^{(i+1)}(concat_q · W_C^{(i+1)}) # next-layer query            [1×d]
z = SwiGLUFFN([o^{(1)}, …, o^{(M)}, x_t] · W_Z)       # target-conditioned summary  [1×d]
return z
```
This process yields a target-conditioned history summary $z$ for the subsequent prediction layers of the recommendation pipeline.
6. Empirical Results and Comparative Analysis
STCA demonstrates both efficiency and effectiveness in large-scale recommender deployments. In matched-compute offline experiments without retrieval features, STCA+RLB+Extrapolation achieves a +0.49% finish AUC lift and −1.16% NLL, compared to +0.31%/−0.86% for HSTU and +0.25%/−0.46% for standard Transformer architectures. Ablation studies indicate additional incremental benefits from deeper stacks (+0.18% AUC, 2→4 layers), SwiGLU activation (+0.11%), enlarged sparse ID embeddings (+0.08%), time-delta side information (+0.08%), increased head count (+0.05%), and enhanced query fusion (+0.06%). These findings highlight the modular and extensible nature of STCA within end-to-end recommendation systems serving billion-scale traffic.
7. Limitations and Theoretical Trade-Offs
While STCA prioritizes the target–history interactions most critical for ranking, omitting the history–history dependencies modeled by full self-attention is an explicit trade-off. It is justified by the empirical observation that such dependencies contribute only marginally once the history is long and only a limited portion of it is relevant to a given target. STCA's linear scaling makes it practical to model sequence lengths previously considered infeasible for real-time systems, and experimental ablations confirm that the accuracy–cost trade-off is favorable relative to existing alternatives.
In summary, STCA redefines the design landscape for long-sequence attention modules in recommender systems by enabling linear compute scaling, maintaining end-to-end differentiability, and matching Transformer accuracy at a small fraction of the cost. Its integration with Request Level Batching and length-extrapolative training further extends its capability for industrial-scale deployments with strict latency and resource constraints (Guan et al., 8 Nov 2025).