Stacked Target-to-History Cross Attention

Updated 15 November 2025
  • The paper introduces STCA as an efficient attention mechanism that focuses on target-to-history interactions to achieve linear complexity.
  • STCA replaces full self-attention with a stacked single-query multi-head cross attention, significantly lowering computational costs in long-sequence modeling.
  • Empirical results show that STCA delivers comparable or superior ranking quality with up to 7.4× lower compute, making it ideal for large-scale recommender systems.

Stacked Target-to-History Cross Attention (STCA) is an attention mechanism introduced for industrial-scale long-sequence modeling, specifically designed to process extremely long user histories—on the order of 10,000 tokens—in real-time recommender systems operating under stringent latency and compute budgets. STCA substitutes conventional full self-attention over targets and history with a stacked strategy of single-query, multi-head cross-attention from target to history, achieving linear complexity with sequence length. This architectural innovation enables end-to-end modeling at a scale previously impractical for production settings, as demonstrated in deployment at full traffic on Douyin (Guan et al., 8 Nov 2025).

1. Motivation and Conceptual Foundations

Traditional recommender systems leveraging self-attention encounter prohibitive resource consumption when ingesting sequences of length $L \sim 10{,}000$. The per-layer FLOPs and memory usage of a standard Transformer scale as $O(L^2 d)$, making it infeasible for low-latency, high-throughput applications. Empirically, the majority of the utility for ranking a candidate target $t$ arises from its direct interactions with history items, rather than from the inter-history dependencies modeled by self-attention. STCA addresses this by omitting intra-history modeling and focusing attention solely on target-to-history (T→H) interactions, yielding an attention block whose cost scales as $O(L d h)$, where $d$ is the embedding dimension and $h$ the number of heads. This trade-off substantially reduces compute requirements while retaining the most relevant signal for target ranking.
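As a rough worked illustration, using the deployment-scale values reported later in this article ($L = 10{,}000$, $d = 256$, $h = 8$) and counting only the attention block itself:

$$L^2 d = (10^4)^2 \times 256 \approx 2.6\times 10^{10}, \qquad L\,d\,h = 10^4 \times 256 \times 8 \approx 2.0\times 10^{7},$$

a gap of roughly three orders of magnitude per layer. The whole-model FLOP comparisons reported in Section 3 are less extreme because attention is only one component of the network.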

2. Formal Definition and Layer Structure

A single STCA layer operates as follows (a code sketch of the full stacked construction appears at the end of this section):

  • Notation:
    • History embeddings $\mathcal H = \{x_1, \ldots, x_L\}$, $x_j \in \mathbb R^d$
    • Target embedding $x_t \in \mathbb R^d$
    • Number of heads $h$, head dimension $d_h = d/h$
    • Parameters: $W_Q^{(i,r)}, W_K^{(i,r)}, W_V^{(i,r)} \in \mathbb R^{d \times d_h}$; output projection $W_O^{(i)} \in \mathbb R^{d\times d}$
  • Step 1: Input Encoding (Eq. 1)

$$\widetilde X^{(i)} = \mathrm{LN}\bigl(\mathrm{SwiGLUFFN}^{(i)}(X)\bigr) \in \mathbb{R}^{L \times d}$$

$$q^{(1)} = \mathrm{LN}\bigl(\mathrm{SwiGLUFFN}^{(1)}(x_t)\bigr) \in \mathbb{R}^d$$

with the SwiGLU-FFN defined as

$$\mathrm{SwiGLUFFN}(z) = \bigl((zW_u)\odot\mathrm{SiLU}(zW_v)\bigr)W_o$$

  • Step 2: Multi-Head Target→History Cross Attention (per layer $i$, head $r$)

    • Raw scores:

$$s^{(i,r)} = \frac{\bigl(q^{(i)}W_Q^{(i,r)}\bigr)\,\bigl(\widetilde X^{(i)} W_K^{(i,r)}\bigr)^\top}{\sqrt{d_h}} \in \mathbb{R}^{1 \times L}$$

    • Weights:

$$\alpha^{(i,r)} = \mathrm{softmax}\bigl(s^{(i,r)}\bigr) \in \mathbb{R}^{1 \times L}$$

    • Weighted sum:

$$o^{(i,r)} = \alpha^{(i,r)}\,\bigl(\widetilde X^{(i)} W_V^{(i,r)}\bigr) \in \mathbb{R}^{1 \times d_h}$$

    • Output projection:

$$o^{(i)} = \bigl[o^{(i,1)} \,\|\, \cdots \,\|\, o^{(i,h)}\bigr]\, W_O^{(i)} \in \mathbb{R}^{d}$$

  • Step 3: Layer Stacking and Target-Conditioned Query Fusion. For layer $i+1$, the query is constructed by fusing the previous layers' outputs with the raw target:

$$q^{(i+1)} = \mathrm{SwiGLUFFN}^{(i+1)}\bigl(\bigl[o^{(1)} \,\|\, \cdots \,\|\, o^{(i)} \,\|\, x_t\bigr]\, W_C^{(i+1)}\bigr)$$

After $M$ layers, the history summary is

$$Z_\mathcal{H} = \begin{bmatrix} o^{(1)} \\ \vdots \\ o^{(M)} \end{bmatrix} \in \mathbb{R}^{M \times d}$$

The final target-conditioned representation for prediction is

$$z = \mathrm{SwiGLUFFN}\bigl(\bigl[o^{(1)} \,\|\, \cdots \,\|\, o^{(M)} \,\|\, x_t\bigr]\, W_Z\bigr)$$
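The following PyTorch sketch instantiates the construction above under stated assumptions: it reuses `torch.nn.MultiheadAttention` for the single-query multi-head cross attention (which fuses the per-head projections and $W_O^{(i)}$ internally), folds the fusion matrices $W_C^{(i+1)}$ and $W_Z$ into the input layer of the corresponding SwiGLU-FFN, and omits biases and side features. Class and argument names are illustrative, not the authors' released code.

```python
# Minimal STCA sketch (illustrative; not the paper's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUFFN(nn.Module):
    """SwiGLU feed-forward: ((z W_u) * SiLU(z W_v)) W_o."""
    def __init__(self, d_in, d_hidden, d_out):
        super().__init__()
        self.w_u = nn.Linear(d_in, d_hidden, bias=False)
        self.w_v = nn.Linear(d_in, d_hidden, bias=False)
        self.w_o = nn.Linear(d_hidden, d_out, bias=False)

    def forward(self, z):
        return self.w_o(self.w_u(z) * F.silu(self.w_v(z)))


class STCA(nn.Module):
    """Stacked single-query multi-head target-to-history cross attention."""
    def __init__(self, d=256, heads=8, layers=4, ffn_ratio=4):
        super().__init__()
        self.M = layers
        self.hist_ffn = nn.ModuleList([SwiGLUFFN(d, ffn_ratio * d, d) for _ in range(layers)])
        self.hist_ln = nn.ModuleList([nn.LayerNorm(d) for _ in range(layers)])
        # Query FFNs: layer 1 consumes x_t alone; layer i+1 consumes [o^(1..i) || x_t],
        # with W_C^(i+1) absorbed into the FFN's input projection.
        self.query_ffn = nn.ModuleList(
            [SwiGLUFFN((i + 1) * d, ffn_ratio * d, d) for i in range(layers)])
        self.query_ln = nn.LayerNorm(d)  # applied to the initial query only (Eq. 1)
        self.attn = nn.ModuleList(
            [nn.MultiheadAttention(d, heads, bias=False, batch_first=True) for _ in range(layers)])
        self.out_ffn = SwiGLUFFN((layers + 1) * d, ffn_ratio * d, d)  # W_Z folded in

    def forward(self, history, target):
        """history: [B, L, d]; target: [B, d]; returns z: [B, d]."""
        outs = []
        q = self.query_ln(self.query_ffn[0](target)).unsqueeze(1)   # q^(1), [B, 1, d]
        for i in range(self.M):
            x = self.hist_ln[i](self.hist_ffn[i](history))          # encoded history, [B, L, d]
            o, _ = self.attn[i](q, x, x)                            # single-query cross attention
            outs.append(o.squeeze(1))                               # o^(i), [B, d]
            if i + 1 < self.M:                                      # target-conditioned query fusion
                q = self.query_ffn[i + 1](torch.cat(outs + [target], dim=-1)).unsqueeze(1)
        return self.out_ffn(torch.cat(outs + [target], dim=-1))     # z, [B, d]
```

For example, `STCA()(torch.randn(2, 10_000, 256), torch.randn(2, 256))` returns a `[2, 256]` target-conditioned summary `z` for two (history, target) pairs.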

3. Computational Complexity and Scalability

STCA reduces per-layer complexity from the quadratic $O(L^2 d)$ of standard self-attention to linear $O(L d h)$. With an additional "reordering trick" that reassociates the matrix products for efficient computation, the cost per head becomes

$$O(d\,d_h) + O(Ld) + O(d\,d_h) = O(Ld + h d^2) \approx O(L d h)$$

when $L \gg d$. Empirical scaling (Fig. 5 of the paper) demonstrates that, as history length increases from 500 to 10,000, STCA's FLOPs grow by approximately $20\times$, compared to $114\times$ for Transformer self-attention. In production, a 4-layer STCA model at $L=10{,}000$ (21 GFLOPs) achieves parity or superiority (in NLL) over a 4-layer Transformer at $L=8{,}000$ (156 GFLOPs), representing roughly $7.4\times$ lower compute for matched quality.
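The numpy sketch below illustrates the general idea behind such a reordering (it shows the reassociation of the matrix products, not necessarily the paper's exact formulation): because there is a single query, $W_Q W_K^\top$ can be folded into one effective $d$-dimensional query and the value projection applied after the weighted sum, so the only length-dependent work per head is $O(Ld)$. Variable names are illustrative.

```python
# Naive:      s = (q @ W_Q) @ (X @ W_K).T / sqrt(d_h)     -> X @ W_K costs O(L * d * d_h)
# Reordered:  s = ((q @ W_Q) @ W_K.T) @ X.T / sqrt(d_h)   -> only O(L * d) depends on L
import numpy as np

L, d, d_h = 10_000, 256, 32
rng = np.random.default_rng(0)
X = rng.standard_normal((L, d))          # history embeddings
q = rng.standard_normal((1, d))          # single target query
W_Q, W_K, W_V = (rng.standard_normal((d, d_h)) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Naive order: project every history token into the d_h-dimensional key/value space.
s_naive = (q @ W_Q) @ (X @ W_K).T / np.sqrt(d_h)     # [1, L]
o_naive = softmax(s_naive) @ (X @ W_V)               # [1, d_h]

# Reordered: fold W_Q W_K^T into an effective d-dim query; project after the weighted sum.
q_eff = (q @ W_Q) @ W_K.T                            # [1, d], independent of L
s_fast = q_eff @ X.T / np.sqrt(d_h)                  # [1, L]
o_fast = (softmax(s_fast) @ X) @ W_V                 # [1, d_h]

assert np.allclose(o_naive, o_fast)                  # same result, fewer L-dependent FLOPs
```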

4. Implementation Details and System Integration

Typical deployment hyperparameters for STCA include embedding dimension $d=256$, SwiGLU FFN multiplier ratio $r=4$, number of heads $h=8$, and stack depth $M=4$ layers. In production, Request-Level Batching (RLB) further amplifies the efficiency gains by sharing the user-history encoding across the $m=8$ targets of a request, leading to approximately 77%–84% host–device bandwidth reduction and a $2.2\times$–$5.1\times$ end-to-end throughput uplift, and extending the feasible history length by $8\times$ at a fixed GPU RAM footprint.
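As an illustration of this request-level grouping (a minimal sketch assuming the hypothetical `stca` module interface from the sketch in Section 2, not the production serving code), the history tensor is transferred and materialized once per request and broadcast across that request's candidates:

```python
import torch

def score_request(stca, history, targets):
    """Score all candidates of one request against a single shared user history.

    history: [L, d] user-history embeddings, sent to the device once per request
    targets: [m, d] candidate target embeddings for the same request
    """
    m = targets.size(0)
    shared = history.unsqueeze(0).expand(m, -1, -1)  # broadcast view; avoids an m-fold copy
    return stca(shared, targets)                     # [m, d] target-conditioned summaries
```

Grouping by request in this way is what lets the history cross the host–device boundary once rather than $m$ times; further compute sharing (e.g., computing the per-layer history encodings $\widetilde X^{(i)}$ once per request) is possible but not shown here.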

Training follows a "train sparsely / infer densely" paradigm: stochastic training sequence lengths $L_{\mathrm{train}} \in [L_{\min}, L_{\max}]$ with average $L_{\mathrm{avg}} \approx 2{,}000$, and inference at $L_{\mathrm{infer}} = 10{,}000$. Length sampling uses a U-shaped Beta distribution with $\alpha \approx 0.02$, and the most recent tokens are always prioritized (suffix truncation). A sequence sparsity of 20% achieves roughly 80% of the full-length AUC gain at one-third of the compute.
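One possible implementation of this length sampling is sketched below; the mapping from the Beta draw to a concrete length and the bound names `L_min`/`L_max` are assumptions, while the U-shaped $\mathrm{Beta}(\alpha,\alpha)$ with $\alpha\approx 0.02$ and the keep-the-most-recent-tokens rule follow the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_training_history(history, L_min=500, L_max=10_000, alpha=0.02):
    """Subsample one user history for a training step ("train sparsely").

    Draws a length fraction from a U-shaped Beta(alpha, alpha), so most draws are
    either very short or near full length, then keeps the most recent tokens
    (suffix truncation). `history` is ordered oldest -> newest.
    """
    frac = rng.beta(alpha, alpha)                      # mass concentrated near 0 and 1
    L_train = int(round(L_min + frac * (L_max - L_min)))
    L_train = min(L_train, len(history))
    return history[-L_train:]                          # newest L_train tokens
```

At inference, the full $L_{\mathrm{infer}} = 10{,}000$ suffix is used ("infer densely").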

5. Pseudocode and Workflow

The STCA forward pass is succinctly captured by the following pseudocode:

q = LayerNorm(SwiGLUFFN^{(1)}(x_t))  # initial query
for i in 1..M:
    X_i = LayerNorm(SwiGLUFFN^{(i)}(X))  # encode history
    for r in 1..h:
        qQ = q · W_Q^{(i,r)}             # [1×d_h]
        K  = X_i · W_K^{(i,r)}           # [L×d_h]
        V  = X_i · W_V^{(i,r)}           # [L×d_h]
        scores = qQ · K^T / sqrt(d_h)    # [1×L]
        α = softmax(scores)              # [1×L]
        o_r = α · V                      # [1×d_h]
    o^{(i)} = concat(o_1, …, o_h) · W_O^{(i)}  # [d]
    if i < M:
        concat_q = [o^{(1)} ‖ … ‖ o^{(i)} ‖ x_t]    # [(i+1)d]
        q = SwiGLUFFN^{(i+1)}(concat_q · W_C^{(i+1)})
z = SwiGLUFFN([o^{(1)} ‖ … ‖ o^{(M)} ‖ x_t] · W_Z)
return z

This process yields a target-conditioned history summary for subsequent prediction layers in the recommendation pipeline.

6. Empirical Results and Comparative Analysis

STCA demonstrates both efficiency and effectiveness in large-scale recommender deployments. In matched-compute offline experiments without retrieval features, STCA+RLB+Extrapolation achieves a +0.49% finish AUC lift and −1.16% NLL, compared to +0.31%/−0.86% for HSTU and +0.25%/−0.46% for standard Transformer architectures. Ablation studies indicate additional incremental benefits from deeper stacks (+0.18% AUC, 2→4 layers), SwiGLU activation (+0.11%), enlarged sparse ID embeddings (+0.08%), time-delta side information (+0.08%), increased head count (+0.05%), and enhanced query fusion (+0.06%). These findings highlight the modular, extensible nature of STCA within end-to-end recommendation systems serving billion-scale traffic.

7. Limitations and Theoretical Trade-Offs

While STCA prioritizes target–history interactions critical for ranking, the omission of history–history dependencies (modeled in full self-attention) constitutes an explicit trade-off. This is theoretically justified by the empirical observation that such dependencies contribute marginally once history length becomes large and only a limited portion of history is most relevant to the target. STCA’s linear scaling unlocks practical modeling of sequences at lengths previously considered infeasible for real-time systems, while experimental ablations confirm that the accuracy-cost tradeoff is favorable compared to existing alternatives.

In summary, STCA redefines the design landscape for long-sequence attention modules in recommender systems by enabling linear compute scaling, maintaining end-to-end differentiability, and matching Transformer accuracy at a small fraction of the cost. Its integration with Request Level Batching and length-extrapolative training further extends its capability for industrial-scale deployments with strict latency and resource constraints (Guan et al., 8 Nov 2025).
