Stacked Target-to-History Cross Attention

Updated 15 November 2025
  • The paper introduces STCA as an efficient attention mechanism that focuses on target-to-history interactions to achieve linear complexity.
  • STCA replaces full self-attention with a stacked single-query multi-head cross attention, significantly lowering computational costs in long-sequence modeling.
  • Empirical results show that STCA delivers comparable or superior ranking quality with up to 7.4× lower compute, making it ideal for large-scale recommender systems.

Stacked Target-to-History Cross Attention (STCA) is an attention mechanism introduced for industrial-scale long-sequence modeling, specifically designed to process extremely long user histories—on the order of 10,000 tokens—in real-time recommender systems operating under stringent latency and compute budgets. STCA substitutes conventional full self-attention over targets and history with a stacked strategy of single-query, multi-head cross-attention from target to history, achieving linear complexity with sequence length. This architectural innovation enables end-to-end modeling at a scale previously impractical for production settings, as demonstrated in deployment at full traffic on Douyin (Guan et al., 8 Nov 2025).

1. Motivation and Conceptual Foundations

Traditional recommender systems leveraging self-attention encounter prohibitive resource consumption when ingesting sequences of length $L \sim 10{,}000$. The per-layer FLOPs and memory usage of a standard Transformer scale as $O(L^2 d)$, making it infeasible for low-latency, high-throughput applications. Empirically, the majority of the utility for ranking a candidate target $t$ arises from its direct interactions with history items, rather than from the inter-history dependencies modeled by self-attention. STCA addresses this by omitting intra-history modeling and focusing attention solely on target-to-history (T→H) interactions, yielding an attention block whose cost scales as $O(L d h)$, where $d$ is the embedding dimension and $h$ the number of heads. This trade-off substantially reduces compute requirements while retaining the most relevant signal for target ranking.
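As a rough worked illustration, using the deployment-scale values reported later in this article ($L = 10{,}000$, $d = 256$, $h = 8$) and counting only the attention block itself:

$$L^2 d = (10^4)^2 \times 256 \approx 2.6\times 10^{10}, \qquad L\,d\,h = 10^4 \times 256 \times 8 \approx 2.0\times 10^{7},$$

a gap of roughly three orders of magnitude per layer. The whole-model FLOP comparisons reported in Section 3 are less extreme because attention is only one component of the network.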

2. Formal Definition and Layer Structure

A single STCA layer operates as follows (a code sketch of the full stacked construction appears at the end of this section):

  • Notation:
    • History embeddings $\mathcal H = \{x_1, \ldots, x_L\}$, $x_j \in \mathbb R^d$
    • Target embedding $x_t \in \mathbb R^d$
    • Number of heads $h$, head dimension $d_h = d/h$
    • Parameters: $W_Q^{(i,r)}, W_K^{(i,r)}, W_V^{(i,r)} \in \mathbb R^{d \times d_h}$; output projection $W_O^{(i)} \in \mathbb R^{d\times d}$
  • Step 1: Input Encoding (Eq. 1)

$$\widetilde X^{(i)} = \mathrm{LN}\bigl(\mathrm{SwiGLUFFN}^{(i)}(X)\bigr) \in \mathbb{R}^{L \times d}$$

$$q^{(1)} = \mathrm{LN}\bigl(\mathrm{SwiGLUFFN}^{(1)}(x_t)\bigr) \in \mathbb{R}^d$$

with the SwiGLU-FFN defined as

$$\mathrm{SwiGLUFFN}(z) = \bigl((zW_u)\odot\mathrm{SiLU}(zW_v)\bigr)W_o$$

  • Step 2: Multi-Head Target→History Cross Attention (per layer $i$, head $r$)

    • Raw scores:

$$s^{(i,r)} = \frac{\bigl(q^{(i)}W_Q^{(i,r)}\bigr)\,\bigl(\widetilde X^{(i)} W_K^{(i,r)}\bigr)^\top}{\sqrt{d_h}} \in \mathbb{R}^{1 \times L}$$

    • Weights:

$$\alpha^{(i,r)} = \mathrm{softmax}\bigl(s^{(i,r)}\bigr) \in \mathbb{R}^{1 \times L}$$

    • Weighted sum:

$$o^{(i,r)} = \alpha^{(i,r)}\,\bigl(\widetilde X^{(i)} W_V^{(i,r)}\bigr) \in \mathbb{R}^{1 \times d_h}$$

    • Output projection:

$$o^{(i)} = \bigl[o^{(i,1)} \,\|\, \cdots \,\|\, o^{(i,h)}\bigr]\, W_O^{(i)} \in \mathbb{R}^{d}$$

  • Step 3: Layer Stacking and Target-Conditioned Query Fusion. For layer $i+1$, the query is constructed by fusing the previous layers' outputs with the raw target:

$$q^{(i+1)} = \mathrm{SwiGLUFFN}^{(i+1)}\bigl(\bigl[o^{(1)} \,\|\, \cdots \,\|\, o^{(i)} \,\|\, x_t\bigr]\, W_C^{(i+1)}\bigr)$$

After $M$ layers, the history summary is

$$Z_\mathcal{H} = \begin{bmatrix} o^{(1)} \\ \vdots \\ o^{(M)} \end{bmatrix} \in \mathbb{R}^{M \times d}$$

The final target-conditioned representation for prediction is

$$z = \mathrm{SwiGLUFFN}\bigl(\bigl[o^{(1)} \,\|\, \cdots \,\|\, o^{(M)} \,\|\, x_t\bigr]\, W_Z\bigr)$$
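The following PyTorch sketch instantiates the construction above under stated assumptions: it reuses `torch.nn.MultiheadAttention` for the single-query multi-head cross attention (which fuses the per-head projections and $W_O^{(i)}$ internally), folds the fusion matrices $W_C^{(i+1)}$ and $W_Z$ into the input layer of the corresponding SwiGLU-FFN, and omits biases and side features. Class and argument names are illustrative, not the authors' released code.

```python
# Minimal STCA sketch (illustrative; not the paper's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUFFN(nn.Module):
    """SwiGLU feed-forward: ((z W_u) * SiLU(z W_v)) W_o."""
    def __init__(self, d_in, d_hidden, d_out):
        super().__init__()
        self.w_u = nn.Linear(d_in, d_hidden, bias=False)
        self.w_v = nn.Linear(d_in, d_hidden, bias=False)
        self.w_o = nn.Linear(d_hidden, d_out, bias=False)

    def forward(self, z):
        return self.w_o(self.w_u(z) * F.silu(self.w_v(z)))


class STCA(nn.Module):
    """Stacked single-query multi-head target-to-history cross attention."""
    def __init__(self, d=256, heads=8, layers=4, ffn_ratio=4):
        super().__init__()
        self.M = layers
        self.hist_ffn = nn.ModuleList([SwiGLUFFN(d, ffn_ratio * d, d) for _ in range(layers)])
        self.hist_ln = nn.ModuleList([nn.LayerNorm(d) for _ in range(layers)])
        # Query FFNs: layer 1 consumes x_t alone; layer i+1 consumes [o^(1..i) || x_t],
        # with W_C^(i+1) absorbed into the FFN's input projection.
        self.query_ffn = nn.ModuleList(
            [SwiGLUFFN((i + 1) * d, ffn_ratio * d, d) for i in range(layers)])
        self.query_ln = nn.LayerNorm(d)  # applied to the initial query only (Eq. 1)
        self.attn = nn.ModuleList(
            [nn.MultiheadAttention(d, heads, bias=False, batch_first=True) for _ in range(layers)])
        self.out_ffn = SwiGLUFFN((layers + 1) * d, ffn_ratio * d, d)  # W_Z folded in

    def forward(self, history, target):
        """history: [B, L, d]; target: [B, d]; returns z: [B, d]."""
        outs = []
        q = self.query_ln(self.query_ffn[0](target)).unsqueeze(1)   # q^(1), [B, 1, d]
        for i in range(self.M):
            x = self.hist_ln[i](self.hist_ffn[i](history))          # encoded history, [B, L, d]
            o, _ = self.attn[i](q, x, x)                            # single-query cross attention
            outs.append(o.squeeze(1))                               # o^(i), [B, d]
            if i + 1 < self.M:                                      # target-conditioned query fusion
                q = self.query_ffn[i + 1](torch.cat(outs + [target], dim=-1)).unsqueeze(1)
        return self.out_ffn(torch.cat(outs + [target], dim=-1))     # z, [B, d]
```

For example, `STCA()(torch.randn(2, 10_000, 256), torch.randn(2, 256))` returns a `[2, 256]` target-conditioned summary `z` for two (history, target) pairs.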

3. Computational Complexity and Scalability

STCA reduces per-layer complexity from the quadratic $O(L^2 d)$ of standard self-attention to linear $O(L d h)$. With an additional "reordering trick" that reassociates the matrix products for efficient computation, the cost per head becomes

$$O(d\,d_h) + O(Ld) + O(d\,d_h) = O(Ld + h d^2) \approx O(L d h)$$

when $L \gg d$. Empirical scaling (Fig. 5 of the paper) demonstrates that, as history length increases from 500 to 10,000, STCA's FLOPs grow by approximately $20\times$, compared to $114\times$ for Transformer self-attention. In production, a 4-layer STCA model at $L=10{,}000$ (21 GFLOPs) achieves parity or superiority (in NLL) over a 4-layer Transformer at $L=8{,}000$ (156 GFLOPs), representing roughly $7.4\times$ lower compute for matched quality.
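The numpy sketch below illustrates the general idea behind such a reordering (it shows the reassociation of the matrix products, not necessarily the paper's exact formulation): because there is a single query, $W_Q W_K^\top$ can be folded into one effective $d$-dimensional query and the value projection applied after the weighted sum, so the only length-dependent work per head is $O(Ld)$. Variable names are illustrative.

```python
# Naive:      s = (q @ W_Q) @ (X @ W_K).T / sqrt(d_h)     -> X @ W_K costs O(L * d * d_h)
# Reordered:  s = ((q @ W_Q) @ W_K.T) @ X.T / sqrt(d_h)   -> only O(L * d) depends on L
import numpy as np

L, d, d_h = 10_000, 256, 32
rng = np.random.default_rng(0)
X = rng.standard_normal((L, d))          # history embeddings
q = rng.standard_normal((1, d))          # single target query
W_Q, W_K, W_V = (rng.standard_normal((d, d_h)) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Naive order: project every history token into the d_h-dimensional key/value space.
s_naive = (q @ W_Q) @ (X @ W_K).T / np.sqrt(d_h)     # [1, L]
o_naive = softmax(s_naive) @ (X @ W_V)               # [1, d_h]

# Reordered: fold W_Q W_K^T into an effective d-dim query; project after the weighted sum.
q_eff = (q @ W_Q) @ W_K.T                            # [1, d], independent of L
s_fast = q_eff @ X.T / np.sqrt(d_h)                  # [1, L]
o_fast = (softmax(s_fast) @ X) @ W_V                 # [1, d_h]

assert np.allclose(o_naive, o_fast)                  # same result, fewer L-dependent FLOPs
```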

4. Implementation Details and System Integration

Typical deployment hyperparameters for STCA include embedding dimension $d=256$, SwiGLU FFN multiplier ratio $r=4$, number of heads $h=8$, and stack depth $M=4$ layers. In production, Request-Level Batching (RLB) further amplifies the efficiency gains by sharing the user-history encoding across the $m=8$ targets of a request, leading to approximately 77%–84% host–device bandwidth reduction and a $2.2\times$–$5.1\times$ end-to-end throughput uplift, and extending the feasible history length by $8\times$ at a fixed GPU RAM footprint.
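As an illustration of this request-level grouping (a minimal sketch assuming the hypothetical `stca` module interface from the sketch in Section 2, not the production serving code), the history tensor is transferred and materialized once per request and broadcast across that request's candidates:

```python
import torch

def score_request(stca, history, targets):
    """Score all candidates of one request against a single shared user history.

    history: [L, d] user-history embeddings, sent to the device once per request
    targets: [m, d] candidate target embeddings for the same request
    """
    m = targets.size(0)
    shared = history.unsqueeze(0).expand(m, -1, -1)  # broadcast view; avoids an m-fold copy
    return stca(shared, targets)                     # [m, d] target-conditioned summaries
```

Grouping by request in this way is what lets the history cross the host–device boundary once rather than $m$ times; further compute sharing (e.g., computing the per-layer history encodings $\widetilde X^{(i)}$ once per request) is possible but not shown here.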

Training follows a "train sparsely / infer densely" paradigm: stochastic training sequence lengths $L_{\mathrm{train}} \in [L_{\min}, L_{\max}]$ with average $L_{\mathrm{avg}} \approx 2{,}000$, and inference at $L_{\mathrm{infer}} = 10{,}000$. Length sampling uses a U-shaped Beta distribution with $\alpha \approx 0.02$, and the most recent tokens are always prioritized (suffix truncation). A sequence sparsity of 20% achieves roughly 80% of the full-length AUC gain at one-third of the compute.
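One possible implementation of this length sampling is sketched below; the mapping from the Beta draw to a concrete length and the bound names `L_min`/`L_max` are assumptions, while the U-shaped $\mathrm{Beta}(\alpha,\alpha)$ with $\alpha\approx 0.02$ and the keep-the-most-recent-tokens rule follow the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_training_history(history, L_min=500, L_max=10_000, alpha=0.02):
    """Subsample one user history for a training step ("train sparsely").

    Draws a length fraction from a U-shaped Beta(alpha, alpha), so most draws are
    either very short or near full length, then keeps the most recent tokens
    (suffix truncation). `history` is ordered oldest -> newest.
    """
    frac = rng.beta(alpha, alpha)                      # mass concentrated near 0 and 1
    L_train = int(round(L_min + frac * (L_max - L_min)))
    L_train = min(L_train, len(history))
    return history[-L_train:]                          # newest L_train tokens
```

At inference, the full $L_{\mathrm{infer}} = 10{,}000$ suffix is used ("infer densely").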

5. Pseudocode and Workflow

The STCA forward pass is succinctly captured by the following pseudocode:

q = LayerNorm(SwiGLUFFN^{(1)}(x_t))  # initial query
for i in 1..M:
    X_i = LayerNorm(SwiGLUFFN^{(i)}(X))  # encode history
    for r in 1..h:
        qQ = q · W_Q^{(i,r)}             # [1×d_h]
        K  = X_i · W_K^{(i,r)}           # [L×d_h]
        V  = X_i · W_V^{(i,r)}           # [L×d_h]
        scores = qQ · K^T / sqrt(d_h)    # [1×L]
        α = softmax(scores)              # [1×L]
        o_r = α · V                      # [1×d_h]
    o^{(i)} = concat(o_1, …, o_h) · W_O^{(i)}  # [d]
    if i < M:
        concat_q = [o^{(1)} ‖ … ‖ o^{(i)} ‖ x_t]    # [(i+1)d]
        q = SwiGLUFFN^{(i+1)}(concat_q · W_C^{(i+1)})
z = SwiGLUFFN([o^{(1)} ‖ … ‖ o^{(M)} ‖ x_t] · W_Z)
return z

This process yields a target-conditioned history summary for subsequent prediction layers in the recommendation pipeline.

6. Empirical Results and Comparative Analysis

STCA demonstrates both efficiency and effectiveness in large-scale recommender deployments. In matched-compute offline experiments without retrieval features, STCA+RLB+Extrapolation achieves a +0.49% finish AUC lift and −1.16% NLL, compared to +0.31%/−0.86% for HSTU and +0.25%/−0.46% for standard Transformer architectures. Ablation studies indicate additional incremental benefits from deeper stacks (+0.18% AUC, 2→4 layers), SwiGLU activation (+0.11%), enlarged sparse ID embeddings (+0.08%), time-delta side information (+0.08%), increased head count (+0.05%), and enhanced query fusion (+0.06%). These findings highlight the modular, extensible nature of STCA within end-to-end recommendation systems serving billion-scale traffic.

7. Limitations and Theoretical Trade-Offs

While STCA prioritizes target–history interactions critical for ranking, the omission of history–history dependencies (modeled in full self-attention) constitutes an explicit trade-off. This is theoretically justified by the empirical observation that such dependencies contribute marginally once history length becomes large and only a limited portion of history is most relevant to the target. STCA’s linear scaling unlocks practical modeling of sequences at lengths previously considered infeasible for real-time systems, while experimental ablations confirm that the accuracy-cost tradeoff is favorable compared to existing alternatives.

In summary, STCA redefines the design landscape for long-sequence attention modules in recommender systems by enabling linear compute scaling, maintaining end-to-end differentiability, and matching Transformer accuracy at a small fraction of the cost. Its integration with Request Level Batching and length-extrapolative training further extends its capability for industrial-scale deployments with strict latency and resource constraints (Guan et al., 8 Nov 2025).
