The paper introduces Generative Recommenders (GRs) that reformulate recommendation challenges as sequential transduction tasks. A new architecture called Hierarchical Sequential Transduction Units (HSTU) is proposed, which is designed for high cardinality and non-stationary streaming recommendation data. The paper demonstrates that HSTU outperforms existing Deep Learning Recommendation Models (DLRMs) and Transformers, achieving significant speedups and quality improvements in both offline and online settings. The key contributions and findings of the paper are summarized below.
Generative Recommenders (GRs)
- The paper introduces a new paradigm called Generative Recommenders (GRs), which replaces traditional DLRMs.
- GRs unify the heterogeneous feature space of DLRMs by consolidating categorical (sparse) features into a single time series; numerical (dense) features are removed, on the premise that a sufficiently expressive sequential transduction architecture can capture their information as sequence length and compute increase.
- The authors reformulate ranking and retrieval tasks as sequential transduction tasks, enabling model training in a sequential, generative fashion.
- Generative training amortizes encoder costs across multiple targets, reducing computational complexity.
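To make the reformulation concrete, the following Python sketch shows how a user's heterogeneous engagement events could be consolidated into one chronological sequence whose shifted copy serves as the generative training target, so a single encoder pass supervises every position. This is not code from the paper; the `Event` record and the `to_transduction_sequence` helper are illustrative assumptions.

```python
# A minimal sketch (assumptions, not the paper's code) of consolidating a
# user's events into a single time-ordered sequence for generative training.
from dataclasses import dataclass

@dataclass
class Event:
    timestamp: int   # unix seconds
    item_id: int     # high-cardinality categorical id
    action: str      # e.g. "click", "like", "share"

def to_transduction_sequence(events):
    """Merge all of a user's events into one chronological token sequence.

    Each position becomes a token (item, action, time); training is generative:
    position i is the input and position i+1 the target, so one encoder pass
    supervises every position at once (amortized encoder cost).
    """
    ordered = sorted(events, key=lambda e: e.timestamp)
    tokens = [(e.item_id, e.action, e.timestamp) for e in ordered]
    inputs, targets = tokens[:-1], tokens[1:]
    return inputs, targets

if __name__ == "__main__":
    history = [
        Event(1700000300, item_id=42, action="like"),
        Event(1700000100, item_id=7,  action="click"),
        Event(1700000200, item_id=99, action="share"),
    ]
    inputs, targets = to_transduction_sequence(history)
    print(inputs)   # [(7, 'click', ...), (99, 'share', ...)]
    print(targets)  # [(99, 'share', ...), (42, 'like', ...)]
```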
Hierarchical Sequential Transduction Units (HSTU)
- A new sequential transduction architecture, Hierarchical Sequential Transduction Units (HSTU), is proposed to address computational cost challenges during training and inference.
- HSTU modifies the attention mechanism for large, non-stationary vocabularies and exploits characteristics of recommendation datasets to achieve a 5.3x to 15.2x speedup over FlashAttention2-based Transformers on 8192-length sequences.
- HSTU comprises Pointwise Projection (Equation 1), Spatial Aggregation (Equation 2), and Pointwise Transformation (Equation 3) sub-layers (see the sketch after this list):
- Equation 1 (Pointwise Projection): $U(X), V(X), Q(X), K(X) = \text{Split}(\phi_1(f_1(X)))$
  - $X$: Input
  - $U(X)$: Gating weights
  - $V(X)$: Values
  - $Q(X)$: Queries
  - $K(X)$: Keys
  - $f_1$: MLP (one linear layer)
  - $\phi_1$: nonlinearity (SiLU)
- Equation 2 (Spatial Aggregation): $A(X)V(X) = \phi_2\left(Q(X)K(X)^T + \mathrm{rab}^{p,t}\right)V(X)$
  - $A(X)$: Attention weights
  - $\mathrm{rab}^{p,t}$: Relative attention bias incorporating positional ($p$) and temporal ($t$) information
  - $\phi_2$: nonlinearity (SiLU)
- Equation 3 (Pointwise Transformation): $Y(X) = f_2\left(\mathrm{Norm}\left(A(X)V(X) \odot U(X)\right)\right)$
  - $Y(X)$: Output
  - $\mathrm{Norm}$: Layer norm
  - $f_2$: MLP (one linear layer)
- HSTU adopts a new pointwise aggregated attention mechanism in place of softmax attention; the layer norm applied after pointwise pooling is needed to stabilize training.
- The architecture leverages and algorithmically increases sparsity via Stochastic Length (SL), reducing encoder cost.
- Sparsity is introduced via Stochastic Length (SL), which selects the input sequence according to the following criteria (see the sketch after this list):
  - $(x_i)_{i=0}^{n_{c,j}}$ if $n_{c,j} \leq N_c^{\alpha/2}$
  - $(x_i)_{i=0}^{n_{c,j}}$ if $n_{c,j} > N_c^{\alpha/2}$, with probability $N_c^{\alpha} / n_{c,j}^2$
  - $(x_i)_{i=0}^{n_{c,j}}$: user $j$'s history as a sequence, where $n_{c,j}$ is the number of contents user $j$ interacted with.
  - otherwise, a subsequence of length $N_c^{\alpha/2}$ constructed from the original sequence is used.
- Activation memory usage is minimized through a simplified and fully fused design, reducing the number of linear layers and aggressively fusing computations into single operators.
- The algorithm M-FALCON (Microbatched-Fast Attention Leveraging Cacheable OperatioNs) performs inference for $m$ candidates with an input sequence of size $n$, processing candidates in microbatches and reusing cached computation across them (sketched below).
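The sketch referenced above: a minimal, single-head PyTorch rendering of the three HSTU sub-layers (Equations 1-3). It is an illustrative sketch, not the paper's fused implementation; the causal mask, the $1/n$ scaling of the pointwise attention, and the hyperparameters are assumptions, and the relative attention bias is taken as an optional precomputed tensor.

```python
# Minimal HSTU layer sketch (assumptions noted inline), not the fused kernel.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HSTULayerSketch(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.f1 = nn.Linear(d_model, 4 * d_model)  # f1: one linear layer producing U, V, Q, K
        self.norm = nn.LayerNorm(d_model)          # Norm in Equation 3
        self.f2 = nn.Linear(d_model, d_model)      # f2: one linear layer

    def forward(self, x, rab=None):
        # x: [batch, seq_len, d_model]; rab: optional [seq_len, seq_len] bias (rab^{p,t})
        n = x.shape[1]

        # Equation 1 (Pointwise Projection): U, V, Q, K = Split(phi1(f1(X)))
        u, v, q, k = F.silu(self.f1(x)).chunk(4, dim=-1)

        # Equation 2 (Spatial Aggregation): pointwise SiLU attention instead of softmax.
        scores = q @ k.transpose(-2, -1)
        if rab is not None:
            scores = scores + rab  # relative positional/temporal attention bias
        causal = torch.tril(torch.ones(n, n, dtype=torch.bool, device=x.device))
        attn = F.silu(scores) * causal / n  # causal mask; the 1/n scaling is an illustrative choice
        av = attn @ v

        # Equation 3 (Pointwise Transformation): Y = f2(Norm(A(X)V(X) ⊙ U(X)))
        return self.f2(self.norm(av * u))

if __name__ == "__main__":
    layer = HSTULayerSketch(d_model=64)
    out = layer(torch.randn(2, 16, 64))
    print(out.shape)  # torch.Size([2, 16, 64])
```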
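Next, a hedged sketch of the Stochastic Length selection rule listed above: short histories are kept whole, long histories are kept whole only with probability $N_c^{\alpha}/n^2$, and otherwise a length-$N_c^{\alpha/2}$ subsequence is used. Only those criteria come from the summary; the uniform-subsampling strategy and the `stochastic_length` name are illustrative assumptions.

```python
# Stochastic Length (SL) sketch; the subsampling strategy is an assumption.
import random

def stochastic_length(history, max_len_n: int, alpha: float, rng=random):
    """Keep short histories whole; keep long ones whole only rarely, else subsample."""
    n = len(history)
    threshold = max_len_n ** (alpha / 2)          # N_c^(alpha/2)
    if n <= threshold:
        return history                            # case 1: keep the full sequence
    if rng.random() < (max_len_n ** alpha) / (n ** 2):
        return history                            # case 2: keep it with prob N_c^alpha / n^2
    keep = sorted(rng.sample(range(n), k=int(threshold)))
    return [history[i] for i in keep]             # otherwise: order-preserving subsequence

if __name__ == "__main__":
    hist = list(range(500))
    print(len(stochastic_length(hist, max_len_n=8192, alpha=1.0)))  # usually 90 (= floor(8192**0.5))
```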
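Finally, a control-flow sketch of the M-FALCON idea referenced above: encode the length-$n$ user history once, then score the $m$ candidates in microbatches that reuse the cached computation. The helper names (`encode_history`, `score_with_cache`) and the microbatch size are hypothetical, and the masking that keeps candidates from attending to one another is assumed to live inside the scorer; this is not the paper's implementation.

```python
# M-FALCON-style microbatched scoring sketch (hypothetical helpers).
from typing import Callable, List, Sequence

def m_falcon_style_inference(
    history_tokens: Sequence[int],
    candidates: Sequence[int],
    encode_history: Callable[[Sequence[int]], object],
    score_with_cache: Callable[[object, Sequence[int]], List[float]],
    microbatch_size: int = 256,
) -> List[float]:
    """Score m candidates against one user history, paying the encoder cost once."""
    cache = encode_history(history_tokens)            # encode the length-n history once
    scores: List[float] = []
    for start in range(0, len(candidates), microbatch_size):
        batch = candidates[start:start + microbatch_size]
        # Each microbatch reuses the cached history; candidate-vs-candidate
        # attention is assumed to be masked out inside the scorer.
        scores.extend(score_with_cache(cache, batch))
    return scores

if __name__ == "__main__":
    # Dummy stand-ins so the sketch runs end to end.
    encode = lambda hist: sum(hist)
    score = lambda cache, batch: [float(cache + c) for c in batch]
    print(m_falcon_style_inference([1, 2, 3], list(range(5)), encode, score, microbatch_size=2))
```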
Experimental Results
- HSTU outperforms baselines on synthetic and public datasets by up to 65.8% in NDCG.
- HSTU-based GRs, with 1.5 trillion parameters, improve metrics in online A/B tests by 12.4% and have been deployed on multiple surfaces of a large internet platform with billions of users.
- HSTU is up to 15.2x and 5.6x more efficient than Transformers in training and inference, respectively.
- The model quality of GRs empirically scales as a power-law of training compute across three orders of magnitude, up to GPT-3/LLaMa-2 scale, reducing the carbon footprint needed for future model development.
- GRs achieve 1.50x/2.99x higher QPS when scoring 1024/16384 candidates, despite being 285x more computationally complex than production DLRMs.