The paper introduces Generative Recommenders (GRs) that reformulate recommendation challenges as sequential transduction tasks. A new architecture called Hierarchical Sequential Transduction Units (HSTU) is proposed, which is designed for high cardinality and non-stationary streaming recommendation data. The paper demonstrates that HSTU outperforms existing Deep Learning Recommendation Models (DLRMs) and Transformers, achieving significant speedups and quality improvements in both offline and online settings. The key contributions and findings of the paper are summarized below.
Generative Recommenders (GRs)
- The paper introduces a new paradigm called Generative Recommenders (GRs), which replaces traditional DLRMs.
- GRs unify the heterogeneous feature space of DLRMs by consolidating categorical (sparse) features into a single time series; numerical (dense) features are removed, on the premise that a sufficiently expressive sequential transduction architecture can capture their information as sequence length and compute increase.
- The authors reformulate ranking and retrieval tasks as sequential transduction tasks, enabling model training in a sequential, generative fashion.
- Generative training amortizes encoder costs across multiple targets, reducing computational complexity.
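To make the reformulation concrete, the following Python sketch shows how a user's heterogeneous engagement events could be consolidated into one chronological sequence whose shifted copy serves as the generative training target, so a single encoder pass supervises every position. This is not code from the paper; the `Event` record and the `to_transduction_sequence` helper are illustrative assumptions.

```python
# A minimal sketch (assumptions, not the paper's code) of consolidating a
# user's events into a single time-ordered sequence for generative training.
from dataclasses import dataclass

@dataclass
class Event:
    timestamp: int   # unix seconds
    item_id: int     # high-cardinality categorical id
    action: str      # e.g. "click", "like", "share"

def to_transduction_sequence(events):
    """Merge all of a user's events into one chronological token sequence.

    Each position becomes a token (item, action, time); training is generative:
    position i is the input and position i+1 the target, so one encoder pass
    supervises every position at once (amortized encoder cost).
    """
    ordered = sorted(events, key=lambda e: e.timestamp)
    tokens = [(e.item_id, e.action, e.timestamp) for e in ordered]
    inputs, targets = tokens[:-1], tokens[1:]
    return inputs, targets

if __name__ == "__main__":
    history = [
        Event(1700000300, item_id=42, action="like"),
        Event(1700000100, item_id=7,  action="click"),
        Event(1700000200, item_id=99, action="share"),
    ]
    inputs, targets = to_transduction_sequence(history)
    print(inputs)   # [(7, 'click', ...), (99, 'share', ...)]
    print(targets)  # [(99, 'share', ...), (42, 'like', ...)]
```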
Hierarchical Sequential Transduction Units (HSTU)
- A new sequential transduction architecture, Hierarchical Sequential Transduction Units (HSTU), is proposed to address computational cost challenges during training and inference.
- HSTU modifies the attention mechanism for large, non-stationary vocabularies and exploits characteristics of recommendation datasets to achieve a 5.3x to 15.2x speedup over FlashAttention2-based Transformers on 8192-length sequences.
- HSTU comprises Pointwise Projection (Equation 1), Spatial Aggregation (Equation 2), and Pointwise Transformation (Equation 3) sub-layers (see the sketch after this list):
- Equation 1 (Pointwise Projection): $U(X), V(X), Q(X), K(X) = \text{Split}(\phi_1(f_1(X)))$
  - $X$: Input
  - $U(X)$: Gating weights
  - $V(X)$: Values
  - $Q(X)$: Queries
  - $K(X)$: Keys
  - $f_1$: MLP (one linear layer)
  - $\phi_1$: nonlinearity (SiLU)
- Equation 2 (Spatial Aggregation): $A(X)V(X) = \phi_2\left(Q(X)K(X)^T + \mathrm{rab}^{p,t}\right)V(X)$
  - $A(X)$: Attention weights
  - $\mathrm{rab}^{p,t}$: Relative attention bias incorporating positional ($p$) and temporal ($t$) information
  - $\phi_2$: nonlinearity (SiLU)
- Equation 3 (Pointwise Transformation): $Y(X) = f_2\left(\mathrm{Norm}\left(A(X)V(X) \odot U(X)\right)\right)$
  - $Y(X)$: Output
  - $\mathrm{Norm}$: Layer norm
  - $f_2$: MLP (one linear layer)
- HSTU adopts a new pointwise aggregated attention mechanism in place of softmax attention; the layer norm applied after pointwise pooling is needed to stabilize training.
- The architecture leverages and algorithmically increases sparsity via Stochastic Length (SL), reducing encoder cost.
- Sparsity is introduced via Stochastic Length (SL), which selects the input sequence according to the following criteria (see the sketch after this list):
  - $(x_i)_{i=0}^{n_{c,j}}$ if $n_{c,j} \leq N_c^{\alpha/2}$
  - $(x_i)_{i=0}^{n_{c,j}}$ if $n_{c,j} > N_c^{\alpha/2}$, with probability $N_c^{\alpha} / n_{c,j}^2$
  - $(x_i)_{i=0}^{n_{c,j}}$: user $j$'s history as a sequence, where $n_{c,j}$ is the number of contents user $j$ interacted with.
  - otherwise, a subsequence of length $N_c^{\alpha/2}$ constructed from the original sequence is used.
- Activation memory usage is minimized through a simplified and fully fused design, reducing the number of linear layers and aggressively fusing computations into single operators.
- The algorithm M-FALCON (Microbatched-Fast Attention Leveraging Cacheable OperatioNs) performs inference for $m$ candidates with an input sequence of size $n$, processing candidates in microbatches and reusing cached computation across them (sketched below).
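The sketch referenced above: a minimal, single-head PyTorch rendering of the three HSTU sub-layers (Equations 1-3). It is an illustrative sketch, not the paper's fused implementation; the causal mask, the $1/n$ scaling of the pointwise attention, and the hyperparameters are assumptions, and the relative attention bias is taken as an optional precomputed tensor.

```python
# Minimal HSTU layer sketch (assumptions noted inline), not the fused kernel.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HSTULayerSketch(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.f1 = nn.Linear(d_model, 4 * d_model)  # f1: one linear layer producing U, V, Q, K
        self.norm = nn.LayerNorm(d_model)          # Norm in Equation 3
        self.f2 = nn.Linear(d_model, d_model)      # f2: one linear layer

    def forward(self, x, rab=None):
        # x: [batch, seq_len, d_model]; rab: optional [seq_len, seq_len] bias (rab^{p,t})
        n = x.shape[1]

        # Equation 1 (Pointwise Projection): U, V, Q, K = Split(phi1(f1(X)))
        u, v, q, k = F.silu(self.f1(x)).chunk(4, dim=-1)

        # Equation 2 (Spatial Aggregation): pointwise SiLU attention instead of softmax.
        scores = q @ k.transpose(-2, -1)
        if rab is not None:
            scores = scores + rab  # relative positional/temporal attention bias
        causal = torch.tril(torch.ones(n, n, dtype=torch.bool, device=x.device))
        attn = F.silu(scores) * causal / n  # causal mask; the 1/n scaling is an illustrative choice
        av = attn @ v

        # Equation 3 (Pointwise Transformation): Y = f2(Norm(A(X)V(X) ⊙ U(X)))
        return self.f2(self.norm(av * u))

if __name__ == "__main__":
    layer = HSTULayerSketch(d_model=64)
    out = layer(torch.randn(2, 16, 64))
    print(out.shape)  # torch.Size([2, 16, 64])
```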
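Next, a hedged sketch of the Stochastic Length selection rule listed above: short histories are kept whole, long histories are kept whole only with probability $N_c^{\alpha}/n^2$, and otherwise a length-$N_c^{\alpha/2}$ subsequence is used. Only those criteria come from the summary; the uniform-subsampling strategy and the `stochastic_length` name are illustrative assumptions.

```python
# Stochastic Length (SL) sketch; the subsampling strategy is an assumption.
import random

def stochastic_length(history, max_len_n: int, alpha: float, rng=random):
    """Keep short histories whole; keep long ones whole only rarely, else subsample."""
    n = len(history)
    threshold = max_len_n ** (alpha / 2)          # N_c^(alpha/2)
    if n <= threshold:
        return history                            # case 1: keep the full sequence
    if rng.random() < (max_len_n ** alpha) / (n ** 2):
        return history                            # case 2: keep it with prob N_c^alpha / n^2
    keep = sorted(rng.sample(range(n), k=int(threshold)))
    return [history[i] for i in keep]             # otherwise: order-preserving subsequence

if __name__ == "__main__":
    hist = list(range(500))
    print(len(stochastic_length(hist, max_len_n=8192, alpha=1.0)))  # usually 90 (= floor(8192**0.5))
```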
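Finally, a control-flow sketch of the M-FALCON idea referenced above: encode the length-$n$ user history once, then score the $m$ candidates in microbatches that reuse the cached computation. The helper names (`encode_history`, `score_with_cache`) and the microbatch size are hypothetical, and the masking that keeps candidates from attending to one another is assumed to live inside the scorer; this is not the paper's implementation.

```python
# M-FALCON-style microbatched scoring sketch (hypothetical helpers).
from typing import Callable, List, Sequence

def m_falcon_style_inference(
    history_tokens: Sequence[int],
    candidates: Sequence[int],
    encode_history: Callable[[Sequence[int]], object],
    score_with_cache: Callable[[object, Sequence[int]], List[float]],
    microbatch_size: int = 256,
) -> List[float]:
    """Score m candidates against one user history, paying the encoder cost once."""
    cache = encode_history(history_tokens)            # encode the length-n history once
    scores: List[float] = []
    for start in range(0, len(candidates), microbatch_size):
        batch = candidates[start:start + microbatch_size]
        # Each microbatch reuses the cached history; candidate-vs-candidate
        # attention is assumed to be masked out inside the scorer.
        scores.extend(score_with_cache(cache, batch))
    return scores

if __name__ == "__main__":
    # Dummy stand-ins so the sketch runs end to end.
    encode = lambda hist: sum(hist)
    score = lambda cache, batch: [float(cache + c) for c in batch]
    print(m_falcon_style_inference([1, 2, 3], list(range(5)), encode, score, microbatch_size=2))
```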
Experimental Results
- HSTU outperforms baselines on synthetic and public datasets by up to 65.8% in NDCG.
- HSTU-based GRs, with 1.5 trillion parameters, improve metrics in online A/B tests by 12.4% and have been deployed on multiple surfaces of a large internet platform with billions of users.
- HSTU is up to 15.2x and 5.6x more efficient than Transformers in training and inference, respectively.
- The model quality of GRs empirically scales as a power-law of training compute across three orders of magnitude, up to GPT-3/LLaMa-2 scale, reducing the carbon footprint needed for future model development.
- GRs achieve 1.50x/2.99x higher QPS when scoring 1024/16384 candidates, despite being 285x more computationally complex than production DLRMs.