Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations (2402.17152v3)

Published 27 Feb 2024 in cs.LG and cs.IR

Abstract: Large-scale recommendation systems are characterized by their reliance on high cardinality, heterogeneous features and the need to handle tens of billions of user actions on a daily basis. Despite being trained on huge volume of data with thousands of features, most Deep Learning Recommendation Models (DLRMs) in industry fail to scale with compute. Inspired by success achieved by Transformers in language and vision domains, we revisit fundamental design choices in recommendation systems. We reformulate recommendation problems as sequential transduction tasks within a generative modeling framework ("Generative Recommenders"), and propose a new architecture, HSTU, designed for high cardinality, non-stationary streaming recommendation data. HSTU outperforms baselines over synthetic and public datasets by up to 65.8% in NDCG, and is 5.3x to 15.2x faster than FlashAttention2-based Transformers on 8192 length sequences. HSTU-based Generative Recommenders, with 1.5 trillion parameters, improve metrics in online A/B tests by 12.4% and have been deployed on multiple surfaces of a large internet platform with billions of users. More importantly, the model quality of Generative Recommenders empirically scales as a power-law of training compute across three orders of magnitude, up to GPT-3/LLaMa-2 scale, which reduces carbon footprint needed for future model developments, and further paves the way for the first foundational models in recommendations.

The paper introduces Generative Recommenders (GRs) that reformulate recommendation challenges as sequential transduction tasks. A new architecture called Hierarchical Sequential Transduction Units (HSTU) is proposed, which is designed for high cardinality and non-stationary streaming recommendation data. The paper demonstrates that HSTU outperforms existing Deep Learning Recommendation Models (DLRMs) and Transformers, achieving significant speedups and quality improvements in both offline and online settings. The key contributions and findings of the paper are summarized below.

Generative Recommenders (GRs)

  • The paper introduces a new paradigm called Generative Recommenders (GRs), which replaces traditional DLRMs.
  • GRs unify the heterogeneous feature space of DLRMs by sequentializing categorical features into a single chronological time series. Numerical features are dropped entirely, on the assumption that a sufficiently expressive sequential transduction architecture can recover the information they carry as sequence length and compute grow.
  • The authors reformulate ranking and retrieval tasks as sequential transduction tasks, enabling model training in a sequential, generative fashion.
  • Generative training amortizes encoder costs across multiple targets, reducing computational complexity (a minimal sequence-construction sketch follows this list).
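
As a concrete illustration of this reformulation, the sketch below collapses a user's heterogeneous engagement log into a single chronological sequence and derives next-item targets for generative training, so that one encoder pass supervises many positions. The `Event` record and the `build_training_example` helper are hypothetical names chosen for illustration, not from the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Hypothetical event record: one user interaction (item id, action type, timestamp).
# The paper consolidates heterogeneous categorical features into one time series;
# the concrete field names here are illustrative assumptions.
@dataclass
class Event:
    item_id: int
    action_id: int   # e.g. impression / click / like, encoded as a categorical id
    timestamp: int

def build_training_example(events: Sequence[Event]) -> Tuple[List[Event], List[int]]:
    """Sort a user's events chronologically and derive next-item targets.

    Generative training supervises every position at once: the model sees the
    prefix up to step t and predicts the item engaged with at step t+1, so a
    single encoder pass is amortized across many targets.
    """
    seq = sorted(events, key=lambda e: e.timestamp)
    inputs = seq[:-1]                        # prefix presented to the transducer
    targets = [e.item_id for e in seq[1:]]   # next-item label at every position
    return inputs, targets
```

In the paper's formulation, ranking additionally interleaves action tokens with item tokens; the sketch covers only a retrieval-style next-item setup.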

Hierarchical Sequential Transduction Units (HSTU)

  • A new sequential transduction architecture, Hierarchical Sequential Transduction Units (HSTU), is proposed to address computational cost challenges during training and inference.
  • HSTU modifies the attention mechanism for large, non-stationary vocabularies and exploits characteristics of recommendation datasets to achieve a 5.3x to 15.2x speedup over FlashAttention2-based Transformers on 8192-length sequences.
  • HSTU comprises Pointwise Projection (Equation 1), Spatial Aggregation (Equation 2), and Pointwise Transformation (Equation 3) sub-layers (a PyTorch sketch of one layer follows this list):
    • $U(X), V(X), Q(X), K(X) = \text{Split}(\phi_1(f_1(X)))$
      • $X$: Input
      • $U(X)$: Gating weights
      • $V(X)$: Values
      • $Q(X)$: Queries
      • $K(X)$: Keys
      • $f_1$: MLP (one linear layer)
      • $\phi_1$: nonlinearity (SiLU)
    • $A(X)V(X) = \phi_2\left(Q(X)K(X)^T + \text{rab}^{p,t}\right)V(X)$
      • $A(X)$: Attention weights
      • $\text{rab}^{p,t}$: Relative attention bias incorporating positional ($p$) and temporal ($t$) information
      • $\phi_2$: nonlinearity (SiLU)
    • $Y(X) = f_2\left(\text{Norm}\left(A(X)V(X)\right) \odot U(X)\right)$
      • $Y(X)$: Output
      • $\text{Norm}$: Layer norm
      • $f_2$: MLP (one linear layer)
  • A new pointwise aggregated attention mechanism is adopted; a layer norm after pointwise pooling is needed to stabilize training.
  • The architecture leverages and algorithmically increases sparsity via Stochastic Length (SL), reducing encoder cost (a sampling sketch follows this list).
    • Sparsity is introduced via Stochastic Length (SL) according to the following criteria:
    • $(x_i)_{i=0}^{n_{c,j}}$ if $n_{c,j} \leq N_c^{\alpha/2}$
    • $(x_{i_k})_{k=0}^{N_c^{\alpha/2}}$ if $n_{c,j} > N_c^{\alpha/2}$, with probability $1 - N_c^{\alpha} / n_{c,j}^2$
    • $(x_i)_{i=0}^{n_{c,j}}$ if $n_{c,j} > N_c^{\alpha/2}$, with probability $N_c^{\alpha} / n_{c,j}^2$
    • $(x_i)_{i=0}^{n_{c,j}}$: user $j$'s history as a sequence, where $n_{c,j}$ is the number of contents the user has interacted with
    • $N_c = \max_j n_{c,j}$
    • $(x_{i_k})_{k=0}^{L}$: a subsequence of length $L$ constructed from the original sequence $(x_i)_{i=0}^{n_{c,j}}$
  • Activation memory usage is minimized through a simplified and fully fused design, reducing the number of linear layers and aggressively fusing computations into single operators.
  • The algorithm M-FALCON (Microbatched-Fast Attention Leveraging Cacheable OperatioNs) performs inference for $m$ candidates with an input sequence of length $n$ (a minimal sketch follows this list).
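
To make Equations 1–3 concrete, here is a minimal single-head PyTorch sketch of one HSTU layer. It mirrors the three sub-layers listed above (pointwise projection with SiLU, softmax-free pointwise spatial aggregation with a relative attention bias, and a gated pointwise transformation), but the tensor shapes, the precomputed `rab` bias, and the single-head simplification are assumptions made for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HSTULayerSketch(nn.Module):
    """Single-head sketch of one HSTU block (Equations 1-3); not the released code."""

    def __init__(self, d_model: int, d_qk: int, d_v: int):
        super().__init__()
        # Eq. 1: a single linear layer f1 produces U, V, Q, K in one projection.
        self.f1 = nn.Linear(d_model, 2 * d_v + 2 * d_qk)
        self.d_qk, self.d_v = d_qk, d_v
        # Eq. 3: layer norm on the pooled values, then output MLP f2.
        self.norm = nn.LayerNorm(d_v)
        self.f2 = nn.Linear(d_v, d_model)

    def forward(self, x: torch.Tensor, rab: torch.Tensor) -> torch.Tensor:
        # x:   (batch, seq_len, d_model)
        # rab: (seq_len, seq_len) relative attention bias combining positional
        #      and temporal information (its construction is omitted here).
        u, v, q, k = torch.split(
            F.silu(self.f1(x)), [self.d_v, self.d_v, self.d_qk, self.d_qk], dim=-1
        )
        # Eq. 2: pointwise (softmax-free) spatial aggregation with phi_2 = SiLU.
        scores = F.silu(torch.einsum("bnd,bmd->bnm", q, k) + rab)
        pooled = torch.einsum("bnm,bmd->bnd", scores, v)
        # Eq. 3: normalize the pooled values, gate with U(X), project back.
        return self.f2(self.norm(pooled) * u)
```

A causal mask and jagged per-user sequence lengths would also be applied inside the aggregation in practice; they are omitted here for readability.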
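
The Stochastic Length rule above amounts to a per-user coin flip: histories no longer than $N_c^{\alpha/2}$ pass through unchanged, while longer histories are kept whole only with probability $N_c^{\alpha}/n_{c,j}^2$ and otherwise subsampled to length $N_c^{\alpha/2}$. Below is a minimal Python sketch of that selection, assuming uniform subsampling without replacement (the paper's subsequence construction is more specific).

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")

def stochastic_length(history: Sequence[T], max_len_nc: int, alpha: float) -> List[T]:
    """Apply the Stochastic Length (SL) selection rule to one user's history.

    history    : the user's chronological interactions, length n_cj
    max_len_nc : N_c, the maximum history length across users
    alpha      : sparsity exponent; the length threshold is N_c ** (alpha / 2)
    """
    n_cj = len(history)
    threshold = int(max_len_nc ** (alpha / 2.0))
    if n_cj <= threshold:
        return list(history)              # short histories pass through unchanged
    keep_full_prob = (max_len_nc ** alpha) / (n_cj ** 2)
    if random.random() < keep_full_prob:
        return list(history)              # occasionally keep the full sequence
    # Otherwise subsample down to the threshold length. Uniform sampling is an
    # assumption of this sketch; the paper's construction is more specific.
    idx = sorted(random.sample(range(n_cj), threshold))
    return [history[i] for i in idx]
```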
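
M-FALCON's stated role is candidate scoring: the length-$n$ user sequence is processed once, and the $m$ candidates are scored in microbatches that reuse the cached sequence computation. The sketch below shows only that batching loop; `encode_sequence`, `score_with_cached_sequence`, and the microbatch size are hypothetical placeholders, not APIs from the paper or its released code.

```python
import torch

def m_falcon_sketch(model, user_seq_emb: torch.Tensor,
                    candidate_embs: torch.Tensor, microbatch_size: int = 256):
    """Score m candidates against one user sequence in microbatches.

    user_seq_emb   : (n, d) encoded user history
    candidate_embs : (m, d) candidate item embeddings
    The per-sequence encoder work is done once and reused for every microbatch,
    which is the cacheable-operations idea behind M-FALCON.
    """
    with torch.no_grad():
        # Hypothetical API: encode the user sequence once and keep the cache.
        cache = model.encode_sequence(user_seq_emb)
        scores = []
        for start in range(0, candidate_embs.shape[0], microbatch_size):
            chunk = candidate_embs[start:start + microbatch_size]
            # Hypothetical API: attend the candidate chunk over the cached states.
            scores.append(model.score_with_cached_sequence(cache, chunk))
    return torch.cat(scores, dim=0)
```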

Experimental Results

  • HSTU outperforms baselines over synthetic and public datasets by up to 65.8% in NDCG.
  • HSTU-based GRs, with 1.5 trillion parameters, improve metrics in online A/B tests by 12.4% and have been deployed on multiple surfaces of a large internet platform with billions of users.
  • HSTU is up to 15.2x and 5.6x more efficient than Transformers in training and inference, respectively.
  • The model quality of GRs empirically scales as a power-law of training compute across three orders of magnitude, up to GPT-3/LLaMa-2 scale, reducing the carbon footprint needed for future model developments.
  • GR achieves 1.50x/2.99x higher QPS when scoring 1024/16384 candidates, despite the GR model being 285x more computationally complex than production DLRMs.
Authors (12)
  1. Jiaqi Zhai (9 papers)
  2. Lucy Liao (1 paper)
  3. Xing Liu (97 papers)
  4. Yueming Wang (23 papers)
  5. Rui Li (384 papers)
  6. Xuan Cao (19 papers)
  7. Leon Gao (2 papers)
  8. Zhaojie Gong (2 papers)
  9. Fangda Gu (7 papers)
  10. Michael He (2 papers)
  11. Yinghai Lu (5 papers)
  12. Yu Shi (153 papers)
Citations (21)