Wukong: Towards a Scaling Law for Large-Scale Recommendation (2403.02545v4)

Published 4 Mar 2024 in cs.LG and cs.AI

Abstract: Scaling laws play an instrumental role in the sustainable improvement in model quality. Unfortunately, recommendation models to date do not exhibit such laws similar to those observed in the domain of LLMs, due to the inefficiencies of their upscaling mechanisms. This limitation poses significant challenges in adapting these models to increasingly more complex real-world datasets. In this paper, we propose an effective network architecture based purely on stacked factorization machines, and a synergistic upscaling strategy, collectively dubbed Wukong, to establish a scaling law in the domain of recommendation. Wukong's unique design makes it possible to capture diverse, any-order of interactions simply through taller and wider layers. We conducted extensive evaluations on six public datasets, and our results demonstrate that Wukong consistently outperforms state-of-the-art models quality-wise. Further, we assessed Wukong's scalability on an internal, large-scale dataset. The results show that Wukong retains its superiority in quality over state-of-the-art models, while holding the scaling law across two orders of magnitude in model complexity, extending beyond 100 GFLOP/example, where prior arts fall short.


Summary

  • The paper presents Wukong, an innovative network built on stacked factorization machines to capture high-order feature interactions in recommendation tasks.
  • It emphasizes dense scaling, upscaling the interaction components rather than merely expanding embedding tables, improving quality while keeping infrastructure costs flat or lower.
  • Empirical evaluations across six public datasets and a large internal dataset demonstrate consistent AUC gains, with quality continuing to improve across two orders of magnitude in model complexity, beyond 100 GFLOP per example.

Analyzing "Wukong: Towards a Scaling Law for Large-Scale Recommendation"

The paper "Wukong: Towards a Scaling Law for Large-Scale Recommendation" presents the development of a novel network architecture, Wukong, designed explicitly for improving recommendation models' scalability and efficiency. The research addresses a persistent challenge—establishing scaling laws for recommendation systems comparable to those observed in LLMs.

Core Contributions

  1. Architecture Design: Wukong is built on stacked factorization machines (FMs), a design that distinguishes it from traditional recommendation models. The architecture captures diverse, high-order interactions among features simply by employing deeper and wider network layers, a property that is particularly valuable for recommendation tasks requiring rich interaction modeling (a minimal sketch of one such layer follows this list).
  
  2. Scalable Interaction Component: Unlike existing models, which often rely heavily on expanding embedding tables (sparse scaling), Wukong emphasizes dense scaling. By focusing on upscaling interaction components rather than just increasing the size of embedding tables, Wukong achieves better quality improvements while maintaining or reducing infrastructure costs.
  3. Empirical Validation: The authors conducted extensive evaluations across six public datasets and one large-scale proprietary dataset. Across varying model complexities—extending beyond 100 GFLOP/example—Wukong consistently outperforms state-of-the-art baseline models. This consistent performance demonstrates its robustness across different complexity scales and diverse datasets.
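
To make the layered design concrete, below is a minimal PyTorch sketch of one stacked-FM layer. It follows the paper's high-level description: a factorization-machine block (FMB) that computes pairwise dot-product interactions and refines them with an MLP, a linear compression block (LCB) that projects the input embeddings down, and a residual connection with layer normalization to keep deep stacks trainable. The class name, the default hyperparameters (k_fmb, k_lcb, hidden), and the exact placement of the normalization are illustrative choices, not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class WukongLayerSketch(nn.Module):
    """One stacked-FM layer (illustrative sketch, not the reference code).

    Takes a batch of n embeddings of dimension d and emits
    (k_fmb + k_lcb) embeddings of the same dimension, so layers
    can be stacked by matching the next layer's `n` to this output.
    """

    def __init__(self, n: int, d: int, k_fmb: int = 16, k_lcb: int = 16,
                 hidden: int = 256):
        super().__init__()
        # FMB: flatten the (n x n) pairwise dot-product matrix, refine
        # it with an MLP, and reshape into k_fmb output embeddings.
        self.fmb_mlp = nn.Sequential(
            nn.Linear(n * n, hidden),
            nn.ReLU(),
            nn.Linear(hidden, k_fmb * d),
        )
        # LCB: linearly compress the n input embeddings into k_lcb.
        self.lcb = nn.Linear(n, k_lcb, bias=False)
        # Residual projection so the skip connection matches shapes.
        self.res = nn.Linear(n, k_fmb + k_lcb, bias=False)
        self.norm = nn.LayerNorm(d)
        self.k_fmb, self.d = k_fmb, d

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, n, d)
        inter = torch.bmm(x, x.transpose(1, 2))          # (B, n, n) dot products
        fmb = self.fmb_mlp(inter.flatten(1))             # (B, k_fmb * d)
        fmb = fmb.view(-1, self.k_fmb, self.d)           # (B, k_fmb, d)
        lcb = self.lcb(x.transpose(1, 2)).transpose(1, 2)       # (B, k_lcb, d)
        out = torch.cat([fmb, lcb], dim=1)               # (B, k_fmb + k_lcb, d)
        out = out + self.res(x.transpose(1, 2)).transpose(1, 2) # skip connection
        return self.norm(out)

# Example: stack two layers; the second layer's n must equal the
# first layer's output count (k_fmb + k_lcb = 32 here).
x = torch.randn(8, 40, 64)                  # 8 examples, 40 features, d = 64
layer1 = WukongLayerSketch(n=40, d=64)
layer2 = WukongLayerSketch(n=32, d=64)
print(layer2(layer1(x)).shape)              # torch.Size([8, 32, 64])
```

Stacking such layers composes interactions, so depth raises the achievable interaction order roughly exponentially, while widening (larger k_fmb and hidden) adds capacity at a fixed order. Upscaling along these dense dimensions, rather than growing embedding tables, is what lets the model trade compute for quality.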

Significant Findings

  • Wukong delivers superior predictive accuracy as measured by AUC across all tested datasets, highlighting its efficacy in various recommendation scenarios.
  • The architecture's ability to uphold a scaling law is shown by its continuous quality improvement over two orders of magnitude in model complexity.
  • Wukong scales without a significant loss in efficiency, and its compute-heavy, memory-light profile aligns well with modern hardware, which increasingly favors added compute over added memory.
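
Scaling-law claims of this kind are usually summarized by a power-law fit of quality against per-example compute. The paper presents its results as quality-versus-compute curves rather than a printed formula, so the form below is the standard one from the scaling-law literature, stated here only to make the claim precise:

```latex
% Illustrative power-law form from the scaling-law literature; the
% constants are fitted per model family and are not values reported
% in the Wukong paper.
%   E(C): test error at per-example compute C (FLOP/example)
%   E_inf: irreducible error;  a, b > 0: fitted constants
\[
  E(C) \;\approx\; E_{\infty} + a\,C^{-b}
\]
```

Holding such a law "across two orders of magnitude" means the fitted curve keeps tracking measured quality as C grows by a factor of roughly 100, which is the regime, beyond 100 GFLOP per example, where the paper reports prior architectures falling short.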

Implications and Future Perspectives

The development of Wukong has practical and theoretical implications for future AI and recommendation research. Practically, it provides a scalable backbone for recommendation systems that can adapt to rapidly growing dataset complexity and size without prohibitive computational costs. Theoretically, it opens avenues for exploring scaling laws in domains beyond LLMs, potentially setting a precedent for similar constructs in other machine learning tasks.

Future studies could probe the limits of Wukong's scalability, explore its applicability in other contexts such as sequential recommendation, or examine its compatibility with transformer-based architectures. Moreover, developing efficient serving strategies for such scaled-up models could improve their deployment and real-time usability.

In conclusion, Wukong marks an important stride toward deriving scaling laws in recommendation systems, presenting a robust alternative to upscaling strategies that hinge on merely expanding embedding tables. Its approach to interaction modeling and its efficacy across scales make it valuable to both academic research and practical recommendation systems.
