Wukong: Towards a Scaling Law for Large-Scale Recommendation (2403.02545v4)
Abstract: Scaling laws play an instrumental role in the sustainable improvement of model quality. Unfortunately, recommendation models to date do not exhibit scaling laws like those observed for LLMs, because their upscaling mechanisms are inefficient. This limitation poses significant challenges in adapting these models to increasingly complex real-world datasets. In this paper, we propose an effective network architecture based purely on stacked factorization machines, together with a synergistic upscaling strategy, collectively dubbed Wukong, to establish a scaling law in the domain of recommendation. Wukong's design makes it possible to capture diverse interactions of any order simply through taller and wider layers. We conducted extensive evaluations on six public datasets, and our results demonstrate that Wukong consistently outperforms state-of-the-art models in quality. Further, we assessed Wukong's scalability on an internal, large-scale dataset. The results show that Wukong retains its quality advantage over state-of-the-art models while holding the scaling law across two orders of magnitude in model complexity, extending beyond 100 GFLOP/example, where prior art falls short.
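To make the idea of capturing higher-order interactions by stacking factorization machines concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not Wukong's actual layer, which the abstract does not specify. The names `fm_block`, `stack`, and `random_layers`, the linear projection, and the optional residual connection are all hypothetical choices made for this example. Each block takes pairwise dot products between embeddings (the factorization-machine step) and projects them back to embeddings, so every block doubles the highest interaction order it can represent.

```python
import random


def fm_block(embs, weights, residual=True):
    """One illustrative interaction layer over n embeddings of dimension d."""
    n, d = len(embs), len(embs[0])
    # Factorization-machine step: dot product between every pair of embeddings.
    flat = [sum(a * b for a, b in zip(u, v)) for u in embs for v in embs]
    # Hypothetical linear projection from the n*n interactions to n*d values.
    out = [sum(w * x for w, x in zip(row, flat)) for row in weights]
    new = [out[i * d:(i + 1) * d] for i in range(n)]
    if residual:
        # The residual keeps lower-order terms alongside the new ones.
        new = [[a + b for a, b in zip(r, e)] for r, e in zip(new, embs)]
    return new


def stack(embs, layers, residual=True):
    """Apply the blocks in sequence; depth controls the interaction order."""
    for weights in layers:
        embs = fm_block(embs, weights, residual)
    return embs


def random_layers(n, d, depth, seed=0):
    """Random (n*d) x (n*n) projection matrices, one per block (assumption)."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-0.5, 0.5) for _ in range(n * n)] for _ in range(n * d)]
        for _ in range(depth)
    ]


n, d = 3, 2
layers = random_layers(n, d, depth=2)
x = [[0.1, 0.2], [0.3, -0.1], [0.05, 0.4]]

# Without residuals, two stacked blocks are homogeneous of degree 2**2 = 4,
# so doubling every input multiplies every output by 2**4 = 16.
y = stack(x, layers, residual=False)
y_doubled = stack([[2 * v for v in e] for e in x], layers, residual=False)
```

With `residual=True`, the same stack mixes terms of several orders up to 4 rather than only the top order. That is one plausible reading of how taller stacks could cover a widening range of interaction orders.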