Persia: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters (2111.05897v3)

Published 10 Nov 2021 in cs.LG and cs.DC

Abstract: Deep learning-based models have dominated the current landscape of production recommender systems. Furthermore, recent years have witnessed an exponential growth of the model scale, from Google's 2016 model with 1 billion parameters to the latest Facebook model with 12 trillion parameters. A significant quality boost has come with each jump in model capacity, which makes us believe the era of 100 trillion parameters is around the corner. However, training such models is challenging even within industrial-scale data centers. The difficulty stems from the staggering heterogeneity of the training computation: the model's embedding layer can account for more than 99.99% of the total model size and is extremely memory-intensive, while the rest of the neural network is increasingly computation-intensive. To support the training of such huge models, an efficient distributed training system is urgently needed. In this paper, we resolve this challenge by carefully co-designing both the optimization algorithm and the distributed system architecture. Specifically, to ensure both training efficiency and training accuracy, we design a novel hybrid training algorithm in which the embedding layer and the dense neural network are handled by different synchronization mechanisms; we then build a system called Persia (short for parallel recommendation training system with hybrid acceleration) to support this hybrid training algorithm. Both theoretical analysis and empirical studies at up to 100 trillion parameters have been conducted to justify the system design and implementation of Persia. We make Persia publicly available (at https://github.com/PersiaML/Persia) so that anyone can easily train a recommender model at the scale of 100 trillion parameters.

Citations (27)

Summary

  • The paper presents a hybrid training algorithm that splits asynchronous and synchronous updates to optimize throughput and accuracy in massive-scale recommender systems.
  • It employs a heterogeneous architecture combining CPU and GPU resources, achieving up to 7.12x faster training compared to existing frameworks.
  • The study validates an open-source framework that democratizes high-capacity recommenders and inspires future research in efficient distributed training.

Overview of "Persia: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters"

This paper introduces "Persia," a sophisticated distributed training system designed to support deep learning-based recommender models at a scale reaching up to 100 trillion parameters. The authors provide a detailed account of the system's architecture, which addresses the challenges posed by the heterogeneity within such large-scale models. Current models have seen exponential growth, from billions to trillions of parameters, necessitating efficient and scalable distributed training solutions.

Key Contributions

  1. Hybrid Training Algorithm: Persia employs a novel hybrid training algorithm that splits the model into asynchronous and synchronous components. Specifically, the embedding layer, which contains the majority of the parameters, is trained asynchronously to maximize throughput, while the dense neural network layers are trained synchronously to maintain statistical efficiency.
  2. System Architecture: The system architecture is designed to handle the heterogeneous nature of recommender models efficiently. It utilizes a mix of CPU and GPU instances, taking advantage of the strengths of both. The architecture involves embedding parameter servers, embedding workers, and neural network workers, all operating under different paradigms tailored to their specific roles (see the sketch after this list).
  3. Scalability and Performance: Through rigorous evaluations, Persia demonstrated a significant increase in training speed compared to existing systems like XDL and PaddlePaddle, with speed-ups up to 7.12 times on benchmark tests. The system efficiently scales with increasing numbers of parameters, maintaining high throughput even at 100 trillion parameters.
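
To make the division of labor concrete, below is a minimal, illustrative Python sketch of the three component roles described above. The class and method names (EmbeddingParameterServer, EmbeddingWorker, NNWorker, lookup, apply_gradients) are hypothetical simplifications for exposition, not the actual Persia API.

```python
# Illustrative sketch of Persia-style component roles (hypothetical names,
# not the real Persia API). Embedding parameters live on CPU parameter
# servers and are updated asynchronously; dense parameters live on GPU
# workers and are updated synchronously.

from collections import defaultdict
import numpy as np


class EmbeddingParameterServer:
    """CPU-side shard of the huge, sparse embedding table (hypothetical)."""

    def __init__(self, dim: int, lr: float = 0.01):
        self.dim = dim
        self.lr = lr
        self.table = defaultdict(lambda: np.zeros(dim, dtype=np.float32))

    def lookup(self, ids):
        # Sparse read: only the rows touched by this batch are fetched.
        return np.stack([self.table[i] for i in ids])

    def apply_gradients(self, ids, grads):
        # Asynchronous update: applied as gradients arrive, without
        # waiting for other workers (bounded staleness).
        for i, g in zip(ids, grads):
            self.table[i] -= self.lr * g


class EmbeddingWorker:
    """CPU worker that turns sparse feature IDs into dense embeddings."""

    def __init__(self, ps: EmbeddingParameterServer):
        self.ps = ps

    def forward(self, sparse_ids):
        return self.ps.lookup(sparse_ids)


class NNWorker:
    """GPU worker that trains the dense network synchronously,
    e.g. with an all-reduce of dense gradients across GPU workers."""

    def train_step(self, embeddings, dense_features, labels):
        # Placeholder for the dense forward/backward pass plus a
        # synchronous all-reduce and optimizer step.
        ...
```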

Methodology

The paper details Persia's hybrid algorithm, which allows for different synchronization mechanisms tailored to specific parts of the model. The embedding layer undergoes asynchronous updates, benefiting from sparse access patterns and reduced communication overheads, while dense layers are updated synchronously to preserve learning accuracy.
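
As a concrete illustration of how the two synchronization regimes interleave within one iteration, here is a hedged sketch of a single training step. It reuses the hypothetical components sketched earlier and assumes a dense_model object exposing forward_backward() and allreduce_and_step(); none of these names are the actual Persia API.

```python
# Hedged sketch of a single hybrid training step (illustrative only).
# Assumes the hypothetical EmbeddingWorker / EmbeddingParameterServer
# from the earlier sketch, plus a dense_model with forward_backward()
# and allreduce_and_step(); not the paper's actual implementation.

def hybrid_training_step(batch, emb_worker, emb_ps, dense_model):
    sparse_ids, dense_features, labels = batch

    # Asynchronous path: fetch possibly slightly stale embeddings.
    # There is no global barrier; other workers may be updating the
    # embedding table concurrently.
    embeddings = emb_worker.forward(sparse_ids)

    # Synchronous path: dense forward/backward, with dense gradients
    # averaged across all GPU workers (e.g. via an all-reduce) before
    # the optimizer step, preserving statistical efficiency.
    loss, emb_grads = dense_model.forward_backward(
        embeddings, dense_features, labels
    )
    dense_model.allreduce_and_step()

    # Asynchronous path again: ship embedding gradients back to the
    # parameter servers, which apply them without waiting for a
    # global synchronization point.
    emb_ps.apply_gradients(sparse_ids, emb_grads)

    return loss
```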

System Design and Implementation

Persia's design revolves around its adaptable infrastructure, supporting both synchronous and asynchronous updates seamlessly. It leverages efficient memory management, optimized communication protocols, and enhanced fault tolerance mechanisms. These features are crucial for managing the vast scale of data and parameters while ensuring resilient operations in distributed environments.
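
As one example of the kind of memory management such a system needs, the sketch below shows an embedding store bounded by LRU eviction, a common technique for keeping a huge embedding table within a fixed memory budget. This is an illustrative sketch only, not Persia's actual implementation.

```python
# Illustrative sketch of bounded-memory embedding storage with LRU
# eviction (not Persia's actual implementation).

from collections import OrderedDict
import numpy as np


class LRUEmbeddingCache:
    def __init__(self, dim: int, capacity: int):
        self.dim = dim
        self.capacity = capacity
        self.rows = OrderedDict()  # feature id -> embedding vector

    def get(self, feature_id):
        if feature_id in self.rows:
            # Mark as most recently used.
            self.rows.move_to_end(feature_id)
            return self.rows[feature_id]
        # Cold ID: initialize (or fetch from slower storage) and insert.
        vec = np.zeros(self.dim, dtype=np.float32)
        self.rows[feature_id] = vec
        if len(self.rows) > self.capacity:
            # Evict the least recently used row to stay within budget.
            self.rows.popitem(last=False)
        return vec
```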

Theoretical and Empirical Evaluation

A theoretical analysis shows that the hybrid algorithm achieves convergence rates comparable to those of fully synchronous training, underscoring its efficiency. Empirically, the system was tested across various configurations and datasets, demonstrating that it maintains both high throughput and model accuracy.
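
For intuition, the shape of guarantee that such analyses typically establish can be written schematically as below, where T is the number of iterations and tau bounds the staleness of the asynchronous embedding updates. This is the generic form of a bounded-staleness SGD bound, not the paper's exact theorem statement.

```latex
% Schematic, generic bounded-staleness bound (not the paper's exact theorem):
% the average squared gradient norm decays at the usual SGD rate, plus a
% staleness term that vanishes faster as T grows.
\[
  \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\,\bigl\|\nabla f(w_t)\bigr\|^2
  \;\le\;
  O\!\left(\frac{1}{\sqrt{T}}\right)
  \;+\;
  O\!\left(\frac{\tau}{T}\right)
\]
```

Read this way, bounded staleness in the embedding updates contributes only a lower-order term, which is consistent with the paper's claim that the hybrid scheme matches the convergence behavior of synchronous training while running substantially faster.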

Implications and Future Directions

Practically, Persia can democratize access to high-capacity recommender systems, previously confined to the largest tech companies, due to its open-source availability. Theoretically, the work validates the effectiveness of hybrid algorithms in managing the training of massive models efficiently.

Looking forward, the paradigm of combining asynchronous and synchronous training might inspire further developments in distributed training architectures. Future research could explore optimizing communication strategies further, leveraging advancements in hardware, or addressing other types of neural network models beyond recommendation systems.

In conclusion, Persia represents a significant step forward in scalable distributed training, providing an open-source framework capable of training some of the largest models conceivable today.
