Hercules: Heterogeneity-Aware Inference Serving for At-Scale Personalized Recommendation (2203.07424v1)

Published 14 Mar 2022 in cs.DC

Abstract: Personalized recommendation is an important class of deep-learning applications that powers a large collection of internet services and consumes a considerable amount of datacenter resources. As the scale of production-grade recommendation systems continues to grow, optimizing their serving performance and efficiency in a heterogeneous datacenter is important and can translate into infrastructure capacity saving. In this paper, we propose Hercules, an optimized framework for personalized recommendation inference serving that targets diverse industry-representative models and cloud-scale heterogeneous systems. Hercules performs a two-stage optimization procedure - offline profiling and online serving. The first stage searches the large under-explored task scheduling space with a gradient-based search algorithm achieving up to 9.0x latency-bounded throughput improvement on individual servers; it also identifies the optimal heterogeneous server architecture for each recommendation workload. The second stage performs heterogeneity-aware cluster provisioning to optimize resource mapping and allocation in response to fluctuating diurnal loads. The proposed cluster scheduler in Hercules achieves 47.7% cluster capacity saving and reduces the provisioned power by 23.7% over a state-of-the-art greedy scheduler.
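To make the first (offline) stage concrete, the abstract's task-scheduling search can be pictured as a local search over per-server scheduling knobs that maximizes latency-bounded throughput. The sketch below is a minimal, hypothetical illustration in Python: the knobs (`threads`, `batch_size`), the synthetic `measure_qps` profiler, and the first-improvement hill climb are assumptions standing in for Hercules' gradient-based search, not the paper's actual interface.

```python
# Hypothetical sketch of an offline scheduling search: greedily climb toward
# higher latency-bounded throughput over a small space of scheduling knobs.
# All names and the synthetic cost model are illustrative assumptions.

def measure_qps(config, latency_sla_ms=10.0):
    """Toy stand-in for profiling a recommendation model under `config`.
    Returns queries/sec if the modeled latency fits the SLA, else 0."""
    threads, batch = config["threads"], config["batch_size"]
    latency_ms = 0.5 * batch / threads + 1.0          # synthetic latency model
    qps = 1000.0 * batch / (latency_ms * (1 + 0.02 * threads))
    return qps if latency_ms <= latency_sla_ms else 0.0

def neighbors(config, space):
    """Yield configs differing from `config` in exactly one knob by one step."""
    for knob, values in space.items():
        i = values.index(config[knob])
        for j in (i - 1, i + 1):
            if 0 <= j < len(values):
                yield {**config, knob: values[j]}

def local_search(space, start):
    """Hill-climb until no single-knob move improves latency-bounded
    throughput (a simple stand-in for the paper's gradient-based search)."""
    best, best_qps = dict(start), measure_qps(start)
    improved = True
    while improved:
        improved = False
        for cand in neighbors(best, space):
            qps = measure_qps(cand)
            if qps > best_qps:
                best, best_qps, improved = cand, qps, True
    return best, best_qps

if __name__ == "__main__":
    space = {"threads": [1, 2, 4, 8, 16, 32],
             "batch_size": [16, 32, 64, 128, 256, 512]}
    start = {"threads": 1, "batch_size": 16}
    config, qps = local_search(space, start)
    print(f"best config: {config}, latency-bounded throughput: {qps:.0f} qps")
```

In the paper's framing, per-server optima found this way (and the matching server architecture per workload) would then feed the second stage, where the cluster scheduler provisions heterogeneous servers against fluctuating diurnal load.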

Authors (6)
  1. Liu Ke (7 papers)
  2. Udit Gupta (30 papers)
  3. Mark Hempstead (4 papers)
  4. Carole-Jean Wu (62 papers)
  5. Hsien-Hsin S. Lee (16 papers)
  6. Xuan Zhang (183 papers)
Citations (17)
