Hercules: Heterogeneity-Aware Inference Serving for At-Scale Personalized Recommendation (2203.07424v1)

Published 14 Mar 2022 in cs.DC

Abstract: Personalized recommendation is an important class of deep-learning applications that powers a large collection of internet services and consumes a considerable amount of datacenter resources. As the scale of production-grade recommendation systems continues to grow, optimizing their serving performance and efficiency in a heterogeneous datacenter is important and can translate into infrastructure capacity saving. In this paper, we propose Hercules, an optimized framework for personalized recommendation inference serving that targets diverse industry-representative models and cloud-scale heterogeneous systems. Hercules performs a two-stage optimization procedure - offline profiling and online serving. The first stage searches the large under-explored task scheduling space with a gradient-based search algorithm achieving up to 9.0x latency-bounded throughput improvement on individual servers; it also identifies the optimal heterogeneous server architecture for each recommendation workload. The second stage performs heterogeneity-aware cluster provisioning to optimize resource mapping and allocation in response to fluctuating diurnal loads. The proposed cluster scheduler in Hercules achieves 47.7% cluster capacity saving and reduces the provisioned power by 23.7% over a state-of-the-art greedy scheduler.
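To make the first (offline) stage concrete, the abstract's task-scheduling search can be pictured as a local search over per-server scheduling knobs that maximizes latency-bounded throughput. The sketch below is a minimal, hypothetical illustration in Python: the knobs (`threads`, `batch_size`), the synthetic `measure_qps` profiler, and the first-improvement hill climb are assumptions standing in for Hercules' gradient-based search, not the paper's actual interface.

```python
# Hypothetical sketch of an offline scheduling search: greedily climb toward
# higher latency-bounded throughput over a small space of scheduling knobs.
# All names and the synthetic cost model are illustrative assumptions.

def measure_qps(config, latency_sla_ms=10.0):
    """Toy stand-in for profiling a recommendation model under `config`.
    Returns queries/sec if the modeled latency fits the SLA, else 0."""
    threads, batch = config["threads"], config["batch_size"]
    latency_ms = 0.5 * batch / threads + 1.0          # synthetic latency model
    qps = 1000.0 * batch / (latency_ms * (1 + 0.02 * threads))
    return qps if latency_ms <= latency_sla_ms else 0.0

def neighbors(config, space):
    """Yield configs differing from `config` in exactly one knob by one step."""
    for knob, values in space.items():
        i = values.index(config[knob])
        for j in (i - 1, i + 1):
            if 0 <= j < len(values):
                yield {**config, knob: values[j]}

def local_search(space, start):
    """Hill-climb until no single-knob move improves latency-bounded
    throughput (a simple stand-in for the paper's gradient-based search)."""
    best, best_qps = dict(start), measure_qps(start)
    improved = True
    while improved:
        improved = False
        for cand in neighbors(best, space):
            qps = measure_qps(cand)
            if qps > best_qps:
                best, best_qps, improved = cand, qps, True
    return best, best_qps

if __name__ == "__main__":
    space = {"threads": [1, 2, 4, 8, 16, 32],
             "batch_size": [16, 32, 64, 128, 256, 512]}
    start = {"threads": 1, "batch_size": 16}
    config, qps = local_search(space, start)
    print(f"best config: {config}, latency-bounded throughput: {qps:.0f} qps")
```

In the paper's framing, per-server optima found this way (and the matching server architecture per workload) would then feed the second stage, where the cluster scheduler provisions heterogeneous servers against fluctuating diurnal load.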

Authors (6)
  1. Liu Ke (7 papers)
  2. Udit Gupta (30 papers)
  3. Mark Hempstead (4 papers)
  4. Carole-Jean Wu (62 papers)
  5. Hsien-Hsin S. Lee (16 papers)
  6. Xuan Zhang (183 papers)
Citations (17)
