DeepRecSys: A System for Optimizing End-To-End At-scale Neural Recommendation Inference (2001.02772v1)

Published 8 Jan 2020 in cs.DC

Abstract: Neural personalized recommendation is the corner-stone of a wide collection of cloud services and products, constituting significant compute demand of the cloud infrastructure. Thus, improving the execution efficiency of neural recommendation directly translates into infrastructure capacity saving. In this paper, we devise a novel end-to-end modeling infrastructure, DeepRecInfra, that adopts an algorithm and system co-design methodology to custom-design systems for recommendation use cases. Leveraging the insights from the recommendation characterization, a new dynamic scheduler, DeepRecSched, is proposed to maximize latency-bounded throughput by taking into account characteristics of inference query size and arrival patterns, recommendation model architectures, and underlying hardware systems. By doing so, system throughput is doubled across the eight industry-representative recommendation models. Finally, design, deployment, and evaluation in at-scale production datacenter shows over 30% latency reduction across a wide variety of recommendation models running on hundreds of machines.

Authors (9)
  1. Udit Gupta (30 papers)
  2. Samuel Hsia (9 papers)
  3. Vikram Saraph (9 papers)
  4. Xiaodong Wang (229 papers)
  5. Brandon Reagen (39 papers)
  6. Gu-Yeon Wei (54 papers)
  7. Hsien-Hsin S. Lee (16 papers)
  8. David Brooks (204 papers)
  9. Carole-Jean Wu (62 papers)
Citations (177)

Summary

  • The paper presents DeepRecInfra, an end-to-end framework that models realistic query patterns and balances request and batch parallelism for neural recommendation systems.
  • DeepRecSched exploits hardware offloading and dynamic batch sizing to double throughput and achieve up to 2.9× improvements in power efficiency.
  • The research underscores the critical balance between performance and power efficiency in scaling personalized recommendation inference on cloud infrastructures.

An Overview of DeepRecSys: Enhancing End-To-End Neural Recommendation Inference

The paper "DeepRecSys: A System for Optimizing End-To-End At-scale Neural Recommendation Inference" presents a comprehensive framework, DeepRecInfra, aimed at improving the execution efficiency of neural-based personalized recommendation systems. This enhancement is crucial given that recommendation algorithms necessitate substantial computational resources in cloud infrastructures, thus representing a significant area for optimization.

Introduction to Neural Recommendation Models

Recommendation systems are prevalent across online platforms, enhancing user experience by personalizing content. The shift from simple rule-based methods to deep learning models has improved accuracy by leveraging both dense and sparse input features: dense features are continuous inputs processed through multi-layer perceptrons (MLPs), whereas sparse features encode categorical data and are processed via embedding-table lookups followed by pooling operations (see the sketch below).
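
To make the two feature paths concrete, here is a minimal sketch of a DLRM-style model in PyTorch. The layer sizes, table counts, and embedding dimensions are illustrative assumptions, not the configurations of the paper's eight models.

```python
# Minimal DLRM-style sketch: dense features -> bottom MLP; sparse features ->
# per-feature embedding tables with sum-pooling; both -> top MLP.
# All dimensions below are illustrative assumptions.
import torch
import torch.nn as nn

class TinyRecModel(nn.Module):
    def __init__(self, num_dense=13, num_tables=8, rows_per_table=10_000, emb_dim=32):
        super().__init__()
        # Dense (continuous) inputs pass through a bottom MLP.
        self.bottom_mlp = nn.Sequential(
            nn.Linear(num_dense, 64), nn.ReLU(), nn.Linear(64, emb_dim), nn.ReLU()
        )
        # Sparse (categorical) inputs: one embedding table per feature,
        # sum-pooled over each sample's multi-hot lookup indices.
        self.tables = nn.ModuleList(
            nn.EmbeddingBag(rows_per_table, emb_dim, mode="sum")
            for _ in range(num_tables)
        )
        # Pooled embeddings and the dense representation feed a top MLP
        # that produces a click-through probability.
        self.top_mlp = nn.Sequential(
            nn.Linear(emb_dim * (num_tables + 1), 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, dense, sparse_ids, sparse_offsets):
        feats = [self.bottom_mlp(dense)]
        for table, (ids, offsets) in zip(self.tables, zip(sparse_ids, sparse_offsets)):
            feats.append(table(ids, offsets))
        return torch.sigmoid(self.top_mlp(torch.cat(feats, dim=1)))
```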

DeepRecInfra: Infrastructure Design

DeepRecInfra is developed to provide an end-to-end modeling infrastructure that captures diverse characteristics of real-world recommendation models and their execution patterns. Critical components of the infrastructure include:

  • Recommendation Models: Eight industry-representative models are featured, each with a distinct architecture spanning its MLPs for dense inputs, embedding tables, feature-pooling operators, and top predictive stack.
  • Tail Latency Targets: The infrastructure respects application-specific tail latency targets, reflecting the varying service-level agreements (SLAs) of different recommendation services.
  • Query Patterns: DeepRecInfra models realistic query arrival rates using a Poisson distribution, together with the query size distributions observed in production, which exhibit characteristics distinct from traditional web-service traffic (a load-generator sketch follows this list).
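
The following load-generator sketch is in the spirit of DeepRecInfra's query model: Poisson arrivals via exponential inter-arrival gaps, paired with a per-query size draw. The lognormal size parameters are illustrative assumptions, not the production distribution reported in the paper.

```python
# Sketch of a query load generator: Poisson arrival process plus a
# heavy-tailed query-size draw. Parameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def generate_queries(qps=500.0, duration_s=10.0):
    """Yield (arrival_time, num_items) pairs for one simulated run."""
    t = 0.0
    while t < duration_s:
        t += rng.exponential(1.0 / qps)                  # Poisson process: exp. gaps
        size = int(rng.lognormal(mean=4.0, sigma=1.0))   # stand-in size distribution
        yield t, max(size, 1)

for arrival, items in list(generate_queries(duration_s=0.01)):
    print(f"t={arrival:.4f}s  items={items}")
```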

Optimizing Performance with DeepRecSched

DeepRecSched, implemented on DeepRecInfra, targets system throughput maximization under strict latency constraints through two primary optimization strategies:

  1. Request vs. Batch Parallelism: By splitting queries into smaller batches, DeepRecSched exploits greater parallelism across available CPU cores, balancing request-level against batch-level parallelism while avoiding the cache contention that overly aggressive splitting can introduce.
  2. Hardware Offloading: Sufficiently large queries are offloaded to specialized hardware accelerators such as GPUs, whose computational throughput reduces processing time at the cost of higher power consumption, so the offloading threshold must be configured dynamically (see the sketch after this list).
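
The sketch below illustrates these two knobs: a per-request batch size for CPU fan-out and a query-size threshold for accelerator offload. The threshold values and splitting policy are assumptions for illustration, not the paper's exact tuning algorithm.

```python
# Illustrative two-knob scheduler: split small queries into batches across
# CPU cores; offload large queries to an accelerator. Thresholds are assumed.
from concurrent.futures import ThreadPoolExecutor

def schedule_query(query_items, batch_size, gpu_threshold, cpu_pool, run_cpu, run_gpu):
    """Split a query into batches, or offload the whole query if it is large."""
    if len(query_items) > gpu_threshold:
        return run_gpu(query_items)          # large query -> accelerator
    # Smaller batches fan out across CPU cores, trading per-batch work
    # for request-level parallelism.
    batches = [query_items[i:i + batch_size]
               for i in range(0, len(query_items), batch_size)]
    futures = [cpu_pool.submit(run_cpu, b) for b in batches]
    return [item for f in futures for item in f.result()]

# Toy usage with stand-in inference functions.
pool = ThreadPoolExecutor(max_workers=8)
run_cpu = lambda batch: [x * 2 for x in batch]
run_gpu = lambda items: [x * 2 for x in items]
out = schedule_query(list(range(300)), batch_size=64, gpu_threshold=256,
                     cpu_pool=pool, run_cpu=run_cpu, run_gpu=run_gpu)
```

In a real deployment, the batch size and offload threshold would be tuned online against the observed tail-latency target rather than fixed as constants.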

Empirical Results and Implications

DeepRecSched achieves significant improvements over static scheduling configurations, doubling system throughput under strict tail-latency targets and improving power efficiency by up to 2.9×. Deployment in an at-scale production datacenter further shows over 30% latency reduction across a wide variety of recommendation models running on hundreds of machines.

In conclusion, the research addresses a crucial aspect of at-scale recommendation serving through a novel modeling infrastructure and scheduler design. Although GPUs deliver performance benefits, the paper stresses that performance must be balanced against power efficiency across diverse model architectures and operating conditions. The work lays the groundwork for future exploration of specialized hardware configurations and further efficiency gains in AI infrastructure serving recommendation workloads.