Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
102 tokens/sec
GPT-4o
59 tokens/sec
Gemini 2.5 Pro Pro
43 tokens/sec
o3 Pro
6 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

LRC: Dependency-Aware Cache Management for Data Analytics Clusters (1703.08280v1)

Published 24 Mar 2017 in cs.DC

Abstract: Memory caches are being aggressively used in today's data-parallel systems such as Spark, Tez, and Piccolo. However, prevalent systems employ rather simple cache management policies--notably the Least Recently Used (LRU) policy--that are oblivious to the application semantics of data dependency, expressed as a directed acyclic graph (DAG). Without this knowledge, memory caching can at best be performed by "guessing" the future data access patterns based on historical information (e.g., the access recency and/or frequency), which frequently results in inefficient, erroneous caching with low hit ratio and a long response time. In this paper, we propose a novel cache replacement policy, Least Reference Count (LRC), which exploits the application-specific DAG information to optimize the cache management. LRC evicts the cached data blocks whose reference count is the smallest. The reference count is defined, for each data block, as the number of dependent child blocks that have not been computed yet. We demonstrate the efficacy of LRC through both empirical analysis and cluster deployments against popular benchmarking workloads. Our Spark implementation shows that, compared with LRU, LRC speeds up typical applications by 60%.

User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (4)
  1. Yinghao Yu (8 papers)
  2. Wei Wang (1793 papers)
  3. Jun Zhang (1008 papers)
  4. Khaled Ben Letaief (27 papers)
Citations (43)