
Interactive Analytical Processing in Big Data Systems: A Cross-Industry Study of MapReduce Workloads (1208.4174v1)

Published 21 Aug 2012 in cs.DB

Abstract: Within the past few years, organizations in diverse industries have adopted MapReduce-based systems for large-scale data processing. Along with these new users, important new workloads have emerged which feature many small, short, and increasingly interactive jobs in addition to the large, long-running batch jobs for which MapReduce was originally designed. As interactive, large-scale query processing is a strength of the RDBMS community, it is important that lessons from that field be carried over and applied where possible in this new domain. However, these new workloads have not yet been described in the literature. We fill this gap with an empirical analysis of MapReduce traces from six separate business-critical deployments inside Facebook and at Cloudera customers in e-commerce, telecommunications, media, and retail. Our key contribution is a characterization of new MapReduce workloads which are driven in part by interactive analysis, and which make heavy use of query-like programming frameworks on top of MapReduce. These workloads display diverse behaviors which invalidate prior assumptions about MapReduce such as uniform data access, regular diurnal patterns, and prevalence of large jobs. A secondary contribution is a first step towards creating a TPC-like data processing benchmark for MapReduce.


Summary

  • The paper characterizes novel interactive and semi-streaming MapReduce workloads that handle smaller data sizes and short job durations, challenging conventional batch paradigms.
  • It analyzes over a year of production data from diverse industries to identify skewed data access patterns and bursty job submission rates.
  • The research shows that query-like frameworks such as Hive and Pig drive significant activity, underscoring the need for optimized system designs and storage strategies.

Overview of Interactive Analytical Processing in Big Data Systems: A Cross-Industry Study of MapReduce Workloads

The paper by Chen, Alspaugh, and Katz presents an empirical analysis of MapReduce workloads across diverse industries. It examines emerging workload behaviors by analyzing production traces from Facebook and from Cloudera customers in e-commerce, telecommunications, media, and retail. The research highlights new, interactive workloads and revisits conventional assumptions about how MapReduce is used.

Key Contributions and Findings

  1. Workload Characterization:
    • The paper identifies a new class of MapReduce workloads dominated by interactive and semi-streaming jobs, in contrast to the large, long-running batch jobs for which MapReduce was originally designed. These jobs process smaller inputs (MB to GB) and finish quickly, reflecting demand for interactive analysis.
    • The diversity observed across these workloads invalidates prior assumptions about MapReduce, such as uniform data access, regular diurnal patterns, and the prevalence of large jobs.
  2. Industry Application:
    • By analyzing over a year's worth of traces from six production deployments across several industries, the study extends the understanding of MapReduce usage beyond a single company or sector and helps identify use cases across domains.
  3. Data Access Patterns:
    • Data access frequencies are skewed, often following a Zipf distribution. Most jobs access small files, which suggests that a tiered storage or cache architecture could improve performance.
  4. Temporal and Capacity Considerations:
    • All analyzed workloads show bursty, unpredictable job submission rates, as indicated by high peak-to-median ratios. This burstiness challenges system designers to handle load fluctuations and to provision capacity in multi-tenant environments.
  5. Computation Patterns:
    • Several frameworks run on top of MapReduce, and a significant portion of workload activity comes from query-like frameworks such as Hive and Pig. This underscores the need for optimized support for programming frameworks that process data interactively.
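
To make the burstiness measure in item 4 concrete, the following is a minimal sketch of how a peak-to-median ratio could be computed from job submission timestamps. The function name `peak_to_median`, the one-hour window, and the sample trace are assumptions made for illustration; they are not taken from the paper, which may bin and normalize its traces differently.

```python
from collections import Counter
from statistics import median


def peak_to_median(submit_times_sec, window_sec=3600):
    """Return (peak jobs per window) / (median jobs per window).

    Windows with no submissions between the first and last job are
    counted as zero so that idle periods lower the median.
    """
    if not submit_times_sec:
        raise ValueError("need at least one submission time")
    # Count how many jobs fall into each fixed-size time window.
    counts = Counter(int(t // window_sec) for t in submit_times_sec)
    first, last = min(counts), max(counts)
    per_window = [counts.get(w, 0) for w in range(first, last + 1)]
    med = median(per_window)
    if med == 0:
        # More than half of the windows are idle: the ratio is unbounded.
        return float("inf")
    return max(per_window) / med


# Hypothetical trace (seconds since start): hourly counts are 3, 2, 8, 2.
trace = [0, 10, 20, 3600, 3700,
         7200, 7201, 7202, 7203, 7204, 7205, 7206, 7207,
         10800, 10900]
ratio = peak_to_median(trace)  # peak 8 over median 2.5
```

A ratio close to 1 would indicate a steady submission rate; larger values indicate that provisioning for the median load leaves the cluster overloaded at peak.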

Implications and Future Directions

This paper's findings challenge traditional perspectives on MapReduce-oriented systems and suggest several research and engineering directions:

  • System Design and Optimization: The prevalence of interactive workloads calls for re-evaluating storage strategies (e.g., cache policies that exploit skewed data access) and for optimizing query-like frameworks.
  • Benchmark Development: The complexity of real-world workloads underscores the need for a comprehensive, TPC-style benchmark that represents a wide range of operating conditions. Such a benchmark would support performance comparison and system evaluation across industries.
  • Theoretical and Practical Expansion: Connecting MapReduce workloads to traditional RDBMS concepts opens avenues for applying existing query optimization techniques to big data systems, which could improve efficiency and response times in large-scale data processing.
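
As a rough illustration of why skewed access makes caching attractive, the sketch below compares the hit rate of a small cache under Zipf-like and uniform synthetic access streams. Everything here is an assumption for illustration: LRU is a stand-in eviction policy rather than one the paper prescribes, and the helper names, file count, cache size, and exponent are hypothetical rather than measured values.

```python
import random
from collections import OrderedDict


def lru_hit_rate(accesses, capacity):
    """Replay a sequence of file IDs through an LRU cache; return the hit rate."""
    if not accesses:
        raise ValueError("need at least one access")
    cache = OrderedDict()
    hits = 0
    for key in accesses:
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(accesses)


def synthetic_accesses(n_files, n_accesses, exponent, seed=0):
    """Draw file IDs with probability proportional to 1 / rank**exponent.

    exponent=0 gives uniform access; exponent=1 gives a Zipf-like skew.
    """
    rng = random.Random(seed)
    weights = [1.0 / (rank ** exponent) for rank in range(1, n_files + 1)]
    return rng.choices(range(n_files), weights=weights, k=n_accesses)


# Hypothetical setup: 1000 files, cache holds 10% of them.
skewed = synthetic_accesses(1000, 20000, exponent=1.0)
uniform = synthetic_accesses(1000, 20000, exponent=0.0)
skewed_rate = lru_hit_rate(skewed, capacity=100)
uniform_rate = lru_hit_rate(uniform, capacity=100)
print(f"Zipf-like: {skewed_rate:.2f}, uniform: {uniform_rate:.2f}")
```

Under these assumptions the skewed stream should see a much higher hit rate than the uniform one for the same cache size, which is the intuition behind tiered storage for workloads with skewed access.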

In conclusion, this cross-industry study contributes significantly to understanding the evolving landscape of MapReduce workloads. By showing that established assumptions no longer hold, it lays a foundation for system designs and methodologies that match real-world requirements in data-intensive environments. Future work should extend the empirical analysis and workload modeling to refine benchmarking strategies and improve resource management across industrial applications.