Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
184 tokens/sec
GPT-4o
7 tokens/sec
Gemini 2.5 Pro Pro
45 tokens/sec
o3 Pro
4 tokens/sec
GPT-4.1 Pro
38 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Towards Zero-Overhead Adaptive Indexing in Hadoop (1212.3480v1)

Published 14 Dec 2012 in cs.DB and cs.DC

Abstract: Several research works have focused on supporting index access in MapReduce systems. These works have allowed users to significantly speed up selective MapReduce jobs by orders of magnitude. However, all these proposals require users to create indexes upfront, which might be a difficult task in certain applications (such as in scientific and social applications) where workloads are evolving or hard to predict. To overcome this problem, we propose LIAH (Lazy Indexing and Adaptivity in Hadoop), a parallel, adaptive approach for indexing at minimal costs for MapReduce systems. The main idea of LIAH is to automatically and incrementally adapt to users' workloads by creating clustered indexes on HDFS data blocks as a byproduct of executing MapReduce jobs. Besides distributing indexing efforts over multiple computing nodes, LIAH also parallelises indexing with both map tasks computation and disk I/O. All this without any additional data copy in main memory and with minimal synchronisation. The beauty of LIAH is that it piggybacks index creation on map tasks, which read relevant data from disk to main memory anyways. Hence, LIAH does not introduce any additional read I/O-costs and exploit free CPU cycles. As a result and in contrast to existing adaptive indexing works, LIAH has a very low (or invisible) indexing overhead, usually for the very first job. Still, LIAH can quickly converge to a complete index, i.e. all HDFS data blocks are indexed. Especially, LIAH can trade early job runtime improvements with fast complete index convergence. We compare LIAH with HAIL, a state-of-the-art indexing technique, as well as with standard Hadoop with respect to indexing overhead and workload performance.

Citations (25)

Summary

We haven't generated a summary for this paper yet.