Analyzing Large-Scale, Distributed and Uncertain Data (1712.01817v1)

Published 5 Dec 2017 in cs.DB

Abstract: The exponential growth of data in recent years and the demand to gain information and knowledge from it present new challenges for database researchers. Existing database systems and algorithms are no longer capable of effectively handling such large data sets. MapReduce is a novel programming paradigm for processing distributable problems over large-scale data using a computer cluster. In this work we explore the MapReduce paradigm from three different angles. We begin by examining a well-known problem in data mining: mining closed frequent itemsets over a large dataset. By harnessing the power of MapReduce, we present a novel algorithm for mining closed frequent itemsets that outperforms existing algorithms. Next, we explore one of the fundamental implications of "Big Data": the data is not known with complete certainty. A probabilistic database is a relational database in which each tuple is associated with a probability of its existence. A natural development of MapReduce is a distributed relational database management system, where relational calculus is reduced to a combination of map and reduce functions. We take this development a step further by proposing a query optimizer over a distributed, probabilistic database. Finally, we analyze the best-known implementation of MapReduce, Hadoop, aiming to overcome one of its major drawbacks: it does not directly support the explicit specification of data that is processed repeatedly across different jobs. Many data-mining algorithms, such as clustering and association-rule mining, require iterative computation: the same data are processed again and again until the computation converges or a stopping condition is satisfied. We propose a modification to Hadoop so that it supports efficient access to the same data across different jobs.
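The abstract treats the map/reduce function pair as the basic unit of computation. As a rough orientation only, the sketch below shows how support counting for small itemsets (a building block of frequent-itemset mining) fits that pattern; it is a minimal in-process illustration, not the paper's closed-frequent-itemset algorithm, and the names `map_phase`, `reduce_phase`, and `run_mapreduce` are hypothetical rather than taken from the paper or from Hadoop.

```python
# Minimal in-process sketch of the MapReduce pattern applied to itemset
# support counting. A real deployment would run the map and reduce phases
# on a cluster (e.g. Hadoop); here the shuffle/group step is simulated locally.
from collections import defaultdict
from itertools import combinations

def map_phase(transaction, max_size=2):
    """Emit (itemset, 1) pairs for every itemset of size <= max_size in one transaction."""
    for size in range(1, max_size + 1):
        for itemset in combinations(sorted(transaction), size):
            yield itemset, 1

def reduce_phase(itemset, counts):
    """Sum the partial counts for one itemset, giving its support."""
    return itemset, sum(counts)

def run_mapreduce(transactions, min_support=2):
    # Shuffle step: group intermediate values by key, as a MapReduce runtime would.
    grouped = defaultdict(list)
    for transaction in transactions:
        for key, value in map_phase(transaction):
            grouped[key].append(value)
    # Reduce step: keep only itemsets that meet the support threshold.
    result = {}
    for itemset, counts in grouped.items():
        key, support = reduce_phase(itemset, counts)
        if support >= min_support:
            result[key] = support
    return result

if __name__ == "__main__":
    data = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}]
    print(run_mapreduce(data))  # e.g. {('bread',): 3, ('milk',): 2, ('bread', 'milk'): 2, ...}
```

The same structure is what makes iterative workloads expensive on stock Hadoop: each pass re-reads the input from scratch, which is the drawback the third part of the work targets.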

Authors (1)
  1. Yaron Gonen (1 paper)
