
Optimization and analysis of large scale data sorting algorithm based on Hadoop (1506.00449v1)

Published 1 Jun 2015 in cs.DC

Abstract: When sorting massive data, we usually use Hadoop, a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. A common approach to implementing big-data sorting is to use the shuffle and sort phase of MapReduce on Hadoop. However, if we use it directly, efficiency can be very low and load imbalance can be a serious problem. In this paper we carry out an experimental study of the optimization and analysis of a large-scale data sorting algorithm based on Hadoop. To achieve this optimization, we use more than two rounds of MapReduce. In the first round, a MapReduce job draws a random sample of the data. Then another MapReduce job partitions and orders the data uniformly, according to the results of the first round. If a partition is still too large, the process returns to the first round and repeats. The experiments show that the optimized algorithm sorts large-scale data better than the plain shuffle of MapReduce.
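The two-round scheme the abstract describes resembles sampling-based range partitioning (as in TeraSort): round one samples the data to derive splitter keys, round two routes records into key ranges so each partition can be sorted independently. A minimal single-machine sketch of that idea, with the function names, partition count, and sample size being illustrative assumptions rather than details from the paper:

```python
import random
import bisect

def sample_boundaries(data, num_partitions, sample_size=1000):
    # Round 1 (sketch): draw a random sample and pick evenly spaced
    # splitters, analogous to the paper's sampling MapReduce job.
    sample = sorted(random.sample(data, min(sample_size, len(data))))
    step = len(sample) // num_partitions
    return [sample[i * step] for i in range(1, num_partitions)]

def range_partition_sort(data, boundaries):
    # Round 2 (sketch): route each record to a partition by comparing
    # against the splitters (the "uniform ordering" step), then sort
    # each partition locally.
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for x in data:
        partitions[bisect.bisect_right(boundaries, x)].append(x)
    return [sorted(p) for p in partitions]

data = [random.randint(0, 10**6) for _ in range(10000)]
bounds = sample_boundaries(data, num_partitions=4)
parts = range_partition_sort(data, bounds)
# Concatenating the sorted partitions in order yields globally sorted data.
merged = [x for p in parts for x in p]
```

Because every key in partition i is no greater than every key in partition i+1, the partitions can be sorted in parallel (on separate reducers, in the Hadoop setting) and simply concatenated, which is what makes the balanced splitters from the sampling round matter.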

Citations (3)
