Performance of Distributed File Systems on Cloud Computing Environment: An Evaluation for Small-File Problem (2312.17524v1)
Abstract: Various performance characteristics of distributed file systems have been well studied. However, their efficiency on small-file workloads driven by complex machine learning algorithms has not been well addressed. In addition, demand for unified storage serving both big data processing and high-performance computing has become crucial, so developing a solution that combines high-performance computing and big data over shared storage is very important. This paper focuses on the performance efficiency of distributed file systems on small-file datasets. We propose an architecture combining high-performance computing and big data over shared storage and perform a series of experiments to investigate the performance of these distributed file systems. The experimental results confirm the applicability of the proposed architecture to complex machine learning algorithms.
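The small-file problem the abstract refers to stems from per-file metadata overhead: in systems like HDFS, every file occupies a fixed-size record in the NameNode regardless of its payload, so millions of tiny files exhaust metadata capacity long before disk space. A common mitigation, described in the Cloudera "Small Files Problem" reference below, is to pack many small files into one large container with an index. The following is a minimal conceptual sketch in plain Python; the container format and function names are illustrative assumptions, not Hadoop's actual SequenceFile or HAR API.

```python
import io

def pack(files):
    """Pack a {name: bytes} mapping into one container blob plus an index.

    Mimics the SequenceFile/HAR idea: one large object holds many small
    payloads, so a NameNode-style catalog tracks 1 entry instead of
    len(files) entries.
    """
    index = {}          # name -> (offset, length) within the container
    buf = io.BytesIO()
    for name, data in files.items():
        index[name] = (buf.tell(), len(data))
        buf.write(data)
    return buf.getvalue(), index

def read(container, index, name):
    """Random access to one packed small file via the index."""
    offset, length = index[name]
    return container[offset:offset + length]

if __name__ == "__main__":
    # 1000 ten-byte files collapse into a single stored object.
    small_files = {f"img_{i}.bin": bytes([i % 256]) * 10 for i in range(1000)}
    blob, idx = pack(small_files)
    assert read(blob, idx, "img_7.bin") == b"\x07" * 10
    print(len(small_files), "files packed into 1 container of", len(blob), "bytes")
```

The trade-off is the one the cited experiments probe: packing restores sequential I/O and shrinks metadata load, but random access now costs an index lookup plus a ranged read inside the container.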
- G. C. Fox, J. Qiu, S. Kamburugamuve, S. Jha, and A. Luckow, “HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack,” in 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid). IEEE, pp. 1057–1066.
- D. Moise, “Experiences with Performing MapReduce Analysis of Scientific Data on HPC Platforms,” in DIDC ’16: Proceedings of the ACM International Workshop on Data-Intensive Distributed Computing. New York, New York, USA: ACM, Jun. 2016, pp. 11–18.
- D. Zhao, Z. Zhang, X. Zhou, T. Li, K. Wang, D. Kimpe, P. Carns, R. Ross, and I. Raicu, “FusionFS: Toward supporting data-intensive scientific applications on extreme-scale high-performance computing systems,” in 2014 IEEE International Conference on Big Data (Big Data). IEEE, pp. 61–70.
- A. Bhat, N. S. Islam, X. Lu, M. Wasi-ur Rahman, D. Shankar, and D. K. Panda, “A Plugin-Based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS,” in Big Data Benchmarks, Performance Optimization, and Emerging Hardware. Cham: Springer International Publishing, Jan. 2016, pp. 119–132.
- P. Xuan, J. Denton, P. K. Srimani, R. Ge, and F. Luo, “Big data analytics on traditional HPC infrastructure using two-level storage,” in DISCS ’15: Proceedings of the 2015 International Workshop on Data-Intensive Scalable Computing Systems. New York, New York, USA: ACM Press, 2015, pp. 1–8.
- B. T. Rao and L. S. S. Reddy, “Survey on Improved Scheduling in Hadoop MapReduce in Cloud Environments,” arXiv.org, Jul. 2012.
- D. Borthakur, “The hadoop distributed file system: Architecture and design,” Hadoop Project Website, 2007.
- M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: cluster computing with working sets,” in HotCloud ’10: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing. USENIX Association, Jun. 2010, pp. 10–10.
- N. Chaimov, A. Malony, S. Canon, C. Iancu, K. Z. Ibrahim, and J. Srinivasan, “Scaling Spark on HPC Systems,” in HPDC ’16: Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing. New York, New York, USA: ACM, May 2016, pp. 97–110.
- M. Díaz, C. Martín, and B. Rubio, “State-of-the-art, challenges, and open issues in the integration of Internet of things and cloud computing,” Journal of Network and Computer Applications, vol. 67, no. C, pp. 99–117, May 2016.
- Lustre to DAOS: Machine Learning on Intel’s Platform.
- Spider – the Center-Wide Lustre File System.
- W. Yu, R. Noronha, S. Liang, and D. K. Panda, “Benefits of high speed interconnects to cluster file systems: a case study with lustre,” in IPDPS ’06: Proceedings of the 20th International Conference on Parallel and Distributed Processing. IEEE Computer Society, Apr. 2006, pp. 273–273.
- J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, Jan. 2008.
- M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, “Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling,” in EuroSys ’10: Proceedings of the 5th European Conference on Computer Systems. New York, New York, USA: ACM, Apr. 2010, pp. 265–278.
- M. Wasi-ur Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, “High-Performance Design of YARN MapReduce on Modern HPC Clusters with Lustre and RDMA,” in IPDPS ’15: Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium. IEEE Computer Society, May 2015, pp. 291–300.
- T. Zhao, Z. Zhang, and X. Ao, “Application Performance Analysis of Distributed File Systems under Cloud Computing Environment,” Information Science and Control, pp. 152–155, 2015.
- X. Lu, M. W. U. Rahman, N. Islam, D. Shankar, and D. K. Panda, “Accelerating Spark with RDMA for Big Data Processing: Early Experiences,” in HOTI ’14: Proceedings of the 2014 IEEE 22nd Annual Symposium on High-Performance Interconnects. IEEE Computer Society, Aug. 2014, pp. 9–16.
- D. Shankar, X. Lu, M. Wasi-ur Rahman, N. Islam, and D. K. Panda, “Characterizing and benchmarking stand-alone Hadoop MapReduce on modern HPC clusters,” The Journal of Supercomputing, Jun. 2016.
- Intel Corporation. (2015) Intel Rolls Out Enhanced Lustre File System.
- ——. (2015) Lustre at the Core of HPC and Big Data Convergence.
- (2015) Seagate Apache Hadoop on Lustre Connector.
- C. McDonald. (2015) Parallel and Iterative Processing for Machine Learning Recommendations with Spark.
- H. Li, A. Ghodsi, M. Zaharia, and E. Baldeschwieler, “Tachyon: Memory Throughput I/O for Cluster Computing Frameworks,” 2013.
- J. Sparks, H. Pritchard, and M. Dumler, “The Cray Framework for Hadoop for the Cray XC30,” Cray User Group Conference (CUG’14), 2014.
- S. Ghemawat, H. Gobioff, and S.-T. Leung, “The Google File System,” in SOSP ’03: Proceedings of the 19th ACM Symposium on Operating Systems Principles. ACM, 2003, pp. 29–43.
- K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The Hadoop Distributed File System,” in 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST). IEEE, 2010, pp. 1–10.
- F. Wang, S. Oral, G. Shipman, and O. Drokin, “Understanding lustre filesystem internals,” 2009.
- D. Moise, G. Antoniu, and L. Bougé, “Improving the Hadoop map/reduce framework to support concurrent appends through the BlobSeer BLOB management system,” 2010.
- V. K. Vavilapalli, S. Seth, B. Saha, C. Curino, O. O’Malley, S. Radia, B. Reed, E. Baldeschwieler, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, and H. Shah, “Apache Hadoop YARN,” in SoCC ’13: Proceedings of the 4th Annual Symposium on Cloud Computing. New York, New York, USA: ACM Press, 2013, pp. 1–16.
- W. Xu, W. Luo, and N. Woodward, “Analysis and optimization of data import with hadoop,” in 2012 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2012.
- S. Kipp. (2012) Exponential bandwidth growth and cost declines.
- D. J. Law, W. W. Diab, A. Healey, and S. B. Carlson, “IEEE 802.3 industry connections Ethernet bandwidth assessment,” 2012.
- T. P. Morgan, “InfiniBand Too Quick For Ethernet To Kill,” Apr. 2015.
- V. Meshram, X. Ouyang, and D. K. Panda, “Minimizing Lookup RPCs in Lustre File System using Metadata Delegation at Client Side,” Tech. Rep. OSU-CISRC-7/11, 2011.
- D. M. Stearman, “ZFS on RBODs Leveraging RAID Controllers for Metrics and Enclosure Management,” 2015.
- R. Brueckner, “Building a CIFS/NFS Gateway to Lustre - insideHPC,” Oct. 2014.
- K. V. Shvachko, “HDFS scalability: the limits to growth,” ;login: The Magazine of USENIX &amp; SAGE, vol. 35, no. 2, pp. 6–16, 2010.
- T. White. (2009) The Small Files Problem. [Online]. Available: http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
- X. Liu, J. Han, Y. Zhong, C. Han, and X. He, “Implementing WebGIS on Hadoop: A case study of improving small file I/O performance on HDFS,” in 2009 IEEE International Conference on Cluster Computing and Workshops. IEEE, 2009, pp. 1–8.
- M. Pershin. Intel Lustre Data on MDT/Small File I/O.
- S. Ihara. Lustre Metadata Fundamental Benchmark and Performance.
- Criteo. (2014) Kaggle display advertising challenge. [Online]. Available: http://labs.criteo.com/downloads/2014-kaggle-display-advertising-challenge-dataset