The Anatomy of Big Data Computing (1509.01331v1)

Published 4 Sep 2015 in cs.DC

Abstract: Advances in information technology and its widespread growth in several areas of business, engineering, medical and scientific studies are resulting in information/data explosion. Knowledge discovery and decision making from such rapidly growing voluminous data is a challenging task in terms of data organization and processing, which is an emerging trend known as Big Data Computing; a new paradigm which combines large scale compute, new data intensive techniques and mathematical models to build data analytics. Big Data computing demands a huge storage and computing for data curation and processing that could be delivered from on-premise or clouds infrastructures. This paper discusses the evolution of Big Data computing, differences between traditional data warehousing and Big Data, taxonomy of Big Data computing and underpinning technologies, integrated platform of Big Data and Clouds known as Big Data Clouds, layered architecture and components of Big Data Cloud and finally discusses open technical challenges and future directions.

Summary

The paper contrasts traditional data systems with Big Data paradigms, using the 5Vs to explain why real-time processing is needed.
The paper details a taxonomy of technologies including Hadoop, Apache Spark, and various NoSQL databases for managing diverse data streams.
The paper presents a layered architecture for Big Data Clouds and outlines a 4D framework to guide future research in analytics and decision-making.

An In-Depth Analysis of "The Anatomy of Big Data Computing"

This paper, authored by Kune et al., provides a comprehensive analysis of Big Data computing, addressing its evolution, characteristics, underlying technologies, and future directions. The authors begin by highlighting the disparity between traditional data warehousing systems and the more recent Big Data paradigms. This distinction is primarily driven by the 5Vs of Big Data: Volume, Velocity, Variety, Veracity, and Value, which contrast sharply with traditional data systems that focus on structured data and pre-determined analytics.

Big Data Characteristics and Traditional Data Systems

Traditional databases rely on structured, transaction-oriented data management built on OLTP and OLAP frameworks, aimed at smaller, consistent data volumes. In contrast, Big Data systems are designed to handle vast and varied data streams from diverse sources in real time and near real time. These systems skip much of the cleansing and transformation typical of traditional systems. They are designed around Brewer’s CAP theorem (a distributed store cannot guarantee consistency, availability, and partition tolerance all at once) and the BASE properties (basically available, soft state, eventual consistency), favoring availability and partition tolerance over strict consistency.
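The availability-over-consistency trade-off described above can be illustrated with a toy example. The following is a minimal sketch, not taken from the paper: the `Replica` class, the `sync` helper, and the last-write-wins merge rule are all assumptions chosen for illustration. Each replica accepts writes on its own, reads may be stale for a while, and the replicas converge once they exchange state.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One node of a toy replicated key-value store (hypothetical)."""
    name: str
    # Maps key -> (logical timestamp, value).
    data: dict = field(default_factory=dict)

    def write(self, key, value, timestamp):
        # Accept the write locally without contacting other replicas,
        # so the node stays available even during a network partition.
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]

    def merge_from(self, other):
        # Anti-entropy step: adopt any entry that is newer on the other node
        # (last-write-wins by timestamp, an assumption of this sketch).
        for key, (timestamp, value) in other.data.items():
            self.write(key, value, timestamp)


def sync(a, b):
    """Exchange state in both directions so both replicas converge."""
    a.merge_from(b)
    b.merge_from(a)


a, b = Replica("a"), Replica("b")
a.write("cart", ["book"], timestamp=1)
sync(a, b)

# During a partition, each side accepts a write independently.
a.write("cart", ["book", "pen"], timestamp=2)
b.write("cart", ["book", "lamp"], timestamp=3)
stale_read = a.read("cart")  # replica a has not yet seen b's newer write

sync(a, b)  # the partition heals and the replicas exchange state
converged = a.read("cart") == b.read("cart")
```

Note that last-write-wins silently discards the older concurrent write (the "pen" item here). Real systems often use other conflict-resolution strategies, so this rule should be read as one simple option rather than a recommendation.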

The Taxonomy and Technologies of Big Data

The paper lays out a detailed taxonomy of Big Data that covers distributed file systems, open-source frameworks such as Hadoop and Apache Spark, and commercial frameworks from providers such as Google and Amazon. The taxonomy also covers programming models such as MapReduce, a range of analytics methods, and the security measures needed to address the vulnerabilities of Big Data systems.
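To make the MapReduce programming model mentioned above concrete, here is a minimal single-process sketch of a word count. It is illustrative only: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and do not correspond to the Hadoop API, and a real framework would run the phases in parallel across many machines.

```python
from collections import defaultdict


def map_phase(document):
    # Map: emit one (word, 1) pair for every word in the input.
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Reduce: combine the values for one key into a single result.
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big compute", "data clouds"])
```

Because each call to `map_phase` and `reduce_phase` depends only on its own input, a framework is free to distribute those calls across nodes, which is what makes the model scale.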

The technologies enabling Big Data computing are vast and varied, including file management, query schedulers, NoSQL databases, and more. Key-value stores, document-oriented databases, and graph databases are investigated for their role in managing the diversity of data typical in Big Data environments.
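The three NoSQL data models named above differ mainly in how much structure the store can see. The sketch below uses plain Python data structures as stand-ins for real databases; the records, field names, and helper functions are made up for illustration.

```python
import json

# Key-value model: the store sees only an opaque value behind each key.
kv_store = {"user:42": json.dumps({"name": "Ada", "city": "Pune"})}

# Document model: values are nested records that can be queried by field.
documents = [
    {"_id": 42, "name": "Ada", "address": {"city": "Pune"}, "tags": ["admin"]},
    {"_id": 43, "name": "Lin", "address": {"city": "Oslo"}, "tags": []},
]


def find_by_city(docs, city):
    # Query on a nested field, which a pure key-value store cannot do.
    return [d["name"] for d in docs if d["address"]["city"] == city]


# Graph model: relationships are first-class and are queried by traversal.
follows = {"Ada": {"Lin"}, "Lin": {"Sam"}, "Sam": set()}


def reachable(graph, start):
    # Depth-first traversal collecting every node reachable from start.
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen


city_from_kv = json.loads(kv_store["user:42"])["city"]
names_in_pune = find_by_city(documents, "Pune")
ada_network = reachable(follows, "Ada")
```

The key-value lookup needs the exact key and must decode the value itself, the document query filters on a nested field, and the graph query follows relationships across records.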

Big Data in Cloud Computing

The authors articulate the convergence of Big Data and Cloud computing, resulting in the "Big Data Clouds" paradigm. This integration leverages the elastic and scalable nature of Cloud infrastructures to handle the extensive compute and storage demands of Big Data workloads. The paper discusses variations in deployment, such as public, private, and hybrid Big Data Clouds, each offering different levels of control, cost, and scalability.
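As a rough illustration of the elasticity described above, the following sketch sizes a worker pool from the current backlog. The function name, parameters, and default limits are hypothetical and are not drawn from the paper or from any cloud provider's API.

```python
import math


def workers_needed(pending_tasks, tasks_per_worker, min_workers=1, max_workers=100):
    """Return how many workers to run for the current backlog (toy policy)."""
    if tasks_per_worker <= 0:
        raise ValueError("tasks_per_worker must be positive")
    # Scale out with demand, rounding up so no task is left unassigned.
    wanted = math.ceil(pending_tasks / tasks_per_worker)
    # Clamp to a floor (always keep some capacity) and a ceiling (cap cost).
    return max(min_workers, min(max_workers, wanted))


idle = workers_needed(0, tasks_per_worker=10)
busy = workers_needed(95, tasks_per_worker=10)
capped = workers_needed(5000, tasks_per_worker=10)
```

The ceiling reflects the cost and control trade-off between deployment models: a private cloud has a hard capacity limit, while a public cloud typically has a budget limit instead.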

Layered Architecture for Big Data Clouds

A layered architecture is presented, consisting of infrastructure, platform, fabric, and analytics layers. This structure emphasizes the fluidity and flexibility necessary to accommodate the dynamic requirements of Big Data processing. Each layer provides distinct functionalities, from cloud infrastructure and management to advanced analytics services that allow stakeholders to develop and deploy data-driven solutions efficiently.

Open Challenges and Future Directions

Addressing the future of Big Data computing, the authors identify several areas ripe for further research, encapsulated in their 4D research elements framework: Depository (storage technologies), Devise (platforms and modeling frameworks), Domain (specific applications), and Determine (analytics and decision-making processes). They stress the need to develop indexing mechanisms, privacy-preserving measures, and more robust statistical models to better harness and interpret the data flood.

Conclusion

The paper provides a detailed and structured exploration of the current state and trajectory of Big Data computing. It conveys the critical role of emerging technologies and platforms in reshaping data analysis and decision-making across industries. The combination of Cloud resources and Big Data paradigms signals a shift towards more integrated and scalable computing solutions. The research directions highlighted in the paper will be important for optimizing Big Data solutions so that they extract actionable insights efficiently.