Big Data Computing and Clouds: Trends and Future Directions

Published 17 Dec 2013 in cs.DC | (1312.4722v2)

Abstract: This paper discusses approaches and environments for carrying out analytics on Clouds for Big Data applications. It revolves around four important areas of analytics and Big Data, namely (i) data management and supporting architectures; (ii) model development and scoring; (iii) visualisation and user interaction; and (iv) business models. Through a detailed survey, we identify possible gaps in technology and provide recommendations for the research community on future directions on Cloud-supported Big Data computing and analytics solutions.

Summary

The paper's main contribution is its comprehensive analysis of how cloud computing enhances Big Data analytics through optimized data management and scalable model building.
It evaluates frameworks like Hadoop MapReduce and APIs such as the Google Prediction API, illustrating methods that improve data integration and predictive modeling.
The study recommends advancements in interactive visualization and business models, paving the way for efficient, cost-effective analytics services.

The paper "Big Data Computing and Clouds: Trends and Future Directions" surveys how Big Data analytics and Cloud computing interact, covering both the current state of the field and its likely direction. It is structured around four areas: data management and supporting architectures, model development and scoring, visualization and user interaction, and business models. For each area, it identifies gaps in current technology and offers recommendations for future research and application.

Data Management

Managing data effectively is central to Big Data analytics. The paper discusses data storage, integration, and processing, and emphasizes the importance of locality in data-intensive computations, that is, running computation close to where the data resides. It evaluates the Google File System (GFS), the Hadoop Distributed File System (HDFS), and POSIX-based cluster file systems for their capacity to handle massive datasets. Hadoop and its MapReduce programming model are highlighted for exploiting data locality to improve performance. The survey also stresses that different analytics applications call for different data handling approaches, from structured data in relational databases to unstructured data better served by NoSQL stores.

Key issues discussed include:

Data Storage: Object stores such as Amazon Simple Storage Service (S3) are analyzed for their scalability and redundancy benefits. The paper notes, however, that current Cloud solutions are largely batch-oriented and often lack real-time interactivity.
Data Integration: The complexity of integrating diverse data sources is addressed, with Apache Hive and in-database processing presented as ways to reduce overheads and data silos.
Data Processing: The prominence of the MapReduce framework is underscored, alongside libraries such as Apache Mahout for machine learning on Cloud platforms.
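
To make the MapReduce programming model mentioned above more concrete, here is a minimal single-process sketch in Python. It is not Hadoop code and does not distribute any work; it only mirrors the map, shuffle, and reduce phases on a word-count example. The function names (map_words, reduce_counts, run_mapreduce) and the sample input are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce phase: sum all counts that were emitted for one word."""
    return word, sum(counts)


def run_mapreduce(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Run map, shuffle, and reduce sequentially in a single process."""
    # Shuffle phase: group every emitted value under its key.
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce phase: one reducer call per distinct key.
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


# Hypothetical input: in a real cluster each line could live on a different node.
lines = ["big data on clouds", "clouds for big data analytics"]
word_counts = run_mapreduce(lines, map_words, reduce_counts)
print(word_counts)
```

In a framework such as Hadoop, the map calls would be scheduled on the nodes that already hold the input blocks, which is where the data-locality benefit comes from; this sketch omits that scheduling entirely.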

Model Building and Scoring

The paper explores methodologies for constructing and validating analytical models using Cloud resources. Techniques for model building and scoring draw on infrastructure as a service (IaaS) and software as a service (SaaS) offerings, with APIs supporting tasks such as predictive analytics. The Google Prediction API and the Apache Mahout project are notable mentions, offering scalable tools for machine learning and data analysis.

Highlights include:

Predictive Models: The use of Predictive Model Markup Language (PMML) to define and exchange information about predictive models and the deployment of these models in the Cloud.
Scalability: Emphasis on the Cloud's ability to scale resources dynamically, ensuring models can be built and validated efficiently even with growing datasets.
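
As a rough illustration of how a PMML document can carry a model from the tool that built it to the service that scores it, the sketch below parses a small, hand-written PMML regression model and applies it to one record. The model, its field names (age, income_k), its coefficients, and the helper score_regression are all invented for this example; only the standard-library XML parser is used, and a production system would more likely rely on a dedicated PMML scoring engine.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Namespace of PMML 4.1 documents.
PMML_NS = {"pmml": "http://www.dmg.org/PMML-4_1"}

# Hypothetical, trimmed-down linear regression model (not from the paper).
PMML_DOC = """<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <Header description="illustrative model"/>
  <DataDictionary numberOfFields="3">
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="income_k" optype="continuous" dataType="double"/>
    <DataField name="spend" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression" modelName="spend_model">
    <MiningSchema>
      <MiningField name="age"/>
      <MiningField name="income_k"/>
      <MiningField name="spend" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="10.0">
      <NumericPredictor name="age" coefficient="0.5"/>
      <NumericPredictor name="income_k" coefficient="2.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""


def score_regression(pmml_text: str, record: Dict[str, float]) -> float:
    """Score one record against the first RegressionTable in a PMML document."""
    root = ET.fromstring(pmml_text)
    table = root.find(".//pmml:RegressionTable", PMML_NS)
    if table is None:
        raise ValueError("no RegressionTable found in PMML document")
    prediction = float(table.get("intercept", "0"))
    for predictor in table.findall("pmml:NumericPredictor", PMML_NS):
        # The exponent attribute is optional and defaults to 1.
        exponent = int(predictor.get("exponent", "1"))
        value = record[predictor.get("name")]
        prediction += float(predictor.get("coefficient")) * value ** exponent
    return prediction


# 10.0 + 0.5 * 40 + 2.0 * 50
predicted_spend = score_regression(PMML_DOC, {"age": 40.0, "income_k": 50.0})
print(predicted_spend)
```

Because the model is plain XML, the scoring side needs no knowledge of the tool that trained it, which is the portability argument behind exchanging models in PMML.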

Visualization and User Interaction

Visualizing the results of Big Data analysis is crucial for interpretation and decision-making. The paper critiques the current interactivity limitations of Cloud-based analytics tools. Enhancements are proposed to improve the user experience, particularly through the integration of sophisticated visualization techniques that require both software optimizations and hardware advancements, such as the use of high-resolution display walls.

Key findings involve:

Interactive Visualization: The inadequacy of batch processing for interactive use and the demand for more immediate visual feedback in analytics workflows.
Customization: Tools like IBM's ManyEyes and FusionCharts enable users to visualize data in various formats, aiding in descriptive, predictive, and prescriptive analyses.

Business Models and Non-Technical Challenges

Beyond the technical landscape, the paper investigates business models that could facilitate the adoption of Cloud-based Big Data analytics. It considers the potential of providing analytics as a service (AaaS) or Big Data as a service (BDaaS), where services range from hosting customer analytics on shared platforms to offering full-stack analytics solutions.

Key discussions include:

Service Level Agreements (SLAs): The intricacies of defining SLAs for analytics services, which must account for data quality, execution time, and reliability.
Cost-Effectiveness: Multi-tenancy solutions to distribute costs and provide affordable analytics capabilities to smaller organizations.
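
One way to picture an SLA that covers data quality, execution time, and reliability is as a set of thresholds checked against each job's measured outcome. The sketch below is a hypothetical illustration only: the class names, field names, and threshold values are assumptions, not terms defined in the paper or by any provider.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AnalyticsSla:
    """Hypothetical SLA terms for an analytics service."""
    max_execution_seconds: float   # execution-time bound per job
    min_availability: float        # reliability, as a fraction between 0 and 1
    min_data_completeness: float   # data quality, fraction of usable records


@dataclass(frozen=True)
class JobReport:
    """Measured values for one completed analytics job."""
    execution_seconds: float
    availability: float
    data_completeness: float


def sla_violations(sla: AnalyticsSla, report: JobReport) -> List[str]:
    """Return the names of the SLA dimensions that the job failed to meet."""
    violations: List[str] = []
    if report.execution_seconds > sla.max_execution_seconds:
        violations.append("execution_time")
    if report.availability < sla.min_availability:
        violations.append("reliability")
    if report.data_completeness < sla.min_data_completeness:
        violations.append("data_quality")
    return violations


# Illustrative numbers only.
sla = AnalyticsSla(max_execution_seconds=600.0,
                   min_availability=0.999,
                   min_data_completeness=0.98)
report = JobReport(execution_seconds=720.0,
                   availability=0.9995,
                   data_completeness=0.95)
violations = sla_violations(sla, report)
print(violations)
```

Keeping the three dimensions separate makes it visible which part of the agreement was missed, for example a job that finished late on incomplete data while the service itself stayed available.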

Future Directions

Looking ahead, the paper stresses the need for enhanced interactivity in analytics tools, robust integration frameworks, and standards for model validation and exchange. Addressing these aspects is pivotal for fostering a competitive market where providers can offer analytics services without vendor lock-in, enabling cost- and performance-driven decisions.

Implications

The research has practical implications for how organizations harness Big Data for competitive advantage. Its theoretical contributions lay a foundation for a coherent framework that integrates disparate data sources, optimizes large-scale data processing, and leverages Cloud resources efficiently. Its broadest claim is that combining technical improvements with suitable business strategies could democratize access to powerful analytics capabilities.

Conclusion

In conclusion, the paper offers a detailed exploration of the symbiosis between Big Data analytics and Cloud computing, identifying critical challenges and outlining a path forward for researchers and practitioners alike. By addressing gaps in data management, model building, visualization, and business application, the paper elucidates how Cloud technologies can be leveraged to enhance the effectiveness and accessibility of Big Data analytics.