Spinner: Scalable Graph Partitioning in the Cloud (1404.3861v2)

Published 15 Apr 2014 in cs.DC

Abstract: Several organizations, like social networks, store and routinely analyze large graphs as part of their daily operation. Such graphs are typically distributed across multiple servers, and graph partitioning is critical for efficient graph management. Existing partitioning algorithms focus on finding graph partitions with good locality, but disregard the pragmatic challenges of integrating partitioning into large-scale graph management systems deployed on the cloud, such as dealing with the scale and dynamicity of the graph and the compute environment. In this paper, we propose Spinner, a scalable and adaptive graph partitioning algorithm based on label propagation designed on top of the Pregel model. Spinner scales to massive graphs, produces partitions with locality and balance comparable to the state-of-the-art and efficiently adapts the partitioning upon changes. We describe our algorithm and its implementation in the Pregel programming model that makes it possible to partition billion-vertex graphs. We evaluate Spinner with a variety of synthetic and real graphs and show that it can compute partitions with quality comparable to the state-of-the art. In fact, by using Spinner in conjunction with the Giraph graph processing engine, we speed up different applications by a factor of 2 relative to standard hash partitioning.


Summary

The paper introduces Spinner, a scalable and adaptive graph partitioning algorithm built on the Pregel model that balances partitioning quality and computational cost for massive cloud graphs.
Used with the Giraph graph processing engine, Spinner speeds up applications by up to a factor of 2 compared to hash partitioning, and it maintains strong locality and load balance on billion-vertex graphs.
Spinner's adaptive approach efficiently handles dynamic graph changes, reducing update times by over 85%, making it highly practical for real-world cloud-based graph applications.

Spinner: Scalable Graph Partitioning in the Cloud

Managing and analyzing large-scale graphs is a routine operation for organizations with data-intensive applications, such as social networks, web graphs, or biological networks. Because these graphs are typically distributed across multiple servers, efficient graph partitioning is central to keeping computation costs low and the system scalable. The task becomes harder when partitioning has to be integrated into large-scale graph management systems in the cloud, where both the graph and the compute environment change over time. Spinner is proposed to address this setting.

Overview

Spinner is a scalable and adaptive graph partitioning algorithm designed to address the practical challenges that existing algorithms overlook. The authors build Spinner on the Pregel model and base it on the label propagation algorithm (LPA), in which each vertex repeatedly adopts the label that is most common among its neighbors. Because every vertex decides using only information from its neighbors, Spinner avoids the two main drawbacks of state-of-the-art methods, which either incur high computational costs or require a global view of the graph. This lets it scale to billion-vertex graphs without severely compromising locality or balance.
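The idea of label propagation with a balance constraint can be sketched as follows. This is a minimal single-machine illustration, not the paper's Pregel implementation: it updates vertices one after another instead of in synchronized supersteps, and the score (fraction of neighbors with a label minus that partition's load relative to its capacity), the degree-based load, and the `capacity_slack` parameter are assumptions made for this example.

```python
import random


def label_propagation_partition(adj, k, capacity_slack=1.05, iterations=20, seed=0):
    """Assign each vertex of an undirected graph to one of k partitions.

    adj maps every vertex to a list of its neighbours.
    Assumption: the load of a partition is the sum of its vertices' degrees.
    """
    rng = random.Random(seed)
    labels = {v: rng.randrange(k) for v in adj}      # random initial labels
    total_degree = sum(len(ns) for ns in adj.values())
    capacity = capacity_slack * total_degree / k     # per-partition load budget
    load = [0] * k
    for v, label in labels.items():
        load[label] += len(adj[v])

    for _ in range(iterations):
        changed = False
        for v in adj:
            deg = len(adj[v])
            if deg == 0:
                continue                             # isolated vertex: nothing to gain
            counts = [0] * k                         # neighbours per label
            for u in adj[v]:
                counts[labels[u]] += 1
            current = labels[v]

            def score(label):
                # Locality gain minus a penalty for partitions that are already full.
                return counts[label] / deg - load[label] / capacity

            # Ties are broken in favour of the current label to avoid needless moves.
            best = max(range(k), key=lambda label: (score(label), label == current))
            if best != current and load[best] + deg <= capacity:
                load[current] -= deg
                load[best] += deg
                labels[v] = best
                changed = True
        if not changed:
            break                                    # no vertex moved in this pass
    return labels


# Two triangles joined by a single edge, split into two partitions.
example_graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
example_labels = label_propagation_partition(example_graph, k=2)
```

In a Pregel-style system each vertex would instead receive its neighbors' labels as messages and all vertices would update within the same superstep, so the sequential loop above only conveys the scoring idea.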

The implementation of Spinner in Apache Giraph shows that it can partition massive graphs efficiently. The primary focus is the trade-off between computational cost and partitioning quality, two competing goals that both matter in cloud-based systems.

Numerical Results and Contributions

The authors evaluate Spinner against other partitioning approaches on synthetic and real graphs. Experiments show that Spinner can speed up applications in graph systems like Giraph by up to a factor of 2 relative to standard hash partitioning. Across datasets ranging from millions to billions of vertices, Spinner achieves competitive locality (the ratio of local edges to total edges ranges from 0.31 to 0.85) and balanced partition load (maximum normalized load between 1.02 and 1.05). These results indicate that Spinner produces partitions with quality comparable to the state of the art while scaling to massive graphs.
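The two quality metrics mentioned above can be computed as in the following sketch. It assumes an undirected edge list and that a partition's load is the sum of the degrees of its vertices; the function name `partition_quality` and this load definition are illustrative choices, not taken from the paper.

```python
def partition_quality(edges, labels, k):
    """Return (locality, max_normalized_load) for a partitioning.

    edges: list of (u, v) pairs of an undirected graph.
    labels: dict mapping each vertex to a partition id in range(k).
    """
    # Locality: fraction of edges whose endpoints share a partition.
    local_edges = sum(1 for u, v in edges if labels[u] == labels[v])
    locality = local_edges / len(edges)

    # Load: each edge contributes one unit to the partition of each endpoint.
    load = [0] * k
    for u, v in edges:
        load[labels[u]] += 1
        load[labels[v]] += 1

    # Normalize the heaviest partition by the load of a perfectly even split.
    ideal_load = 2 * len(edges) / k
    return locality, max(load) / ideal_load


# Two triangles joined by one edge, each triangle in its own partition.
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
labels = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
locality, max_load = partition_quality(edges, labels, k=2)
# 6 of the 7 edges are local, and both partitions carry the same load.
```

A maximum normalized load of 1.0 means perfect balance, so the reported range of 1.02 to 1.05 corresponds to the heaviest partition carrying 2% to 5% more than its even share.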

Spinner also adapts to changes in the graph and in the available compute resources, reducing update times by over 85% in certain scenarios. It does so by computing the new partitioning incrementally: it starts from the previous partitioning instead of recomputing everything from scratch.
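One way to picture the incremental approach is a warm start: existing vertices keep their previous labels and only new vertices need an initial assignment before the partitioning algorithm resumes. The sketch below illustrates that idea under stated assumptions. The function `warm_start_labels` and the rule of placing new vertices on the least loaded partition are hypothetical, and the initialization rule used in the paper may differ.

```python
def warm_start_labels(adj, previous_labels, k):
    """Build initial labels for an updated graph from an earlier partitioning.

    adj: dict mapping each vertex of the updated graph to its neighbour list.
    previous_labels: dict of labels computed for the old graph.
    Assumption: load is the sum of vertex degrees, and vertices without a
    usable old label go to the currently least loaded partition.
    """
    load = [0] * k
    labels = {}

    # Keep every old label that is still valid for the current partition count.
    for v in adj:
        old = previous_labels.get(v)
        if old is not None and 0 <= old < k:
            labels[v] = old
            load[old] += len(adj[v])

    # Place the remaining (new or orphaned) vertices greedily.
    for v in adj:
        if v not in labels:
            target = min(range(k), key=lambda label: load[label])
            labels[v] = target
            load[target] += len(adj[v])
    return labels


# Vertex 3 is new; vertices 0 to 2 keep their earlier assignment.
updated_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
initial = warm_start_labels(updated_graph, {0: 0, 1: 0, 2: 1}, k=2)
```

Starting from labels that are already mostly good means that only vertices near the change are likely to move, which is the intuition behind the shorter update times.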

Implications and Future Developments

The practical implication of Spinner lies in its applicability to real-world scenarios where constant graph updates are the norm, especially for cloud-based, data-intensive applications. By maintaining partition quality as the graph changes, Spinner reduces network traffic and keeps computational load balanced, both of which improve performance and reduce costs in graph management systems.

From a theoretical perspective, Spinner advances the graph partitioning domain by implementing a robust adaptive methodology, positioning itself as a viable alternative to hash partitioning in distributed environments. The incorporation of Spinner in cloud systems emphasizes the growing need for scalability and adaptability in data management frameworks.

Future developments in graph partitioning could explore further enhancements in Spinner's adaptive capabilities, incorporating machine learning models to predict optimal partitioning strategies based on graph evolution patterns. This advancement can pave the way for more autonomous graph systems capable of real-time partition adjustments with minimal human intervention.

In conclusion, Spinner addresses fundamental challenges in scalable graph partitioning through a distributed, adaptable approach, positioning itself as an instrumental advancement for cloud-based graph management systems. Its efficient balance between locality and load distribution in massive-scale environments offers significant enhancements over existing methods, enabling more streamlined operations in dynamically evolving data landscapes.
