
A Solution to the Network Challenges of Data Recovery in Erasure-coded Distributed Storage Systems: A Study on the Facebook Warehouse Cluster (1309.0186v1)

Published 1 Sep 2013 in cs.NI, cs.DC, cs.IT, and math.IT

Abstract: Erasure codes, such as Reed-Solomon (RS) codes, are being increasingly employed in data centers to combat the cost of reliably storing large amounts of data. Although these codes provide optimal storage efficiency, they require significantly high network and disk usage during recovery of missing data. In this paper, we first present a study on the impact of recovery operations of erasure-coded data on the data-center network, based on measurements from Facebook's warehouse cluster in production. To the best of our knowledge, this is the first study of its kind available in the literature. Our study reveals that recovery of RS-coded data results in a significant increase in network traffic, more than a hundred terabytes per day, in a cluster storing multiple petabytes of RS-coded data. To address this issue, we present a new storage code using our recently proposed "Piggybacking" framework, that reduces the network and disk usage during recovery by 30% in theory, while also being storage optimal and supporting arbitrary design parameters. The implementation of the proposed code in the Hadoop Distributed File System (HDFS) is underway. We use the measurements from the warehouse cluster to show that the proposed code would lead to a reduction of close to fifty terabytes of cross-rack traffic per day.

Citations (297)

Summary

  • The paper demonstrates that RS-coded recovery operations generate over 100 TB of daily cross-rack traffic, heavily stressing data center networks.
  • It employs empirical analysis from Facebook's warehouse cluster, revealing that erasure-coding with RS codes significantly increases network and disk resource usage.
  • The proposed Piggybacking framework mitigates these issues by reducing network and disk usage by approximately 30%, offering a practical solution for storage optimization.

Network Effects of Erasure-Codes in Data Centers: A Study on the Facebook Warehouse Cluster and a Proposed Solution

The use of erasure codes, particularly Reed-Solomon (RS) codes, has become a prevalent strategy in data centers for storing large volumes of data reliably and efficiently. This paper conducts an in-depth analysis of the impact of RS-coded recovery operations on network infrastructure, based on production measurements from Facebook's warehouse cluster. The analysis reveals that while RS codes offer optimal storage efficiency, recovering missing data places a heavy load on the system's disk and network resources.

Numerical Analysis and Observations

The paper, grounded in empirical measurements from Facebook's production warehouse cluster, documents that RS-coded data recovery operations significantly burden the network, pushing daily recovery traffic beyond a hundred terabytes in a cluster storing multiple petabytes of RS-coded data. The RS configuration studied, (k=10, r=4), requires reading k = 10 full blocks from other nodes to reconstruct a single missing block, so recovery demands scale with the volume of data lost. The measurements show a marked surge in cross-rack network usage attributable to these recovery operations.
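The read amplification behind these numbers can be sketched with simple arithmetic. The block size and daily failure count below are hypothetical placeholders chosen only to land in the same order of magnitude as the reported traffic, not figures taken from the paper:

```python
# Back-of-the-envelope sketch of RS recovery traffic for a (k=10, r=4) code.
# BLOCK_MB and blocks_lost_per_day are hypothetical illustrative values.

K = 10                        # data blocks per stripe
BLOCK_MB = 256                # hypothetical block size in MB
blocks_lost_per_day = 50_000  # hypothetical number of blocks recovered daily

# Recovering one missing block of an RS(10, 4) stripe requires reading
# k whole blocks from other nodes: a 10x read amplification.
traffic_per_block_mb = K * BLOCK_MB
daily_traffic_tb = blocks_lost_per_day * traffic_per_block_mb / 1_000_000

print(f"read amplification: {K}x")
print(f"traffic per recovered block: {traffic_per_block_mb / 1024:.1f} GiB")
print(f"hypothetical daily recovery traffic: {daily_traffic_tb:.0f} TB")
```

Under these assumed parameters, the sketch yields roughly 128 TB of recovery traffic per day, consistent in scale with the "more than a hundred terabytes" the paper measures.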

Proposed Framework: Piggybacking

To mitigate the cost of RS-coded recovery operations, the authors present a new storage code built on their recently proposed Piggybacking framework. The framework reduces network and disk usage during data recovery by approximately 30% in theory, without compromising storage optimality, and supports arbitrary design parameters. An implementation of the proposed code in the Hadoop Distributed File System (HDFS) is underway, and the cluster measurements indicate that deploying it would cut cross-rack recovery traffic by close to fifty terabytes per day.
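The core idea of piggybacking can be illustrated with a toy example; this is a minimal sketch of the general principle, not the paper's actual construction, and the symbol names and values are invented for illustration. Two substripes of a (k=2, r=2) MDS code are stored together, and a data symbol from the first substripe is added onto one parity of the second. Repairing a failed node can then decode the second substripe cheaply and strip the known parity part off the piggybacked symbol:

```python
# Toy piggybacking sketch (not the paper's exact code): two substripes of a
# (k=2, r=2) MDS code over the integers, with substripe-1 symbol a1
# "piggybacked" onto one parity of substripe 2. All values are illustrative.

a1, a2 = 7, 3        # substripe 1 data
b1, b2 = 5, 11       # substripe 2 data

# Each of the 4 nodes stores one symbol per substripe.
# Parity functions: p1 = x1 + x2, p2 = x1 + 2*x2.
nodes = [
    (a1,          b1),                # node 1 (systematic)
    (a2,          b2),                # node 2 (systematic)
    (a1 + a2,     b1 + b2),           # node 3 (parity, clean)
    (a1 + 2 * a2, b1 + 2 * b2 + a1),  # node 4 (parity, piggybacked with a1)
]

# Repair node 1 after failure, reading only 3 symbols instead of the
# 2k = 4 a plain RS repair of both substripes would require.
r_b2 = nodes[1][1]           # read 1: b2
r_p1b = nodes[2][1]          # read 2: b1 + b2 (clean parity of substripe 2)
rec_b1 = r_p1b - r_b2        # substripe 2 fully decoded

r_p2b = nodes[3][1]          # read 3: b1 + 2*b2 + a1 (piggybacked parity)
rec_a1 = r_p2b - rec_b1 - 2 * r_b2   # strip the known parity part, leaving a1

print(rec_a1, rec_b1)        # recovers node 1's symbols with 3 reads, not 4
```

In this toy case the saving is 25%; the paper's construction, with larger parameters and piggybacks designed across many nodes, achieves the roughly 30% reduction reported.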

Implications for Data Centers

Implementing piggybacked codes provides a potential pathway to optimize existing distributed storage system infrastructures and enhance operational efficiency. The code maintains the vital properties of maximum distance separability (MDS), storage optimality, and allows flexibility in design parameters. The proposed enhancement supports broader deployment opportunities of erasure codes across clusters, allowing extensive cost savings related to storage capacity and efficiency.
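The MDS property mentioned above means any k of the k + r stored symbols suffice to reconstruct the data. A small sketch of this "any k of n" guarantee, using a toy (k=2, r=2) generator over the rationals with illustrative coefficients (not the paper's code), is:

```python
# Sketch of the MDS "any k of n" property on a toy (k=2, r=2) code.
# Generator rows and data values are illustrative, not from the paper.
from fractions import Fraction
from itertools import combinations

data = [Fraction(7), Fraction(3)]          # k = 2 data symbols

# Node i stores the inner product rows[i] . data; the last two rows
# are parity rows chosen so every 2x2 submatrix is invertible.
rows = [[1, 0], [0, 1], [1, 1], [1, 2]]
symbols = [sum(Fraction(c) * d for c, d in zip(r, data)) for r in rows]

def decode(r1, s1, r2, s2):
    """Solve the 2x2 linear system from any two (row, symbol) pairs."""
    det = Fraction(r1[0] * r2[1] - r1[1] * r2[0])
    x1 = (s1 * r2[1] - s2 * r1[1]) / det
    x2 = (r1[0] * s2 - r2[0] * s1) / det
    return [x1, x2]

# Every choice of k = 2 surviving nodes recovers the original data.
for i, j in combinations(range(4), 2):
    assert decode(rows[i], symbols[i], rows[j], symbols[j]) == data
print("any 2 of the 4 symbols suffice")
```

Because piggybacks only add already-encoded information onto parities, they can be stripped off during decoding, which is why the piggybacked code retains this MDS guarantee while reducing repair reads.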

Future Directions

The paper’s findings suggest several avenues for subsequent investigation. Given the increasing scale of data operations, there is an ongoing need to explore further reductions in recovery bandwidth while preserving fault tolerance in large-scale storage deployments. This line of research underscores the importance of evaluating such frameworks across different settings to better understand the interplay between coded storage and network demands.

By addressing these network-induced challenges, the paper makes a significant contribution to the discourse on data recovery efficiency in large-scale storage systems, offering a practical solution that mitigates network overload while sustaining reliability and availability. Future work will likely build on the initial success of the Piggybacking framework and explore its applications across diverse data-center environments.