Boosting the Performance of Degraded Reads in RS-coded Distributed Storage Systems (2306.10528v1)
Abstract: Reed-Solomon (RS) codes have been increasingly adopted by distributed storage systems in place of replication, because they provide the same level of availability with much lower storage overhead. However, a key drawback of RS-coded distributed storage systems is the poor latency of degraded reads, which are triggered by data failures or hot spots and are not rare in production environments. To address this issue, we propose a novel parallel reconstruction solution called APLS. APLS leverages all surviving source nodes to send the data needed by degraded reads and chooses lightly loaded starter nodes to receive the reconstructed data of those degraded reads, thereby reducing degraded-read latency. Prototype-based experiments compare APLS with ECPipe, the state-of-the-art solution for improving the latency of degraded reads. The experimental results demonstrate that APLS effectively reduces the latency, particularly under heavy or medium workloads.
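The two ideas named in the abstract (spreading the sending work over all surviving nodes and picking a lightly loaded starter) can be illustrated with a toy planner. This is only a sketch under stated assumptions, not the APLS algorithm from the paper: the function `plan_degraded_read`, the round-robin helper rotation, the per-node `loads` metric, and the choice of the starter from among the surviving nodes are all hypothetical, and the actual RS decoding arithmetic is omitted.

```python
from collections import Counter


def plan_degraded_read(n, k, failed, loads, num_slices):
    """Toy plan for a degraded read under an (n, k) RS code.

    Assumptions (not from the paper): the lost chunk is split into
    `num_slices` slices, each slice is rebuilt from k helpers, and the
    helper window rotates so every surviving node contributes.
    `loads` maps node id -> current load (hypothetical metric).
    """
    survivors = [node for node in range(n) if node != failed]
    if len(survivors) < k:
        raise ValueError("need at least k surviving nodes to reconstruct")

    # Pick the least-loaded surviving node to receive reconstructed data.
    starter = min(survivors, key=lambda node: loads[node])

    # Rotate a window of k helpers over all survivors, one window per slice.
    plan = []
    for s in range(num_slices):
        helpers = [survivors[(s + j) % len(survivors)] for j in range(k)]
        plan.append(helpers)
    return starter, plan


# Example: a (6, 4) code with node 2 unavailable.
loads = {0: 5, 1: 3, 2: 0, 3: 1, 4: 7, 5: 2}
starter, plan = plan_degraded_read(n=6, k=4, failed=2, loads=loads, num_slices=5)
sends_per_node = Counter(node for helpers in plan for node in helpers)
print("starter:", starter)
print("slices sent per node:", dict(sends_per_node))
```

In this toy setting, rotating the helper window means no single group of k nodes carries the whole read, which is the intuition behind using all surviving source nodes rather than a fixed subset.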