Unraveling codes: fast, robust, beyond-bound error correction for DRAM (2401.10688v2)
Abstract: Generalized Reed-Solomon (RS) codes are a common choice for efficient, reliable error correction in memory and communications systems. These codes add $2t$ extra parity symbols to a block of memory, and can efficiently and reliably correct up to $t$ symbol errors in that block. Decoding is possible beyond this bound, but it is imperfectly reliable and often computationally expensive. Beyond-bound decoding is an important problem to solve for error-correcting Dynamic Random Access Memory (DRAM). These memories are often designed so that each access touches two extra memory devices, so that a failure in any one device can be corrected. But system architectures increasingly require DRAM to store metadata in addition to user data. When the metadata replaces parity data, a single-device failure is then beyond-bound. An error-correction system can either protect each access with a single RS code, or divide it into several segments protected with a shorter code, usually in an Interleaved Reed-Solomon (IRS) configuration. The full-block RS approach is more reliable, both at correcting errors and at preventing silent data corruption (SDC). The IRS option is faster, and is especially efficient at beyond-bound correction of single- or double-device failures. Here we describe a new family of "unraveling" Reed-Solomon codes that bridges the gap between these options. Our codes are full-block generalized RS codes, but they can also be decoded using an IRS decoder. As a result, they combine the speed and beyond-bound correction capabilities of interleaved codes with the robustness of full-block codes, including the ability of the latter to reliably correct failures across multiple devices. We show that unraveling codes are an especially good fit for high-reliability DRAM error correction.
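The bound described in the abstract can be illustrated with a little arithmetic. The sketch below is not taken from the paper: the layout (8 symbols per device per access, 2 extra devices) and the helper names are hypothetical, chosen only to show how giving up a few parity symbols to metadata pushes a single-device failure past the guaranteed-correction bound $t$.

```python
def guaranteed_correctable(parity_symbols: int) -> int:
    """Symbol errors an RS code is guaranteed to correct: t = floor(parity / 2)."""
    return parity_symbols // 2


def device_failure_within_bound(symbols_per_device: int, parity_symbols: int) -> bool:
    """True if losing one whole device stays within the guaranteed bound t."""
    return symbols_per_device <= guaranteed_correctable(parity_symbols)


# Hypothetical layout (an assumption for illustration, not from the paper):
# each device contributes 8 symbols per access, and 2 extra devices hold parity.
SYMBOLS_PER_DEVICE = 8
EXTRA_DEVICES = 2
FULL_PARITY = EXTRA_DEVICES * SYMBOLS_PER_DEVICE  # 2t = 16 parity symbols

if __name__ == "__main__":
    # Compare no metadata against metadata taking over 2 parity symbols.
    for metadata_symbols in (0, 2):
        parity = FULL_PARITY - metadata_symbols
        t = guaranteed_correctable(parity)
        ok = device_failure_within_bound(SYMBOLS_PER_DEVICE, parity)
        status = "within bound" if ok else "beyond bound"
        print(f"metadata={metadata_symbols} parity={parity} t={t}: "
              f"single-device failure is {status}")
```

Under these assumed numbers, the full 16 parity symbols give $t = 8$, which covers one failed device. Once metadata takes 2 of them, $t$ drops to 7, and the same failure can only be handled by beyond-bound decoding.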