
Unraveling codes: fast, robust, beyond-bound error correction for DRAM (2401.10688v2)

Published 19 Jan 2024 in cs.IT, cs.AR, and math.IT

Abstract: Generalized Reed-Solomon (RS) codes are a common choice for efficient, reliable error correction in memory and communications systems. These codes add $2t$ extra parity symbols to a block of memory, and can efficiently and reliably correct up to $t$ symbol errors in that block. Decoding is possible beyond this bound, but it is imperfectly reliable and often computationally expensive. Beyond-bound decoding is an important problem to solve for error-correcting Dynamic Random Access Memory (DRAM). These memories are often designed so that each access touches two extra memory devices, so that a failure in any one device can be corrected. But system architectures increasingly require DRAM to store metadata in addition to user data. When the metadata replaces parity data, a single-device failure is then beyond-bound. An error-correction system can either protect each access with a single RS code, or divide it into several segments protected with a shorter code, usually in an Interleaved Reed-Solomon (IRS) configuration. The full-block RS approach is more reliable, both at correcting errors and at preventing silent data corruption (SDC). The IRS option is faster, and is especially efficient at beyond-bound correction of single- or double-device failures. Here we describe a new family of "unraveling" Reed-Solomon codes that bridges the gap between these options. Our codes are full-block generalized RS codes, but they can also be decoded using an IRS decoder. As a result, they combine the speed and beyond-bound correction capabilities of interleaved codes with the robustness of full-block codes, including the ability of the latter to reliably correct failures across multiple devices. We show that unraveling codes are an especially good fit for high-reliability DRAM error correction.
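The bound described in the abstract can be illustrated with a little arithmetic. The sketch below is only an illustration under stated assumptions, not the paper's construction: it assumes every device contributes the same number of symbols to an access, and the function names and the figure of 4 symbols per device are hypothetical. It checks whether a whole-device failure stays within the guaranteed bound of $t$ symbol errors for $2t$ parity symbols, with and without metadata taking over parity space.

```python
def guaranteed_correctable(parity_symbols: int) -> int:
    """Symbol errors an RS code always corrects: t, given 2t parity symbols."""
    return parity_symbols // 2


def device_failure_within_bound(symbols_per_device: int,
                                parity_devices: int,
                                metadata_symbols: int = 0) -> bool:
    """True if losing one whole device stays within the guaranteed bound.

    Assumption (not from the paper): each device supplies
    `symbols_per_device` symbols per access, and metadata is stored in
    place of `metadata_symbols` parity symbols.
    """
    parity = symbols_per_device * parity_devices - metadata_symbols
    if parity < 0:
        raise ValueError("metadata cannot exceed the available parity space")
    # A single-device failure can corrupt every symbol that device supplies.
    return symbols_per_device <= guaranteed_correctable(parity)


if __name__ == "__main__":
    # Hypothetical layout: 4 symbols per device, two extra devices per access.
    print(device_failure_within_bound(4, 2))                      # full parity
    print(device_failure_within_bound(4, 2, metadata_symbols=2))  # metadata replaces parity
```

With two full parity devices the failure is within the bound; once two parity symbols are given to metadata, only three symbol errors are guaranteed correctable, so the same four-symbol device failure becomes a beyond-bound case.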

