
Sectored DRAM: A Practical Energy-Efficient and High-Performance Fine-Grained DRAM Architecture (2207.13795v4)

Published 27 Jul 2022 in cs.AR

Abstract: We propose Sectored DRAM, a new, low-overhead DRAM substrate that reduces wasted energy by enabling fine-grained DRAM data transfers and DRAM row activation. Sectored DRAM leverages two key ideas to enable fine-grained data transfers and row activation at low chip area cost. First, a cache block transfer between main memory and the memory controller happens in a fixed number of clock cycles where only a small portion of the cache block (a word) is transferred in each cycle. Sectored DRAM augments the memory controller and the DRAM chip to execute cache block transfers in a variable number of clock cycles based on the workload access pattern with minor modifications to the memory controller's and the DRAM chip's circuitry. Second, a large DRAM row, by design, is already partitioned into smaller independent physically isolated regions. Sectored DRAM provides the memory controller with the ability to activate each such region based on the workload access pattern via small modifications to the DRAM chip's array access circuitry. Activating smaller regions of a large row relaxes DRAM power delivery constraints and allows the memory controller to schedule DRAM accesses faster. Compared to a system with coarse-grained DRAM, Sectored DRAM reduces the DRAM energy consumption of highly-memory-intensive workloads by up to (on average) 33% (20%) while improving their performance by up to (on average) 36% (17%). Sectored DRAM's DRAM energy savings, combined with its system performance improvement, allows system-wide energy savings of up to 23%. Sectored DRAM's DRAM chip area overhead is 1.7% the area of a modern DDR4 chip. We hope and believe that Sectored DRAM's ideas and results will help to enable more efficient and high-performance memory systems. To this end, we open source Sectored DRAM at https://github.com/CMU-SAFARI/Sectored-DRAM.
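To make the mechanism described in the abstract more concrete, below is a minimal Python sketch (not the authors' implementation) of the two key ideas: a per-request sector mask that shortens the cache-block burst to only the words the workload needs, and partial row activation whose energy scales with the number of activated regions. The class names, the 8-word block size, and the per-mat energy value are illustrative assumptions, not values from the paper.

# A minimal sketch (assumptions only) of the fine-grained transfer idea:
# the memory controller tracks which words ("sectors") of a cache block a
# workload actually touches, requests only those words, and activates only
# the corresponding row regions.

from dataclasses import dataclass

WORDS_PER_CACHE_BLOCK = 8          # e.g., 64-byte block, 8-byte words (assumption)

@dataclass
class SectoredRequest:
    row: int
    column: int
    sector_mask: int               # one bit per word the workload needs

def burst_cycles(req: SectoredRequest) -> int:
    """Variable-length burst: one cycle per requested word instead of a
    fixed burst covering the whole cache block."""
    return bin(req.sector_mask & ((1 << WORDS_PER_CACHE_BLOCK) - 1)).count("1")

def activation_energy(req: SectoredRequest,
                      energy_per_region_pj: float = 50.0) -> float:
    """Partial row activation: assume each requested word maps to one
    independently activatable row region, so activation energy scales with
    the number of set bits. The per-region energy is a placeholder value."""
    return burst_cycles(req) * energy_per_region_pj

# Example: the workload touches only 2 of the 8 words in a cache block.
req = SectoredRequest(row=0x1A2B, column=0x40, sector_mask=0b0000_0101)
print(burst_cycles(req))           # 2 cycles instead of 8
print(activation_energy(req))      # 100.0 pJ instead of 400.0 pJ

In this toy example, only 2 of the 8 words are needed, so the sketch models a 4x shorter burst and proportionally lower activation energy; the actual design, as the abstract notes, additionally requires small modifications to the memory controller and the DRAM chip's array access circuitry.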
