
DR-CGRA: Supporting Loop-Carried Dependencies in CGRAs Without Spilling Intermediate Values (2405.17365v1)

Published 27 May 2024 in cs.AR

Abstract: Coarse-grained reconfigurable architectures (CGRAs) are gaining traction thanks to their performance and power efficiency. Utilizing CGRAs to accelerate the execution of tight loops holds great potential for achieving significant overall performance gains, as a substantial portion of program execution time is dedicated to tight loops. But loop parallelization using CGRAs is challenging because of loop-carried data dependencies. Traditionally, loop-carried dependencies are handled by spilling dependent values out of the reconfigurable array to a memory medium and then feeding them back into the grid. Spilling the values and feeding them back into the grid imposes additional latencies and logic that impede performance and limit parallelism. In this paper, we present the Dependency Resolved CGRA (DR-CGRA) architecture, which is designed to accelerate the execution of tight loops. DR-CGRA, which is based on a massively multithreaded CGRA, runs each iteration as a separate CGRA thread and maps loop-carried data dependencies to inter-thread communication inside the grid. This design ensures the passage of data-dependent values across loop iterations without spilling them out of the grid. The proposed DR-CGRA architecture was evaluated on various SPEC CPU 2017 benchmarks. The results demonstrated significant performance improvements, with an average speedup ranging from 2.1 to 4.5 and an overall average of 3.1 when compared to a state-of-the-art CGRA architecture.

