
Field-Programmable Gate Array Architecture for Deep Learning: Survey & Future Directions (2404.10076v1)

Published 15 Apr 2024 in cs.AR

Abstract: Deep learning (DL) is becoming the cornerstone of numerous applications both in datacenters and at the edge. Specialized hardware is often necessary to meet the performance requirements of state-of-the-art DL models, but the rapid pace of change in DL models and the wide variety of systems integrating DL make it impossible to create custom computer chips for all but the largest markets. Field-programmable gate arrays (FPGAs) present a unique blend of reprogrammability and direct hardware execution that make them suitable for accelerating DL inference. They offer the ability to customize processing pipelines and memory hierarchies to achieve lower latency and higher energy efficiency compared to general-purpose CPUs and GPUs, at a fraction of the development time and cost of custom chips. Their diverse high-speed IOs also enable directly interfacing the FPGA to the network and/or a variety of external sensors, making them suitable for both datacenter and edge use cases. As DL has become an ever more important workload, FPGA architectures are evolving to enable higher DL performance. In this article, we survey both academic and industrial FPGA architecture enhancements for DL. First, we give a brief introduction on the basics of FPGA architecture and how its components lead to strengths and weaknesses for DL applications. Next, we discuss different styles of DL inference accelerators on FPGA, ranging from model-specific dataflow styles to software-programmable overlay styles. We survey DL-specific enhancements to traditional FPGA building blocks such as logic blocks, arithmetic circuitry, and on-chip memories, as well as new in-fabric DL-specialized blocks for accelerating tensor computations. Finally, we discuss hybrid devices that combine processors and coarse-grained accelerator blocks with FPGA-like interconnect and networks-on-chip, and highlight promising future research directions.

FPGA Architecture for Deep Learning Acceleration

The paper "Field-Programmable Gate Array Architecture for Deep Learning: Survey and Future Directions" offers a comprehensive analysis of the evolving role of FPGAs in the domain of deep learning (DL) acceleration. Recognizing the increasing computational demands of DL workloads, the authors delve into how FPGAs can fulfill these requirements with their unique blend of flexibility, performance, and adaptability.

Summary of FPGA Advantages for DL

FPGA devices possess several intrinsic strengths that make them suitable for DL tasks:

  1. Custom Precision and Dataflow: FPGAs can implement low-precision arithmetic at exactly the bit widths DL inference tolerates, yielding area and power savings; CPUs and GPUs, by contrast, are limited to a fixed set of precision formats (see the quantization sketch after this list).
  2. Spatial Architecture: The spatial nature of FPGAs enables direct dataflow between computing elements, significantly reducing latency and benefiting applications with tight latency constraints.
  3. Reconfigurability: The ability to reconfigure the FPGA for specific DL models offers an edge over ASICs by adapting to newly developed models and enabling rapid deployment.
  4. Diverse IO Capabilities: FPGAs support a variety of interfaces, allowing integration with different sensors and peripherals, which is advantageous for edge DL applications.
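
To make the custom-precision advantage concrete, here is a minimal Python sketch (purely illustrative, not from the paper) of symmetric int8 quantization, the kind of reduced-precision arithmetic an FPGA datapath can implement natively at exactly the chosen bit width:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor int8 quantization, a common DL inference scheme.

    On an FPGA, the resulting 8-bit multiplies can map to narrow hardware
    multipliers; CPUs/GPUs are limited to the precision formats their
    fixed datapaths provide.
    """
    scale = np.abs(weights).max() / 127.0  # map the largest magnitude to 127
    q = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Usage: quantize a small weight tensor and check the reconstruction error.
w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, s)).max())
```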

Design Styles for DL Acceleration

The paper explores various design methodologies for implementing DL accelerators on FPGAs:

  • Custom Hardware Generation: This approach automatically generates model-specific hardware, as demonstrated by tools like HPIPE. Such tools produce bespoke pipeline architectures tailored to individual models, offering performance improvements but requiring longer synthesis times whenever the model changes.
  • FPGA Overlays: These software-programmable architectures, such as the NPU overlay, deliver high performance for batch-1 inference by abstracting hardware details, enabling flexible deployment across multiple DL workloads via recompilation alone (the contrast between the two styles is sketched below).
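
The distinction can be sketched in Python (the names and structures here are hypothetical illustrations of the concepts, not HPIPE's or any vendor NPU's actual interfaces): a dataflow design bakes one model's layers into dedicated pipeline stages, while an overlay is fixed hardware that interprets an instruction stream, so new models need only a recompile, not a new bitstream.

```python
# Dataflow style: hardware is generated per model; each layer becomes its own
# physical pipeline stage, so changing the model means re-synthesizing the FPGA.
def generate_dataflow_pipeline(model_layers):
    return [f"stage_{i}:{layer}" for i, layer in enumerate(model_layers)]

# Overlay style: hardware is fixed; the model is compiled to instructions that
# a software-programmable engine executes, so new models need no re-synthesis.
def compile_to_overlay_instructions(model_layers):
    program = []
    for layer in model_layers:
        program.append(("LOAD_WEIGHTS", layer))
        program.append(("MATMUL", layer))
        program.append(("ACTIVATION", layer))
    return program

model = ["conv1", "conv2", "fc"]
print(generate_dataflow_pipeline(model))       # model-specific hardware
print(compile_to_overlay_instructions(model))  # program for fixed hardware
```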

FPGA Architecture Enhancements

Several architecture modifications have been researched and proposed to optimize FPGAs for DL:

  1. Logic Blocks: Enhancements in logic block design can increase the density of low-precision arithmetic operations, a critical requirement for efficient DL inference.
  2. DSP Blocks: Augmenting digital signal processing (DSP) blocks with support for lower-precision operations can significantly improve multiply-accumulate throughput (see the operand-packing sketch after this list).
  3. Block RAMs (BRAMs): By integrating compute capabilities within BRAMs, data movement can be minimized, conserving power and routing resources.
  4. Interposer Technology: Advanced packaging techniques enable integration of multiple dice, crucial for constructing larger, more capable FPGA systems for DL.
  5. Networks-on-Chip and AI Engines: Emerging architectures like AMD’s Versal incorporate AI engines connected by a network-on-chip (NoC), suiting them for a wide range of DL applications by combining FPGA flexibility with efficient coarse-grained accelerators.
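
As one concrete flavor of the DSP enhancements above, a single wide hard multiplier can produce two narrow products at once by packing two operands into one word, similar in spirit to the INT8-extraction techniques the paper surveys. The following is a minimal sketch assuming unsigned 8-bit operands that share one input; real proposals also handle signed values and accumulation:

```python
def packed_dual_multiply(a: int, b: int, c: int):
    """Compute a*b and a*c with ONE wide multiplication.

    All operands are unsigned 8-bit, so each product fits in 16 bits.
    Packing c into the upper half of the second operand keeps the two
    partial products from overlapping, mimicking how one wide FPGA DSP
    multiplier can yield two low-precision products per cycle.
    """
    assert 0 <= a < 256 and 0 <= b < 256 and 0 <= c < 256
    packed = (c << 16) | b   # one wide operand holding both b and c
    product = a * packed     # single wide multiply: a*c*2^16 + a*b
    ab = product & 0xFFFF    # low 16 bits hold a*b (always < 2^16)
    ac = product >> 16       # upper bits hold a*c, with no overlap
    return ab, ac

# Usage: verify against direct multiplication.
assert packed_dual_multiply(200, 123, 45) == (200 * 123, 200 * 45)
```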

Implications and Future Directions

The paper highlights several promising avenues for enhancing FPGAs in DL contexts, including deeper integration of AI-specific hard blocks and new design paradigms that mix reconfigurable logic with fixed-function, ASIC-like elements. Future architectures may also leverage 2.5D/3D integration to further improve performance and energy efficiency.

In summary, the paper underscores that with strategic architectural innovations, FPGAs hold substantial promise for efficiently accelerating DL workloads, spanning from large-scale datacenter applications to resource-constrained edge environments.

Authors (3)
  1. Andrew Boutros
  2. Aman Arora
  3. Vaughn Betz