HLSTransform: Energy-Efficient Llama 2 Inference on FPGAs Via High Level Synthesis (2405.00738v1)

Published 29 Apr 2024 in cs.AR, cs.AI, and cs.LG

Abstract: Graphics Processing Units (GPUs) have become the leading hardware accelerator for deep learning applications and are used widely in training and inference of transformers; transformers have achieved state-of-the-art performance in many areas of machine learning and are especially used in most modern LLMs. However, GPUs require large amounts of energy, which poses environmental concerns, demands high operational costs, and causes GPUs to be unsuitable for edge computing. We develop an accelerator for transformers, namely, Llama 2, an open-source state-of-the-art LLM, using high level synthesis (HLS) on Field Programmable Gate Arrays (FPGAs). HLS allows us to rapidly prototype FPGA designs without writing code at the register-transfer level (RTL). We name our method HLSTransform, and the FPGA designs we synthesize with HLS achieve up to a 12.75x reduction and 8.25x reduction in energy used per token on the Xilinx Virtex UltraScale+ VU9P FPGA compared to an Intel Xeon Broadwell E5-2686 v4 CPU and NVIDIA RTX 3090 GPU respectively, while increasing inference speeds by up to 2.46x compared to CPU and maintaining 0.53x the speed of an RTX 3090 GPU despite the GPU's 4 times higher base clock rate. With the lack of existing open-source FPGA accelerators for transformers, we open-source our code and document our steps for synthesis. We hope this work will serve as a step in democratizing the use of FPGAs in transformer inference and inspire research into energy-efficient inference methods as a whole. The code can be found on https://github.com/HLSTransform/submission.


Harnessing FPGAs for Efficient AI Inference with HLSTransform

The rapidly evolving landscape of AI demands hardware that not only accelerates computation but also optimizes energy use. CPUs and GPUs have traditionally filled this role, but as the push for sustainability intensifies, alternatives like Field Programmable Gate Arrays (FPGAs) are gaining attention. A paper from Cornell University introduces HLSTransform, a method that uses high-level synthesis (HLS) on FPGAs to run inference efficiently for Llama 2, a popular open-source LLM.

Understanding the Shift from GPUs to FPGAs

What's the Problem with GPUs?

GPUs, although powerful and widely used in machine learning, draw significant energy. The environmental impact is substantial: training models like Llama 2 is associated with carbon footprints on the order of hundreds of tons of carbon dioxide. The same energy demands drive up operational costs and make GPUs a poor fit for edge computing.

Why Consider FPGAs?

FPGAs are known for their reconfigurability and energy efficiency, consuming considerably less power than GPUs. Because they can be programmed for specific tasks, FPGAs offer a fresh avenue for building environmentally friendly AI systems. Traditionally, however, programming FPGAs has posed a high barrier to entry because it requires writing hardware descriptions at the register-transfer level (RTL).

HLSTransform: Bridging the Complexity with HLS

The innovative approach taken in HLSTransform uses HLS to ease the FPGA programming challenge, allowing developers to describe hardware with higher-level programming languages that are easier and quicker to prototype with.
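To give a sense of what "describing hardware in a higher-level language" can look like, here is a minimal sketch of a matrix-vector product written in HLS-style C++. It is not taken from the HLSTransform code: the function name `matvec`, the dimensions `ROWS` and `COLS`, and the choice of directives are all assumptions made for illustration. The `#pragma HLS` lines follow the style of AMD/Xilinx Vitis HLS directives; an ordinary C++ compiler ignores them (possibly with a warning), so the same function can also be checked on a CPU.

```cpp
#include <cstddef>

// Hypothetical dimensions, chosen only for illustration.
constexpr std::size_t ROWS = 4;
constexpr std::size_t COLS = 8;

// Computes y = W * x, with W stored row-major in a flat array.
// The pragmas are hints for an HLS tool: they ask it to pipeline the
// row loop and to unroll the inner multiply-accumulate loop so the
// products can be computed in parallel hardware. They do not change
// the result of the computation.
void matvec(const float w[ROWS * COLS], const float x[COLS], float y[ROWS]) {
    for (std::size_t i = 0; i < ROWS; ++i) {
#pragma HLS PIPELINE II=1
        float acc = 0.0f;
        for (std::size_t j = 0; j < COLS; ++j) {
#pragma HLS UNROLL
            acc += w[i * COLS + j] * x[j];
        }
        y[i] = acc;
    }
}
```

The appeal of this style is that the loop nest stays readable as ordinary C++, while decisions about parallelism are expressed as directives that can be changed and re-synthesized quickly.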

Key Outcomes with HLSTransform

On a Xilinx Virtex UltraScale+ VU9P FPGA, the synthesized designs:

Reduce energy per token by up to 12.75x compared to an Intel Xeon Broadwell E5-2686 v4 CPU and by up to 8.25x compared to an NVIDIA RTX 3090 GPU.
Increase inference speed by up to 2.46x compared to the CPU.
Reach 0.53x the speed of the RTX 3090 GPU, despite the GPU's 4 times higher base clock rate.

These results point to substantially lower energy use per generated token, at the cost of lower throughput than a high-end GPU, and support the case for a more sustainable model of computing.

Project Outcomes and Contributions

Open-Sourcing for Broader Impact

Recognizing the lack of open-source FPGA accelerators for transformers, the team has open-sourced their code and documented their synthesis steps. This paves the way for wider adoption of, and research into, FPGAs as a viable platform for LLM inference.

Practical Implications

In practical terms, HLSTransform opens up opportunities for AI applications where either data sensitivity or connectivity issues make cloud-based computations infeasible. The deployment of FPGAs could be particularly transformative in edge computing scenarios where power availability and data processing needs must be balanced efficiently.

The Road Ahead

Future Enhancements

While the results are promising, the size of the LLM that can be handled is limited, primarily by the FPGA's memory constraints. Future research could explore more advanced quantization techniques or multi-FPGA systems to handle larger models effectively.
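To illustrate why quantization helps with memory limits, here is a minimal sketch of symmetric 8-bit quantization of a group of weights. This is a generic technique shown under stated assumptions, not the scheme the authors use: the struct `QuantizedGroup` and the functions `quantize_group` and `dequantize` are hypothetical names. Each 32-bit float is stored as one signed byte plus a shared per-group scale, cutting weight storage to roughly a quarter.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// One group of weights stored as int8 values with a shared scale.
struct QuantizedGroup {
    float scale;
    std::vector<std::int8_t> values;
};

// Symmetric quantization: map [-max_abs, max_abs] onto [-127, 127].
QuantizedGroup quantize_group(const std::vector<float>& weights) {
    float max_abs = 0.0f;
    for (float v : weights) {
        max_abs = std::max(max_abs, std::fabs(v));
    }
    // Avoid dividing by zero when every weight in the group is zero.
    const float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;

    QuantizedGroup q{scale, {}};
    q.values.reserve(weights.size());
    for (float v : weights) {
        q.values.push_back(static_cast<std::int8_t>(std::lround(v / scale)));
    }
    return q;
}

// Recover an approximation of the original weight at index i.
float dequantize(const QuantizedGroup& q, std::size_t i) {
    return q.scale * static_cast<float>(q.values[i]);
}
```

The rounding error per weight is at most about half of `scale`, so smaller groups (and therefore tighter scales) trade a little extra storage for better accuracy.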

Broadening the Use Cases

Considering how the current setup focuses on single-instance inference, exploring the efficacy of HLSTransform in batch processing scenarios might fill another critical gap, enhancing throughput for large-scale AI tasks without compromising on the power efficiency front.

In conclusion, while GPUs currently dominate AI hardware acceleration, work on FPGAs with methods like HLSTransform points to a promising alternative: one that outpaces the tested CPU, runs at roughly half the speed of the tested GPU, and uses far less energy per token than either. This could mark an important shift towards more sustainable AI practices.