Faster Inference of LLMs using FP8 on the Intel Gaudi (2503.09975v3)

Published 13 Mar 2025 in cs.AR

Abstract: Low-precision data types are essential in modern neural networks during both training and inference as they enhance throughput and computational capacity by better exploiting available hardware resources. Despite the incorporation of FP8 in commercially available neural network accelerators, a comprehensive exposition of its underlying mechanisms, along with rigorous performance and accuracy evaluations, is still lacking. In this work, we contribute in three significant ways. First, we analyze the implementation details and quantization options associated with FP8 for inference on the Intel Gaudi AI accelerator. Second, we empirically quantify the throughput improvements afforded by the use of FP8 at both the operator level and in end-to-end scenarios. Third, we assess the accuracy impact of various FP8 quantization methods. Our experimental results indicate that the Intel Gaudi 2 accelerator consistently achieves high computational unit utilization, frequently exceeding 90% MFU, while incurring an accuracy degradation of less than 1%.

Authors (12)
  1. Joonhyung Lee (9 papers)
  2. Shmulik Markovich-Golan (4 papers)
  3. Daniel Ohayon (3 papers)
  4. Yair Hanani (6 papers)
  5. Gunho Park (5 papers)
  6. Byeongwook Kim (21 papers)
  7. Asaf Karnieli (5 papers)
  8. Uri Livne (1 paper)
  9. Haihao Shen (11 papers)
  10. Tai Huang (1 paper)
  11. Se Jung Kwon (26 papers)
  12. Dongsoo Lee (30 papers)

Summary

  • The paper investigates implementing and evaluating FP8 numerical data types on Intel Gaudi accelerators to significantly enhance LLM inference throughput and efficiency.
  • Experimental results show Intel Gaudi 2 achieves over 90% Model FLOPs Utilization (MFU) with less than a 1% accuracy reduction using FP8.
  • The FP8 format enables handling large models like Llama v3.1 70B on single Gaudi 2 accelerators due to substantial memory savings.

Faster Inference of LLMs using FP8 on the Intel Gaudi

The research paper titled "Faster Inference of LLMs using FP8 on the Intel Gaudi" investigates methods to enhance the throughput and computational efficiency of LLMs using FP8 numerical data types on Intel’s Gaudi accelerators. The FP8 data format, with its reduced bit-width, offers distinct advantages in minimizing memory requirements and facilitating higher computational throughput compared to traditional 16-bit formats. This paper provides a comprehensive examination of the implementation details of FP8 and evaluates its performance across various model sizes and tasks.
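
For context, the two OCP FP8 variants commonly used for inference, E4M3 and E5M2, trade mantissa bits for exponent range. The short sketch below is not taken from the paper; it simply queries the numeric properties of these formats alongside BF16, assuming a recent PyTorch build (>= 2.1) that exposes the FP8 dtypes.

```python
# Illustrative only: compare the numeric properties of FP8 E4M3, FP8 E5M2,
# and BF16. Assumes a PyTorch build that exposes the float8 dtypes.
import torch

for dtype in (torch.float8_e4m3fn, torch.float8_e5m2, torch.bfloat16):
    info = torch.finfo(dtype)
    # bits per element, largest finite value, and machine epsilon
    print(f"{str(dtype):25s} bits={info.bits:2d}  max={info.max:.3e}  eps={info.eps}")
```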

The authors contribute to the field by providing a detailed analysis of the FP8 implementation on Intel Gaudi AI accelerators, including the strategy for scaled matrix multiplication and quantization. Furthermore, they measure the throughput improvements from FP8 and evaluate its impact on inference accuracy. The experimental outcomes reveal that the Gaudi 2 accelerator can achieve over 90% Model FLOPs Utilization (MFU) with an accuracy reduction of less than 1%, highlighting the potential of FP8 to balance efficiency and accuracy.
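
The scaled-matmul strategy can be pictured as per-tensor max-abs quantization of both operands followed by a low-precision GEMM whose output is rescaled by the product of the two scales. The sketch below is a minimal emulation under those assumptions, not the paper's or Gaudi's actual kernel; on hardware the matrix engine would consume the FP8 operands and scales directly.

```python
# A minimal sketch of per-tensor scaled FP8 (E4M3) matrix multiplication.
# Requires a recent PyTorch build with float8 dtype support. FP8_E4M3_MAX = 448
# is the largest finite E4M3 value.
import torch

FP8_E4M3_MAX = 448.0

def quantize_fp8(x: torch.Tensor):
    """Per-tensor symmetric quantization to FP8 E4M3; returns values and scale."""
    scale = x.abs().max() / FP8_E4M3_MAX                   # max-abs scale
    x_fp8 = (x / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return x_fp8, scale

def scaled_matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    a_fp8, a_scale = quantize_fp8(a)
    b_fp8, b_scale = quantize_fp8(b)
    # Emulate the FP8 GEMM in higher precision; a real accelerator would
    # multiply the FP8 operands natively and apply the scales in hardware.
    out = a_fp8.to(torch.bfloat16) @ b_fp8.to(torch.bfloat16)
    return out * (a_scale * b_scale)                        # fold scales back in

a = torch.randn(128, 256, dtype=torch.bfloat16)
b = torch.randn(256, 64, dtype=torch.bfloat16)
print((scaled_matmul(a, b) - a @ b).abs().max())            # small quantization error
```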

The paper categorizes quantization scaling factors as static or dynamic, with empirical evidence favoring carefully tuned configurations. The findings also show that the accuracy impact depends on the evaluation task, and that larger models are more resilient to quantization, suggesting inherent redundancies that preserve accuracy even at reduced precision.
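
In this terminology, a static scale is fixed offline from calibration data, while a dynamic scale is recomputed from each tensor at runtime, which is typically more accurate but adds overhead. The sketch below illustrates the distinction with hypothetical helper names; it is not the paper's implementation.

```python
# Sketch: static vs. dynamic scale selection for FP8 (E4M3) quantization.
import torch

FP8_E4M3_MAX = 448.0

def static_scale(calibration_batches) -> float:
    """Offline: one fixed scale per tensor, derived from calibration statistics."""
    observed_max = max(batch.abs().max().item() for batch in calibration_batches)
    return observed_max / FP8_E4M3_MAX

def dynamic_scale(x: torch.Tensor) -> float:
    """Online: recomputed from the current tensor on every forward pass."""
    return (x.abs().max() / FP8_E4M3_MAX).item()

calib = [torch.randn(8, 1024) for _ in range(16)]
s_static = static_scale(calib)                       # reused for all future inputs
s_dynamic = dynamic_scale(torch.randn(8, 1024))      # per-input, extra runtime cost
print(s_static, s_dynamic)
```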

From a practical perspective, the application of FP8 to LLMs on Gaudi hardware demonstrates significant improvements in decoding and prefill throughput across a spectrum of sequence lengths. Notably, the memory savings afforded by FP8 allow large models, such as Llama v3.1 70B, to run on a single Gaudi 2 accelerator, underscoring tangible benefits in real-world deployments.
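
A rough weight-only calculation (my own arithmetic, assuming roughly 70B parameters and the advertised 96 GB of HBM on Gaudi 2, and ignoring KV cache and activations) illustrates why FP8 makes the single-device deployment plausible:

```python
# Back-of-the-envelope check, not figures from the paper: weight-only memory
# footprint of a ~70B-parameter model in BF16 (2 bytes/param) vs. FP8 (1 byte/param).
params = 70e9
print(f"BF16 weights: {params * 2 / 1e9:.0f} GB")   # ~140 GB -> exceeds one device
print(f"FP8  weights: {params * 1 / 1e9:.0f} GB")   # ~70 GB  -> fits in ~96 GB HBM
```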

Future research could explore refinements of FP8 configurations, such as stochastic rounding and stochastic quantization techniques, to further reduce accuracy loss while maximizing hardware utilization. More broadly, the results bear on the deployment and scalability of AI systems, particularly in data-intensive applications where computational efficiency is crucial.

Overall, this paper provides robust evidence supporting the integration of FP8 into the neural computing ecosystem, illustrating its advantages on Intel’s Gaudi hardware. The methodologies and results presented here serve as a valuable reference for further work on exploiting FP8 in artificial intelligence and machine learning workloads.
