An Inquiry into Datacenter TCO for LLM Inference with FP8 (2502.01070v3)
Abstract: As LLMs continue to scale, their inference demands present significant challenges, particularly due to the high power consumption of AI accelerators in datacenters. These facilities require specialized cooling and power management systems, substantially increasing the total cost of ownership (TCO) for cloud service providers (CSPs). In this work, we analyze the computational characteristics and constraints of LLM inference from a TCO perspective, focusing on two representative accelerators: the Intel Gaudi 2 and the NVIDIA H100. We present a generalizable framework that enables CSPs to compare and select AI accelerators according to diverse operational requirements. Using this framework, we analyze the impact of FP8 precision and LLM inference workload characteristics as key factors influencing TCO. We investigate FP8 quantization, which is gaining adoption in LLM training, as a technique to improve inference throughput while maintaining cost efficiency. Furthermore, our analysis of LLM inference workloads reveals that performance on thin GEMMs, which dominate the decode phase, can have a greater impact on TCO than theoretical hardware peak performance. By studying the interaction between power consumption, quantization strategies, and hardware architecture, we offer insights that support informed deployment decisions and guide future accelerator designs toward improved TCO for LLM inference.
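To make the kind of comparison the abstract describes concrete, the sketch below shows one minimal way a TCO-per-token model can be structured: amortized hardware cost plus facility electricity (scaled by a PUE factor), divided by delivered inference throughput. This is an illustrative assumption on my part, not the paper's actual framework, and every figure in it (device names, prices, power draw, PUE, throughput, and the FP16-to-FP8 throughput gain) is a hypothetical placeholder rather than a measured result.

```python
# Minimal illustrative TCO-per-token sketch (hypothetical numbers, not the
# paper's framework): amortized capex + electricity (scaled by datacenter PUE),
# divided by inference throughput.

from dataclasses import dataclass


@dataclass
class Accelerator:
    name: str
    capex_usd: float        # purchase price per accelerator (placeholder)
    power_kw: float         # average board power during inference (placeholder)
    tokens_per_sec: float   # decode throughput for a given model (placeholder)


def tco_per_million_tokens(acc: Accelerator,
                           lifetime_years: float = 4.0,
                           electricity_usd_per_kwh: float = 0.10,
                           pue: float = 1.3) -> float:
    """Dollars per 1M generated tokens: amortized hardware cost + facility power."""
    lifetime_seconds = lifetime_years * 365 * 24 * 3600
    capex_per_sec = acc.capex_usd / lifetime_seconds
    power_cost_per_sec = acc.power_kw * pue * electricity_usd_per_kwh / 3600
    usd_per_token = (capex_per_sec + power_cost_per_sec) / acc.tokens_per_sec
    return usd_per_token * 1e6


# Hypothetical FP16 vs FP8 operating points on the same device: higher FP8
# throughput (with roughly similar power) lowers the cost per token.
for acc in [
    Accelerator("device-A FP16", capex_usd=25_000, power_kw=0.60, tokens_per_sec=2_000),
    Accelerator("device-A FP8",  capex_usd=25_000, power_kw=0.62, tokens_per_sec=3_200),
]:
    print(f"{acc.name}: ${tco_per_million_tokens(acc):.2f} per 1M tokens")
```

Because throughput sits in the denominator, the decode-phase behavior the abstract highlights (sustained performance on thin GEMMs, not peak FLOPS) is what actually drives the cost per token in a model of this shape.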
- Jiwoo Kim
- Joonhyung Lee
- Gunho Park
- Byeongwook Kim
- Se Jung Kwon
- Dongsoo Lee
- Youngjoo Lee