
From Words to Watts: Benchmarking the Energy Costs of Large Language Model Inference (2310.03003v1)

Published 4 Oct 2023 in cs.CL and cs.DC

Abstract: LLMs have exploded in popularity due to their new generative capabilities that go far beyond prior state-of-the-art. These technologies are increasingly being leveraged in various domains such as law, finance, and medicine. However, these models carry significant computational challenges, especially the compute and energy costs required for inference. Inference energy costs already receive less attention than the energy costs of training LLMs -- despite how often these large models are called on to conduct inference in reality (e.g., ChatGPT). As these state-of-the-art LLMs see increasing usage and deployment in various domains, a better understanding of their resource utilization is crucial for cost-savings, scaling performance, efficient hardware usage, and optimal inference strategies. In this paper, we describe experiments conducted to study the computational and energy utilization of inference with LLMs. We benchmark and conduct a preliminary analysis of the inference performance and inference energy costs of different sizes of LLaMA -- a recent state-of-the-art LLM -- developed by Meta AI on two generations of popular GPUs (NVIDIA V100 & A100) and two datasets (Alpaca and GSM8K) to reflect the diverse set of tasks/benchmarks for LLMs in research and practice. We present the results of multi-node, multi-GPU inference using model sharding across up to 32 GPUs. To our knowledge, our work is one of the first to study LLM inference performance from the perspective of computational and energy resources at this scale.

Benchmarking the Energy Costs of LLM Inference

The paper "From Words to Watts: Benchmarking the Energy Costs of LLM Inference" presents an empirical evaluation of the computational and energy requirements for inference using LLMs. Recognizing the growing deployment of these models in sectors such as law, finance, and medicine, the paper focuses on the lesser-explored domain of inference energy costs. Although the computational demand and energy consumption involved in training LLMs have previously been investigated, this paper aims to fill the gap concerning inference, which occurs more frequently in practical applications.

Experimental Setup and Models

Experiments were conducted using different sizes of the LLaMA model developed by Meta AI, with comprehensive benchmarking on both NVIDIA V100 and A100 GPUs. The paper utilizes two datasets, Alpaca and GSM8K, to assess performance across a diverse set of tasks, reflecting real-world applications of LLMs. The experiments include distributed computing scenarios with up to 32 GPUs to analyze LLaMA 65B—one of the largest available variants of this model.
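
The paper itself ran multi-node, multi-GPU inference with model sharding; the following is only a loose single-node sketch of a sharded LLaMA inference setup, using the Hugging Face transformers device_map mechanism rather than the authors' multi-node pipeline, with a hypothetical local checkpoint path.

```python
# Single-node sketch of a sharded LLaMA inference setup (illustrative only;
# the paper used multi-node model sharding, not this exact mechanism).
# Assumes transformers and accelerate are installed and a local checkpoint exists.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "path/to/llama-65b"  # hypothetical local checkpoint path

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
tokenizer.padding_side = "left"   # left-pad for decoder-only generation

model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.float16,    # half precision, typical for V100/A100 inference
    device_map="auto",            # shard layers across all visible GPUs
)
model.eval()
```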

Overview of Findings

Inference Performance

The inference performance was evaluated in terms of latency metrics such as words, tokens, and responses per second; a minimal throughput-measurement sketch follows the list below. Key findings indicate that:

  1. GPU Comparison: The A100 GPU outperforms the V100 GPU, particularly for the smaller model variants (7B and 13B), where a notable reduction in inference latency is observed. Specifically, a twofold improvement for the 7B model and a 1.25-fold improvement for the 13B model were found when using the A100 GPU compared to the V100 GPU.
  2. Scaling to Larger Models: Improvements in inference performance are less pronounced for the 65B LLaMA model. This diminished improvement is likely due to the increased communication overhead when scaling across multiple nodes, as the larger models necessitate more GPUs.
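
As an illustration of how tokens-per-second figures like these can be gathered, here is a minimal sketch building on the setup above; it is not the paper's benchmarking harness, and the batch size and generation length are arbitrary.

```python
# Illustrative tokens-per-second measurement for one batch of prompts.
# Assumes `model` and `tokenizer` from the setup sketch above.
import time
import torch

prompts = ["Explain the concept of compound interest."] * 8  # arbitrary batch of 8
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

torch.cuda.synchronize()
start = time.time()
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
torch.cuda.synchronize()
elapsed = time.time() - start

# Approximate count of generated tokens (sequences that stop early are padded).
new_tokens = (outputs.shape[1] - inputs["input_ids"].shape[1]) * outputs.shape[0]
print(f"{new_tokens / elapsed:.1f} generated tokens/sec over {elapsed:.1f} s")
```

Responses per second follows directly by dividing the batch size by the elapsed time.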

Energy Consumption

The paper explores different dimensions of energy consumption, mainly focusing on energy per second, energy per token, and energy per response; a sketch of how such measurements can be collected follows the list below. The results indicate that:

  1. Impact of Shards: Increasing the number of shards (i.e., GPUs) consistently leads to higher energy consumption, both in terms of energy per second and energy per token. This suggests a clear trade-off between computational latency and energy efficiency.
  2. Batch Size and Generation Length: Energy consumption shows slight variations with different batch sizes, but increasing the maximum generation length (from 512 to 1024 tokens) does not significantly alter energy metrics. However, a "sweet spot" was observed for certain configurations, particularly for the GSM8K dataset at 16 shards.
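
As a rough sketch of how energy-per-token figures of this kind can be collected (illustrative only, and not necessarily the authors' instrumentation), GPU power draw can be sampled through NVML while generation runs and then integrated over time:

```python
# Illustrative energy-per-token measurement by sampling GPU power via NVML
# in a background thread (a sketch, not the authors' instrumentation).
# Assumes `model`, `tokenizer`, and `inputs` from the earlier sketches,
# and that pynvml (nvidia-ml-py) is installed.
import threading
import time

import pynvml
import torch

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

samples = []              # (timestamp, total watts across all GPUs)
stop = threading.Event()

def sample_power(period_s=0.1):
    while not stop.is_set():
        watts = sum(pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0 for h in handles)
        samples.append((time.time(), watts))
        time.sleep(period_s)

sampler = threading.Thread(target=sample_power, daemon=True)
sampler.start()

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
torch.cuda.synchronize()
stop.set()
sampler.join()

# Integrate power over time (trapezoidal rule) to get joules.
energy_j = sum((t1 - t0) * (p0 + p1) / 2.0
               for (t0, p0), (t1, p1) in zip(samples, samples[1:]))
new_tokens = (outputs.shape[1] - inputs["input_ids"].shape[1]) * outputs.shape[0]
print(f"~{energy_j:.0f} J total, ~{energy_j / max(new_tokens, 1):.2f} J per generated token")
```

Energy per response can be obtained analogously by dividing the integrated energy by the number of responses completed in the batch.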

Power Capping

The paper also evaluates the effect of GPU power capping on inference energy efficiency; a sketch of applying such a cap appears after the list below. Results show that:

  1. Energy Savings: Implementing a power cap of 175W on A100 GPUs led to only a 6.7% increase in inference time while yielding a significant 23.21% reduction in total energy consumed. However, further reduction to a 150W power cap resulted in a 19.49% increase in inference time, indicating diminishing returns on energy savings with more aggressive power capping.
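
Power caps of the kind evaluated above are normally applied with `nvidia-smi -pl <watts>`; the same limit can also be set programmatically through NVML, as in this minimal sketch (it requires administrative privileges, and the allowed range depends on the GPU board):

```python
# Minimal sketch: apply a GPU power cap via NVML (equivalent to
# `nvidia-smi -pl 175`). Requires root/admin privileges; limits outside the
# board's supported range are rejected by the driver.
import pynvml

pynvml.nvmlInit()
CAP_WATTS = 175  # power cap explored in the paper for A100 GPUs

for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    lo_mw, hi_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
    target_mw = min(max(CAP_WATTS * 1000, lo_mw), hi_mw)  # clamp to supported range
    pynvml.nvmlDeviceSetPowerManagementLimit(handle, target_mw)
    print(f"GPU {i}: power limit set to {target_mw / 1000:.0f} W")
```

Since total energy is roughly average power draw multiplied by runtime, a cap that reduces average draw by more than it stretches runtime yields a net saving, which is consistent with the reported 23.21% energy reduction at only a 6.7% slowdown.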

Implications and Future Work

The findings of this paper hold notable practical implications:

  1. Resource Allocation: Efficient hardware resource allocation strategies can be developed to balance between inference latency and energy usage, particularly in power-constrained environments such as mobile and edge computing.
  2. Model Optimization: The findings stress the necessity for further research into model optimization techniques such as quantization, distillation, and sparsification to enhance energy efficiency without compromising performance.
  3. Infrastructure Planning: The demonstrated effectiveness of GPU power capping suggests potential strategies for reducing energy consumption at a datacenter scale, contributing to Greener AI initiatives.

Theoretical Implications

The paper broadens the understanding of LLM inference performance and energy dynamics. Future research may extend these insights to develop more energy-efficient algorithms and hardware optimization strategies for large-scale AI deployments. Additionally, the preliminary findings on optimal shard configurations for different types of workloads warrant further investigation to derive generalized principles applicable across various model architectures and deployment environments.

Conclusion

This paper provides a comprehensive benchmark of inference energy costs for LLaMA, encapsulating performance metrics across different configurations and computational environments. By highlighting the energy consumption characteristics and proposing potential optimization strategies, the paper contributes valuable insights that can inform both practical implementations and ongoing research in the field of large-scale AI. The examination of the impact of power capping also opens up avenues for future work aimed at reducing the environmental footprint of AI technologies.

Authors (10)
  1. Siddharth Samsi (74 papers)
  2. Dan Zhao (50 papers)
  3. Joseph McDonald (17 papers)
  4. Baolin Li (15 papers)
  5. Adam Michaleas (5 papers)
  6. Michael Jones (92 papers)
  7. William Bergeron (23 papers)
  8. Jeremy Kepner (141 papers)
  9. Devesh Tiwari (31 papers)
  10. Vijay Gadepally (131 papers)
Citations (73)