Energy Consumption of Code Small LLMs Serving with Runtime Engines and Execution Providers
The paper "Energy consumption of code small LLMs serving with runtime engines and execution providers" provides an insightful analysis of energy consumption, execution time, and computing-resource utilization in the context of serving Small LLMs (SLMs) for code generation. With the increasing adoption of deep learning, particularly LLMs, the need to address their substantial energy footprint has become paramount. Small LLMs (SLMs) are proposed as a viable alternative to reduce computational demands without significantly sacrificing performance.
The paper investigates how different runtime engines and execution providers impact the efficiency of inference, focusing on energy usage, execution time, and resource utilization. The research is motivated by the growing environmental concerns associated with the computational intensiveness of LMs and aims to provide actionable insights for software engineers concerned with these issues.
Methodology
The authors conducted a rigorous multi-stage experimental pipeline, using twelve code-generation SLMs to evaluate configurations of runtime engines and execution providers. The runtime engines examined are Torch, ONNX Runtime, OpenVINO Runtime, and Torch JIT, tested with CPU and CUDA execution providers. Each configuration's impact on energy consumption, execution time, and computing-resource utilization was analyzed on code generation tasks derived from the HumanEval benchmark.
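To make this configuration space concrete, the sketch below shows how a single code SLM could be loaded under each runtime engine with an explicit execution provider. It is illustrative only, not the authors' pipeline: the checkpoint name, ONNX/OpenVINO model files, and prompt are placeholders, and the export steps needed to produce the ONNX and OpenVINO artifacts are omitted.

```python
# Illustrative sketch of the engine/provider combinations studied; model and
# file names are hypothetical and export steps are omitted for brevity.
import torch
import onnxruntime as ort
import openvino as ov
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "example-org/code-slm-1b"   # placeholder SLM checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
prompt = tokenizer("def fibonacci(n):", return_tensors="pt")

# 1) Torch engine; the execution provider is the device the model runs on.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
device = "cuda" if torch.cuda.is_available() else "cpu"   # CUDA vs CPU provider
model.to(device).eval()
with torch.no_grad():
    model.generate(prompt["input_ids"].to(device), max_new_tokens=64)

# 2) Torch JIT engine: compile the forward pass into a static graph
#    (Hugging Face models are loaded with torchscript=True for tracing).
ts_model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torchscript=True).to(device).eval()
jit_model = torch.jit.trace(ts_model, (prompt["input_ids"].to(device),))

# 3) ONNX Runtime engine: the execution provider is selected explicitly.
ort_session = ort.InferenceSession(
    "code_slm.onnx",                                       # assumes a prior ONNX export
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# 4) OpenVINO Runtime engine, compiled for the CPU device.
compiled_model = ov.Core().compile_model("code_slm.xml", device_name="CPU")
```

Keeping the model, prompt, and decoding settings fixed across these setups is what allows energy and time differences to be attributed to the engine/provider choice rather than to the workload.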
Key Findings
- Energy Consumption: CUDA execution provider configurations significantly outperformed their CPU counterparts in energy efficiency. Notably, Torch paired with CUDA delivered the largest energy savings, with reductions of roughly 38% to 89% relative to the other configurations, highlighting how effectively CUDA harnesses the GPU for energy-efficient inference.
- Execution Time: CUDA configurations also achieved marked reductions in execution time, further underscoring their advantage over CPU-based setups. The Torch and CUDA pairing both minimized energy consumption and ran fastest, making it the most effective configuration overall.
- Resource Utilization: The paper emphasizes that choosing the right serving configuration matters for resource utilization. CUDA configurations used computing resources efficiently and avoided the CPU and RAM bottlenecks typical of CPU execution provider configurations, while among CPU-only setups ONNX Runtime was the most resource-efficient option. A minimal sketch of how such time, energy, and utilization measurements can be collected follows this list.
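As a rough illustration of how metrics like these can be collected, the sketch below wraps a single inference call with wall-clock timing, the NVML GPU energy counter, and coarse CPU/RAM snapshots from psutil. It is a minimal sketch under simplifying assumptions, not the paper's instrumentation: `run_inference` stands for any engine/provider configuration, the energy counter covers only the GPU (CPU-side energy would need RAPL or an external meter), and utilization is sampled once rather than continuously.

```python
# Minimal measurement sketch (not the paper's instrumentation): wall-clock time,
# GPU energy via NVML, and a coarse CPU/RAM utilization snapshot via psutil.
import time
import psutil
import pynvml

def profile(run_inference):
    """Profile one inference call; `run_inference` is any serving configuration."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    # Cumulative GPU energy counter in millijoules (supported on recent NVIDIA GPUs).
    energy_start_mj = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)
    t_start = time.perf_counter()

    run_inference()                       # e.g. one code generation request

    elapsed_s = time.perf_counter() - t_start
    energy_j = (pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)
                - energy_start_mj) / 1000.0

    # Single utilization snapshot; a real pipeline would sample continuously.
    cpu_pct = psutil.cpu_percent(interval=None)
    ram_pct = psutil.virtual_memory().percent

    pynvml.nvmlShutdown()
    return {"time_s": elapsed_s, "gpu_energy_j": energy_j,
            "cpu_percent": cpu_pct, "ram_percent": ram_pct}
```

Running such a profiler over repeated requests for each engine/provider pair is one straightforward way to reproduce comparisons of the kind the paper reports.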
Implications
The paper's findings have significant implications for both the deployment of ML systems and sustainable AI practices. By demonstrating substantial energy and time efficiency gains with CUDA configurations, particularly for code generation tasks using SLMs, the research provides clear evidence for practitioners to prioritize GPU-accelerated environments. This has the potential to lower operational costs and reduce the environmental impact of deploying LLMs at scale. Additionally, the analysis of runtime engines and execution providers contributes valuable insights into optimizing ML systems' infrastructure.
The paper concludes that while further research is required, the recommendations provided can assist software engineers in selecting configurations that enhance serving efficiency. This work also paves the way for future exploration into other areas of AI, suggesting that similar evaluations be extended to various ML tasks and different model scales beyond code generation. Developing efficient serving infrastructures and refining execution strategies remain crucial for achieving greener, yet performant, AI solutions.