
Generating Energy-Efficient Code via Large-Language Models -- Where are we now? (2509.10099v1)

Published 12 Sep 2025 in cs.SE and cs.AI

Abstract: Context. The rise of LLMs has led to their widespread adoption in development pipelines. Goal. We empirically assess the energy efficiency of Python code generated by LLMs against human-written code and code developed by a Green software expert. Method. We test 363 solutions to 9 coding problems from the EvoEval benchmark using 6 widespread LLMs with 4 prompting techniques, and comparing them to human-developed solutions. Energy consumption is measured on three different hardware platforms: a server, a PC, and a Raspberry Pi for a total of ~881h (36.7 days). Results. Human solutions are 16% more energy-efficient on the server and 3% on the Raspberry Pi, while LLMs outperform human developers by 25% on the PC. Prompting does not consistently lead to energy savings, where the most energy-efficient prompts vary by hardware platform. The code developed by a Green software expert is consistently more energy-efficient by at least 17% to 30% against all LLMs on all hardware platforms. Conclusions. Even though LLMs exhibit relatively good code generation capabilities, no LLM-generated code was more energy-efficient than that of an experienced Green software developer, suggesting that as of today there is still a great need of human expertise for developing energy-efficient Python code.

Summary

  • The paper presents a systematic evaluation comparing LLM-generated code with human-developed green solutions across multiple hardware platforms.
  • It analyzes four prompting strategies and measures energy consumption across six state-of-the-art LLMs, applying rigorous statistical methods to the results.
  • The study highlights that hardware context and expert coding remain critical for achieving optimal energy efficiency in Python implementations.

Generating Energy-Efficient Code via Large Language Models: An Empirical Assessment

Introduction

This paper presents a comprehensive empirical study of the energy efficiency of Python code generated by LLMs, comparing their output to human-written code and code developed by a Green software expert. The work is motivated by the increasing adoption of LLMs in software engineering pipelines and the growing importance of sustainability and energy efficiency in software development. The authors systematically evaluate six state-of-the-art LLMs using four prompting techniques across three hardware platforms (server, PC, Raspberry Pi), measuring the energy consumption of 363 solutions to nine coding problems from the EvoEval benchmark. The result is a statistically robust picture of the current capabilities and limitations of LLMs in generating energy-efficient code.

Figure 1: Overview of paper phases, including LLM selection, guideline elicitation, code generation, and energy measurement across hardware platforms.

Experimental Design and Methodology

The paper employs a rigorous experimental protocol, including:

  • LLM Selection: Six models (GPT-4, ChatGPT, DeepSeek Coder 33B, Speechless Codellama 34B, Code Millenials 34B, WizardCoder 33B) are chosen for diversity in architecture and accessibility.
  • Benchmark Problems: Nine computationally challenging Python problems from EvoEval are selected, ensuring all LLMs can generate functionally correct solutions.
  • Prompt Engineering: A base (functional) prompt is compared against four prompting techniques: keyword ("energy efficient"), hardware-specific, guideline-based (using 10 systematically derived green coding guidelines), and few-shot learning.
  • Human Baselines: Canonical solutions from EvoEval and expert-developed green solutions are included for comparison.
  • Energy Measurement: Solutions are executed on three platforms (server, PC, Raspberry Pi) with high-frequency energy sampling (up to 5000 Hz), totaling ~881 hours and 4.6 billion data points.
  • Statistical Analysis: Non-parametric Aligned Rank Transform (ART) ANOVA and Cliff's Delta are used to assess significance and effect sizes, with post-hoc corrections (a minimal sketch of the core computations follows this list).
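
To make the measurement and analysis pipeline concrete, here is a minimal sketch, not the authors' actual tooling: the function names and the NumPy-based approach are assumptions. It shows how a power trace sampled at a fixed rate can be reduced to energy, and how Cliff's Delta quantifies the effect size between two sets of per-run energy measurements.

```python
import numpy as np

def energy_joules(power_watts: np.ndarray, sample_rate_hz: float) -> float:
    """Convert a power trace (watts) sampled at a fixed rate into energy (joules).

    Energy is the integral of power over time; for evenly spaced samples,
    mean power times total duration is a simple approximation.
    """
    duration_s = len(power_watts) / sample_rate_hz
    return float(power_watts.mean() * duration_s)

def cliffs_delta(a: np.ndarray, b: np.ndarray) -> float:
    """Cliff's Delta effect size: P(a > b) - P(a < b) over all pairs.

    Ranges from -1 to 1; |delta| below ~0.147 is conventionally 'negligible'.
    """
    diffs = a[:, None] - b[None, :]       # all pairwise differences
    return float(np.sign(diffs).mean())   # (#greater - #less) / (n * m)

# Example with made-up per-run energy values (joules):
llm_runs = np.array([12.4, 13.1, 12.8, 13.0])
expert_runs = np.array([9.8, 10.2, 10.0, 9.9])
print(cliffs_delta(llm_runs, expert_runs))  # 1.0: every expert run uses less energy
```

The ART ANOVA itself would then be run on per-run energy values across the study's factors (model, prompt, platform); the sketch above covers only the energy and effect-size computations.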

Key Findings

LLM Energy Efficiency Varies by Model and Hardware

Energy consumption of LLM-generated code exhibits significant variation across models and hardware platforms. On the server, Speechless Codellama produces the most energy-efficient code, while Code Millenials and WizardCoder excel on the PC and Raspberry Pi, respectively. GPT-4 consistently generates the least energy-efficient code across all platforms. The effect sizes for these differences are large, indicating practical significance.

Figure 2: Energy distributions for canonical (human) and functional (LLM-generated) solutions across hardware platforms.

Human vs. LLM-Generated Code

Canonical human solutions are 16% more energy-efficient than LLM-generated code on the server and 3% more efficient on the Raspberry Pi, while LLMs outperform humans by 25% on the PC. The difference on the Raspberry Pi is not statistically significant. The choice of LLM does not strongly influence this comparison, suggesting that hardware context is a dominant factor.

Prompt Engineering Has Limited Impact

Prompting strategies, including energy-specific keywords, hardware details, green guidelines, and few-shot examples, do not consistently yield more energy-efficient code. The most energy-efficient prompt varies by hardware, and the effect sizes for prompt-induced differences are generally small or negligible. Notably, guideline and few-shot prompts increase the variance in energy consumption, indicating unpredictable LLM behavior when given explicit energy-saving instructions.

Figure 3: Energy consumption on the server for solutions generated via different prompting techniques, showing high variance for guideline and few-shot prompts.
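
For concreteness, the sketch below shows what the evaluated prompt variants might look like. The wording is hypothetical; the paper's exact prompts, guidelines, and few-shot examples are not reproduced here.

```python
# Hypothetical prompt templates for the evaluated strategies; the exact
# wording used in the study may differ.
TASK = "Write a Python function solve(data) that ..."  # an EvoEval problem statement

PROMPTS = {
    # Base: the functional task description only.
    "base": TASK,
    # Keyword: append an energy-efficiency request.
    "keyword": TASK + "\nGenerate energy-efficient code.",
    # Hardware-specific: describe the target platform.
    "hardware": TASK + "\nThe code will run on a Raspberry Pi (ARM CPU, limited RAM).",
    # Guideline-based: include explicit green coding guidelines.
    "guideline": TASK + "\nFollow these guidelines: prefer built-in functions, "
                        "avoid redundant computation, minimize memory allocations.",
    # Few-shot: prepend worked examples of inefficient vs. efficient code.
    "few_shot": "Example (inefficient -> efficient): ...\n\n" + TASK,
}
```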

Green Expert Code Sets the Lower Bound

Code developed by a Green software expert is consistently more energy-efficient than all LLM-generated solutions, with a margin of 17–30% across platforms. In rare cases, Speechless Codellama and WizardCoder, when prompted with guidelines or few-shot examples, produce solutions that rival or slightly outperform expert code, but no LLM-prompt combination consistently matches expert performance across all hardware.

Figure 4: Comparison of energy usage between Green expert solutions and LLM-generated solutions across platforms.

Implications

For Developers

  • Critical Evaluation Required: LLM-generated code's energy efficiency is highly context-dependent. Developers should not assume that code generated by LLMs is optimal for their target hardware.
  • Manual Refinement Necessary: Reviewing and refining LLM-generated code using green coding guidelines remains essential for energy-critical applications (an illustrative refinement follows this list).
  • Prompt Engineering Limitations: Current prompt engineering techniques do not reliably improve energy efficiency; hardware-aware or guideline-based prompts may increase code variance rather than guarantee improvements.
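
As an illustration of the kind of manual refinement meant here (this example is ours, not taken from the paper), an expert might replace an allocation-heavy draft with a single-pass equivalent:

```python
# LLM-style draft: builds an intermediate list before reducing it.
def mean_of_squares_draft(values):
    squares = []
    for v in values:
        squares.append(v * v)
    return sum(squares) / len(squares)

# Green refinement: a generator expression streams the squares, avoiding
# the intermediate list and its allocations; on memory-constrained hardware
# this kind of change can reduce energy use (effects are workload-dependent).
def mean_of_squares_green(values):
    return sum(v * v for v in values) / len(values)
```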

For LLM Vendors

  • Non-Functional Metrics Needed: LLMs should be evaluated and fine-tuned for non-functional properties such as energy efficiency, not just correctness.
  • Green Code Generation as a Competitive Advantage: Vendors should invest in model architectures, training data, and fine-tuning strategies that explicitly target energy efficiency, potentially leveraging domain-specific models or retrieval-augmented generation.

For Researchers

  • Hardware Context is Crucial: Energy efficiency results are not generalizable across hardware; future studies must include diverse platforms and report hardware-specific findings.
  • Green Code Benchmarks Needed: The community should develop and maintain benchmarks of energy-efficient code to facilitate reproducible research and model evaluation.
  • Expertise Remains Valuable: Human expertise in green software engineering is still necessary; LLMs have not yet matched expert-level energy optimization.

Limitations and Threats to Validity

  • Language Scope: The paper focuses on Python; results may not generalize to other languages.
  • Hardware Coverage: Only three platforms are tested; broader hardware diversity may yield different results.
  • Expert Baseline: The Green expert's solutions represent a practical lower bound, not an absolute optimum.
  • LLM Evolution: Rapid advances in LLMs may affect the long-term relevance of these findings.

Conclusion

The empirical evidence demonstrates that, as of today, LLM-generated Python code does not consistently match the energy efficiency of code written by human experts. Prompt engineering offers limited and unpredictable improvements. Hardware context is a major determinant of energy efficiency, and expert knowledge remains indispensable for green software development. The paper highlights the need for further research into LLM fine-tuning for non-functional requirements, the development of green code benchmarks, and the integration of energy efficiency as a first-class metric in both academic and industrial LLM evaluation.
