TokenPowerBench: LLM Energy Benchmark
- TokenPowerBench is a lightweight, open-source framework that systematically quantifies LLM inference energy at GPU, node, and system levels.
- It uses a declarative configuration interface with YAML/JSON to enable reproducible experiments across varying models, hardware, and inference engines.
- Empirical analyses reveal that factors like model scaling, batch size, and quantization dramatically influence energy efficiency, guiding sustainable LLM operations.
TokenPowerBench is a lightweight, extensible open-source benchmarking framework specifically designed to measure and analyze the power consumption of LLM inference. Unlike prior benchmarks that focus on training or throughput, TokenPowerBench systematically quantifies energy use at the level of GPU, node, and system, decomposing consumption across the inference prefill and decode phases. The platform supports fine-grained experimental control by enabling users to specify model, workload, hardware, and inference engine parameters through a declarative interface. It facilitates energy-efficiency analysis, infrastructure planning, and sustainability tracking for large-scale LLM deployments (Niu et al., 2 Dec 2025).
1. Declarative Configuration Interface
TokenPowerBench employs human-readable YAML or JSON configuration files for complete experiment specification. Key configuration fields include model selection (e.g., "llama3-405B", "falcon-180B", "qwen-32B", "mistral-8×7B"), prompt set (predefined such as "alpaca", "longbench", or custom CSV/JSON), inference engine ("transformers", "deepspeed", "tensorrt_llm", "vllm"), hardware topology (e.g., "1×H100", "4×H100", "8×H100-8node"), batch size, context length (maximum prompt length), parallelism (tensor-parallel and pipeline-parallel settings), and quantization type ("fp16", "fp8").
Example YAML configuration:
```yaml
model: llama3-405B
prompt_set: alpaca
engine: vllm
hardware: 4xH100
batch_size: 256
context_length: 2048
parallelism:
  tp: 4
  pp: 4
quantization: fp16
```
Command-line invocation supports both YAML and JSON interfaces. This explicit, versionable configuration structure enables reproducible, parameterized experiments and systematic exploration of factors affecting inference energy (Niu et al., 2 Dec 2025).
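As an illustration of how such a declarative specification might be consumed programmatically, the sketch below loads and validates a config file with the field names from the example above; the loader itself is a hypothetical stand-in, not the published TokenPowerBench API.

```python
# Hypothetical loader for the declarative experiment spec shown above;
# field names follow the example YAML, not the actual TokenPowerBench code.
import json
import yaml  # pip install pyyaml

REQUIRED_FIELDS = {"model", "prompt_set", "engine", "hardware",
                   "batch_size", "context_length", "parallelism", "quantization"}

def load_experiment_config(path: str) -> dict:
    """Read a YAML or JSON experiment file and check required fields."""
    with open(path) as f:
        cfg = yaml.safe_load(f) if path.endswith((".yaml", ".yml")) else json.load(f)
    missing = REQUIRED_FIELDS - cfg.keys()
    if missing:
        raise ValueError(f"config is missing fields: {sorted(missing)}")
    return cfg

cfg = load_experiment_config("llama3_405b_vllm.yaml")
print(cfg["model"], cfg["parallelism"])  # e.g. llama3-405B {'tp': 4, 'pp': 4}
```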
2. Measurement Layer: Sampling and Integration
TokenPowerBench eschews specialized external power meters. Instead, it samples:
- GPU power via NVIDIA Management Library (NVML/DCGM) or nvidia-smi,
- CPU and DRAM power using Intel RAPL model-specific registers (MSRs),
- Whole-node (wall-plug) power via IPMI or rack-level power distribution unit (PDU) APIs (e.g., Redfish).
Sampling occurs at a default rate of 1 Hz (user-configurable up to 10 Hz); each time-stamped record allows precise temporal alignment. Power samples in watts are numerically integrated to compute energy in joules over the experiment:

$$E = \int_{t_{\text{start}}}^{t_{\text{end}}} P(t)\,dt \approx \sum_i P(t_i)\,\Delta t_i$$
This methodology ensures detailed attribution of energy use across hardware subsystems without additional instrumentation (Niu et al., 2 Dec 2025).
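A minimal sketch of this sampling-and-integration approach for the GPU component, using the standard `pynvml` NVML bindings; the script is illustrative of the methodology described above rather than TokenPowerBench's actual implementation.

```python
# Illustrative GPU power sampling and energy integration (not the actual
# TokenPowerBench code): poll NVML at ~1 Hz and accumulate joules.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)    # first GPU

samples = []                                     # (timestamp, watts)
t_end = time.time() + 30                         # sample for a 30 s window
while time.time() < t_end:
    watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # NVML reports mW
    samples.append((time.time(), watts))
    time.sleep(1.0)                              # default 1 Hz sampling rate

# Rectangle-rule integration: E ≈ Σ P(t_i) · Δt_i  (joules)
energy_j = sum(p * (t2 - t1)
               for (t1, p), (t2, _) in zip(samples, samples[1:]))
print(f"GPU energy over window: {energy_j:.1f} J")
pynvml.nvmlShutdown()
```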
3. Phase-Aligned Metrics Pipeline
TokenPowerBench partitions inference into two logical, temporally annotated phases:
- Prefill—processing the full prompt (all input tokens).
- Decode—generating output tokens.
Each inference call is wrapped with "phase-start" and "phase-end" events, allowing all samples to be tagged by phase. Energy is then integrated by phase:

$$E_{\text{phase}} = \sum_{t_i \in \text{phase}} P(t_i)\,\Delta t_i, \qquad \text{phase} \in \{\text{prefill}, \text{decode}\}$$

This phase-level resolution enables quantification of how energy is spent on input versus output processing, supporting actionable bottleneck analysis and informing software/hardware co-design (Niu et al., 2 Dec 2025).
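A minimal sketch of phase-aligned accounting under the same rectangle-rule integration; the sample and event formats below are assumed for illustration and are not the framework's internal data structures.

```python
# Illustrative phase-aligned integration (hypothetical data structures):
# tag each power sample with the phase whose [start, end) window contains it,
# then integrate per phase with the same rectangle rule as above.
from collections import defaultdict

def energy_by_phase(samples, phase_events):
    """samples: [(t, watts), ...]; phase_events: [(phase, t_start, t_end), ...]"""
    totals = defaultdict(float)
    for (t1, p), (t2, _) in zip(samples, samples[1:]):
        for phase, start, end in phase_events:
            if start <= t1 < end:                 # sample falls inside this phase
                totals[phase] += p * (t2 - t1)    # joules
                break
    return dict(totals)

# Example: 0-2 s prefill at ~600 W, 2-6 s decode at ~400 W
samples = [(0, 600), (1, 610), (2, 420), (3, 400), (4, 405), (5, 395), (6, 390)]
events = [("prefill", 0, 2), ("decode", 2, 6)]
print(energy_by_phase(samples, events))  # {'prefill': 1210.0, 'decode': 1620.0}
```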
4. Energy-Efficiency Metrics and Reporting
TokenPowerBench computes a suite of normalized, phase-aware energy metrics:
- Total energy per request: $E_{\text{request}} = E_{\text{prefill}} + E_{\text{decode}}$.
- Joules per decoded token: $E_{\text{total}} / N_{\text{decoded tokens}}$.
- Joules per response: $E_{\text{total}} / N_{\text{responses}}$.
- Instantaneous power: average $\bar{P} = E_{\text{total}} / T_{\text{total}}$ and peak $P_{\text{peak}} = \max_i P(t_i)$.
Optionally, users may supply an electricity price $c_{\$/\mathrm{kWh}}$ and a carbon-intensity factor $\alpha_{\mathrm{kgCO_2}/\mathrm{kWh}}$; the framework then reports $\text{Cost} = E_{\text{total}}(\mathrm{kWh}) \times c_{\$/\mathrm{kWh}}$ and $\mathrm{CO}_2 = E_{\text{total}}(\mathrm{kWh}) \times \alpha_{\mathrm{kgCO_2}/\mathrm{kWh}}$. Such systematic reporting facilitates inference operating-expense forecasting and carbon accounting within LLM service operations (Niu et al., 2 Dec 2025).
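The reported figures reduce to simple ratios of the integrated energies; the following sketch, using hypothetical values and assumed price and carbon factors, shows how the per-token, per-response, cost, and CO2 numbers could be derived.

```python
# Illustrative derivation of the normalized metrics (hypothetical values).
e_prefill_j = 1210.0          # joules spent in prefill
e_decode_j = 1620.0           # joules spent in decode
n_decoded_tokens = 128
n_responses = 1
wall_time_s = 6.0

e_total_j = e_prefill_j + e_decode_j
joules_per_token = e_total_j / n_decoded_tokens
joules_per_response = e_total_j / n_responses
avg_power_w = e_total_j / wall_time_s

# Optional cost / carbon accounting from user-supplied factors (assumed here)
price_per_kwh = 0.12          # $/kWh
carbon_kg_per_kwh = 0.4       # kgCO2/kWh
e_total_kwh = e_total_j / 3.6e6
print(f"{joules_per_token:.1f} J/token, {joules_per_response:.0f} J/response, "
      f"avg {avg_power_w:.0f} W")
print(f"cost ${e_total_kwh * price_per_kwh:.6f}, "
      f"CO2 {e_total_kwh * carbon_kg_per_kwh * 1000:.3f} g")
```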
5. Systematic Experimental Variables
TokenPowerBench has been empirically validated across a comprehensive matrix:
- Model series: LLaMA (1B–405B), Falcon (7B–180B), Qwen (8B–480B), Mistral (7B–8×22B Mixtral MoE).
- Inference engines: Transformers, DeepSpeed-Inference, TensorRT-LLM, vLLM.
- Hardware: up to 8-node clusters, each with 4×H100 GPUs.
- Batch size: Swept from 1 to 1024.
- Context length: 0–10,000 input tokens.
- Parallelism: e.g., TP4/PP4, TP8/PP2, TP16/PP1.
- Quantization: FP16 vs FP8.
This broad coverage enables characterization of both architectural and load-dependent variables affecting energy consumption in LLM inference (Niu et al., 2 Dec 2025).
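Such a sweep is effectively a Cartesian product over configuration values; the sketch below enumerates a reduced version of the matrix (values chosen for illustration, not TokenPowerBench internals).

```python
# Illustrative sweep enumeration over a reduced experimental matrix.
from itertools import product

sweep = {
    "model": ["llama3-8B", "llama3-70B", "mixtral-8x7B"],
    "engine": ["transformers", "deepspeed", "tensorrt_llm", "vllm"],
    "batch_size": [1, 32, 256, 1024],
    "quantization": ["fp16", "fp8"],
}

configs = [dict(zip(sweep, values)) for values in product(*sweep.values())]
print(len(configs), "experiment configurations")   # 3*4*4*2 = 96
print(configs[0])
```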
6. Empirical Findings and Prescriptive Insights
TokenPowerBench experiments demonstrate:
- Sub-linear scaling of energy per token with model size: For LLaMA-3, J/token increases ~7.3× moving from 1B to 70B parameters, against a 70× growth in parameter count, reflecting memory-bandwidth and cache-traffic overheads.
- Sparsity advantage for MoE: Mixtral-8×7B MoE models consume only ~⅓ the J/token of a dense 8B model at similar quality, as only two experts are active per token.
- Engine-specific effects: TensorRT-LLM and vLLM draw ~3× higher power during prefill (prompt ingestion) than Transformers or DeepSpeed, but their optimized decoding reduces decode-phase energy by 25–40%.
- Batch-size effect: J/token drops by ~25% as batch size increases from 32 to 256, owing to higher GPU utilization; beyond 256 the gains largely plateau, with only modest further reductions observed up to 1024.
- Prompt context impact: Increasing prompt length from 2k to 10k tokens raises prefill energy linearly and inflates J/token by ~3×.
- Parallelism strategy: Pure tensor-parallelism (TP16/PP1) minimizes J/token, outperforming TP/PP mixes under all loads; mixed configurations incur pipeline bubbles and additional idle power, with >20 J/token disparity under high throughput.
- Quantization: FP8 reduces energy per token by ∼30% without significant accuracy loss, observed in LLaMA-3 405B under heavy batching, lowering total energy from ~45 kJ to ~32 kJ and raising throughput from 48 to 63 tokens/s.
Recommended best practices:
- Use the largest feasible batch size compatible with latency requirements (typically 256–512).
- Avoid excessive prompt length; only supply necessary context.
- Prefer pure tensor-parallel distribution for large models.
- Use low-precision quantization (FP8) when hardware permits.
- Consider vLLM or TensorRT-LLM for large-batch decode to offset higher prefill cost with lower decoding J/token.
Open-source availability, reproducible phase-aware metrics, and configuration sweeps collectively enable direct operational expense estimation and carbon footprint management for LLM inference at scale (Niu et al., 2 Dec 2025).
7. Extension Pathways Informed by Multi-modal Benchmarking
The 3MEthTaskforce framework (Li et al., 21 Jan 2025) provides a blueprint for further extensibility:
- Integrate multi-modal data beyond tokens (e.g., on-chain liquidity, sentiment, external global indices) to enable richer analytics.
- Extend benchmarking to multi-chain or cross-infrastructure deployments for interoperability studies.
- Incorporate high-frequency sampling for fine-grained or ultra-low latency settings.
- Enrich with anomaly detection, risk metrics (e.g., VaR, expected shortfall), regulatory analysis (whale detection, K-anonymity), and network-theoretic measures (Node/PageRank centrality).
- Support advanced evaluation protocols (e.g., rolling window forecast, behavior segmentation) for comprehensive performance analyses.
A plausible implication is that TokenPowerBench can underpin a broader class of multi-modal, energy-aware benchmarking tools in LLMs, extending beyond power accounting to holistic system optimization and forecasting (Li et al., 21 Jan 2025, Niu et al., 2 Dec 2025).