
ELANA Profiling Tool for LLM Deployment

Updated 14 December 2025
  • ELANA Profiling Tool is an open-source utility that measures LLM deployment costs including latency, energy usage, and memory footprint.
  • It employs precise metrics such as TTFT, TPOT, and TTLT to quantify inference latency and integrates with Hugging Face for standard benchmarking.
  • ELANA supports hardware–model co-design by providing actionable insights on GPU power consumption and cache usage across diverse platforms.

ELANA is an open-source profiling tool designed to analyze the deployment costs—energy consumption and latency—of LLMs on a diverse range of hardware platforms, encompassing both multi-GPU cloud infrastructure and edge GPUs such as the NVIDIA Jetson series. ELANA provides precise measurement of model size, key-value (KV) cache size, prefill and generation token latencies, end-to-end inference latency, and energy utilization. Direct integration with Hugging Face APIs, coupled with an extensible architecture for custom or compressed models, positions ELANA as a foundational utility for empirical evaluation, optimization, and hardware–model co-design in research contexts focused on efficient large-scale neural language processing (Chiang et al., 7 Dec 2025).

1. High-Level Purpose and Principal Features

ELANA is engineered as a lightweight command-line profiling utility to enable reproducible, unified benchmarking of LLMs across heterogeneous computational substrates. Its feature set includes:

  • Model Size Analysis: Quantifies total parameter count (aggregating trainable, frozen, and buffer parameters, including positional embeddings and quantized weights).
  • Inference-Time State Footprint Estimation: Computes the memory demand for KV caches in Transformer-decoder architectures and analogous "state caches" in state-space models (SSMs), parameterized by prompt length ($L$) and batch size ($B$).
  • Latency Measurement: Dissects inference into prefilling latency (TTFT: Time-To-First-Token), generation latency (TPOT: Time-Per-Output-Token), and overall request duration (TTLT: Time-To-Last-Token), adhering to precise wall-clock semantics.
  • Energy Consumption Profiling: Captures instantaneous GPU power via NVML (cloud) or jtop (Jetson), converting time metrics into Joules per prompt, token, and request.
  • Hugging Face Compatibility: Immediate support for Hugging Face model zoo; minimal adaptation required for custom/compressed models via subclassing.
  • Kernel-Level Profiling (Optional): Hooks into PyTorch Profiler and Holistic Trace Analysis to generate Perfetto timeline outputs, enabling in-depth GPU bottleneck analysis.
  • Platform Generality: Automatic GPU enumeration; concurrent profiling across devices; energy aggregation for multi-GPU setups.

These functionalities allow users to evaluate LLM deployment cost trade-offs under diverse operational regimes.

2. Metric Definitions and Analytical Formulas

ELANA formalizes each profiling metric with explicit definitions and computational formulas:

  • Model Size ($S_{\text{model}}$):

$S_{\text{model}} = P \times w$

where $P$ is the total parameter count and $w$ is the byte width (fp32: 4 B; fp16: 2 B; 8-bit quantized: 1 B). Output is reported in SI units.

  • KV Cache and State Cache Size ($S_{\text{cache}}(L)$):

$S_{\text{cache}}(L) = 2 N d L w$

for a Transformer decoder with $N$ layers, hidden dimension $d$, sequence length $L$, and byte width $w$.

  • Time-to-First-Token (TTFT):

$\text{TTFT} = t_{\text{first token}} - t_{\text{start}}$

Captures prefill-phase duration; unit: ms.

  • Time-per-Output-Token (TPOT):

$\text{TPOT} = \frac{1}{T_g} \sum_{i=1}^{T_g} [t_i - t_{i-1}]$

where $T_g$ is the output token count; unit: ms/token.

  • Time-to-Last-Token (TTLT):

$\text{TTLT} = \text{TTFT} + T_g \times \text{TPOT}$

Reports end-to-end latency; unit: ms.

  • Energy Logging ($E$):
    • Instantaneous power $p(t)$ is sampled at interval $\Delta$ (default: 0.1 s).
    • Average power: $\overline{P} = \frac{1}{t_1 - t_0} \int_{t_0}^{t_1} p(t)\,dt \approx \frac{1}{K} \sum_{k} p(t_k)$.
    • Total energy: $E = \overline{P} \times (t_1 - t_0)$.
    • Results are reported as:
      • $\text{J/Prompt} = E_{\text{prefill}}$
      • $\text{J/Token} = E_{\text{generation}} / T_g$
      • $\text{J/Request} = E_{\text{total}}$

The adopted metric definitions are tightly aligned with operational bottlenecks in real-world LLM serving and facilitate comparative hardware–model studies.
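
The size and energy formulas above can be reproduced directly. The following minimal Python sketch (illustrative, not ELANA's internal code) implements $S_{\text{model}}$, $S_{\text{cache}}(L)$, and the discrete energy approximation. The Llama-3.1-8B-style shapes in the example are assumptions for illustration; note that for grouped-query-attention models $d$ should be taken as the KV projection width (KV heads × head dimension) rather than the full hidden size:

    def model_size_bytes(model):
        """S_model = P * w: sum parameters and buffers at their native byte widths."""
        tensors = list(model.parameters()) + list(model.buffers())
        return sum(t.numel() * t.element_size() for t in tensors)

    def kv_cache_bytes(n_layers, d, seq_len, w, batch=1):
        """S_cache(L) = 2 * N * d * L * w per sequence; the factor 2 covers keys and values."""
        return 2 * n_layers * d * seq_len * w * batch

    def energy_joules(power_samples_w, t0, t1):
        """E = avg(p) * (t1 - t0), the discrete approximation of the integral above."""
        return (sum(power_samples_w) / len(power_samples_w)) * (t1 - t0)

    # Llama-3.1-8B-like KV shapes: 32 layers, 8 KV heads x 128 head_dim = 1024, fp16 (w=2)
    print(kv_cache_bytes(32, 1024, 1024, 2, batch=1) / 1e9)    # ~0.13 GB (cf. Section 5)
    print(kv_cache_bytes(32, 1024, 1024, 2, batch=128) / 1e9)  # ~17.18 GB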

3. Architecture, Implementation, and Profiling Workflow

ELANA's implementation is predicated on modular extensibility for rapid research iteration:

  • Hardware Support: Employs automatic GPU detection via CUDA_VISIBLE_DEVICES; launches per-device profiling processes. Multi-GPU mode aggregates energy metrics. Jetson edge devices utilize jtop for on-board SoC power sampling.
  • Hugging Face Integration: Models are loaded using standard Hugging Face AutoTokenizer and AutoModelForCausalLM; supports arbitrary repository checkpoints. For non-standard architectures or quantized weights, a _build_model_and_tokenizer method in Profiler can be overridden.
  • Profiling Flow:
  1. CLI argument parsing (model ID, batch size, prompt/gen lengths, device selection, energy logging, profiling granularity).
  2. Model/tokenizer instantiation; device migration.
  3. Warm-up, if specified.
  4. Prefilling latency: multiple prompts run, TTFT measured.
  6. Generation latency: TPOT profiling, with CUDA graph caching where available.
  6. End-to-end latency: TTLT calculation.
  7. Power sampling and energy computation in parallel.
  8. Optional kernel-level tracing via PyTorch Profiler/Holistic Trace Analysis, exporting Perfetto-compatible timelines.

A plausible implication is that ELANA’s separation of latency components and concurrent energy measurement facilitates rigorous analysis of model-hardware interface effects.
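
As a concrete illustration of steps 3–7, the sketch below measures TTFT, TPOT, and TTLT around a manual decode loop while sampling GPU power on a background thread. This is a simplified reconstruction under the Section 2 definitions, not ELANA's implementation; the checkpoint name is illustrative, pynvml reports power in milliwatts, and torch.cuda.synchronize is required for correct wall-clock timing on GPU:

    import time, threading
    import torch, pynvml
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def power_thread(stop, samples, delta=0.1, device_index=0):
        # Step 7: sample instantaneous GPU power p(t) every delta seconds (in watts).
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        while not stop.is_set():
            samples.append(pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0)  # mW -> W
            time.sleep(delta)

    @torch.no_grad()
    def profile(model, tok, prompt, gen_tokens=128, device="cuda:0"):
        ids = tok(prompt, return_tensors="pt").input_ids.to(device)
        model(ids, use_cache=True)                        # step 3: warm-up
        torch.cuda.synchronize()

        t0 = time.perf_counter()                          # step 4: prefill -> TTFT
        out = model(ids, use_cache=True)
        next_id = out.logits[:, -1:].argmax(-1)
        torch.cuda.synchronize()
        ttft_ms = (time.perf_counter() - t0) * 1e3

        past, step_ms = out.past_key_values, []           # step 5: decode -> TPOT
        for _ in range(gen_tokens):
            t1 = time.perf_counter()
            out = model(next_id, past_key_values=past, use_cache=True)
            past, next_id = out.past_key_values, out.logits[:, -1:].argmax(-1)
            torch.cuda.synchronize()
            step_ms.append((time.perf_counter() - t1) * 1e3)

        tpot_ms = sum(step_ms) / len(step_ms)
        ttlt_ms = ttft_ms + gen_tokens * tpot_ms          # step 6: TTLT
        return ttft_ms, tpot_ms, ttlt_ms

    pynvml.nvmlInit()
    samples, stop = [], threading.Event()
    threading.Thread(target=power_thread, args=(stop, samples), daemon=True).start()

    tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")  # illustrative model ID
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.2-1B", torch_dtype=torch.float16).to("cuda:0").eval()

    t_start = time.perf_counter()
    print(profile(model, tok, "Profiling large language models"))
    stop.set()
    energy_j = (sum(samples) / len(samples)) * (time.perf_counter() - t_start)
    print(f"J/Request ~= {energy_j:.1f}")  # E = avg power x elapsed time
    pynvml.nvmlShutdown()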

4. Installation, Usage, and Extensibility

Installation prerequisites comprise Linux (Ubuntu 22.04 confirmed), Python ≥3.8, and the relevant NVIDIA toolchains. Pip-installable dependencies include torch (≥2.0), transformers, pynvml, jtop, psutil, and click.

pip install git+https://github.com/enyac-group/Elana.git

Example invocations:

  • Single-GPU latency profiling:
    elana --model meta/llama-3.1-8b --device cuda:0 --batch-size 1 \
          --prompt-length 512 --gen-length 512
  • Multi-GPU, energy logging:
    elana --model qwen/qwen-2.5-7b --gpus 0,1,2,3 --batch-size 64 \
          --prompt-length 512 --gen-length 512 --energy
  • Edge GPU (Jetson):
    elana --model local/llama-3.2-1b --device cuda:0 \
          --prompt-length 256 --gen-length 256 --energy
  • Custom bit-width/compressed model:
    • Implement a Profiler subclass that overrides _build_model_and_tokenizer (see the sketch below).
    • Launch with elana --override override.MyProfiler --energy.

Extensibility: New architectures can be supported by subclassing Profiler (<400 lines) and overriding the cache-size computations; additional hardware sensors can be integrated via a new PowerSampler class; and extra metrics and traces can be appended through CLI hooks.
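
For example, a compressed-model profiler might look like the following sketch. The import path and the exact signature of _build_model_and_tokenizer are assumptions inferred from the description above; consult the repository for the authoritative interface:

    # override.py -- hypothetical Profiler subclass for an 8-bit quantized checkpoint
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from elana import Profiler  # assumed import path

    class MyProfiler(Profiler):
        def _build_model_and_tokenizer(self):
            # Load an 8-bit quantized checkpoint instead of the default fp16 path.
            tokenizer = AutoTokenizer.from_pretrained("local/my-compressed-model")
            model = AutoModelForCausalLM.from_pretrained(
                "local/my-compressed-model",
                quantization_config=BitsAndBytesConfig(load_in_8bit=True),
            )
            return model, tokenizer

Launching with elana --override override.MyProfiler --energy would then route profiling through this subclass with energy logging enabled.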

5. Output Interpretation and Research Significance

ELANA generates structured tables for model size and cache usage, facilitating rapid hardware resource planning (e.g., verifying feasibility of specific batch and sequence lengths on available GPU RAM). For example:

Model          Params (GB)   KV Cache (GB; B=1, L=1024)   KV Cache (GB; B=128, L=1024)
Llama-3.1-8B   16.06         0.13                         17.18

Latency and energy tables clarify operational costs at token-level granularity:

Model          TTFT (ms)   J/Prompt   TPOT (ms)   J/Token   TTLT (ms)   J/Request
Llama-3.1-8B   94.3        25.9       24.8        6.8       12859.8     3533.1

Perfetto timeline outputs enable bottom-up profiling, revealing kernel-level execution bottlenecks such as compute vs. memcpy overlaps and the relative resource intensities of model components (e.g., attention vs. MLP).

This suggests ELANA’s outputs are directly actionable in both operational provisioning and algorithm–system co-design. The tool thus provides a standardized analytic foundation for LLM efficiency studies and green AI initiatives.

6. Extending ELANA: Customization Pathways

ELANA’s architecture is purposefully minimalist to enable rapid adaptation:

  • New Model Integration: Subclass Profiler and redefine _build_model_and_tokenizer for alternate architectures or compression schemes.
  • Cache Calculations: For models diverging from Transformer assumptions (e.g., SSM), override cache-size estimator logic.
  • Hardware Adaptation: Integrate custom power sensors by implementing a PowerSampler instance and registering it via the CLI (see the sketch at the end of this section).
  • Additional Metrics: Employ --profile-kernels to trigger further PyTorch Profiler traces, subsequently analyzed through generated Perfetto JSON.

The built-in extensibility suggests ELANA is suitable for both standardized LLM benchmarking and bespoke low-level performance investigations, supporting a wide spectrum of efficiency-driven research.
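
As an illustration of the hardware-adaptation pathway, a custom sensor might be wrapped as follows. The PowerSampler name comes from the documentation above, but the interface shown is an assumption, and the hwmon path is board-specific:

    from elana import PowerSampler  # assumed import path and base-class interface

    class HwmonSampler(PowerSampler):
        """Hypothetical sampler reading instantaneous power from a Linux hwmon node."""
        def read_power_watts(self):
            # hwmon power*_input files report microwatts
            with open("/sys/class/hwmon/hwmon0/power1_input") as f:
                return int(f.read()) / 1e6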

7. Contextual Positioning and Research Impact

Released by the EnyaC research group, ELANA’s design directly addresses empirical bottlenecks in serving large-scale LLMs, particularly the interdependencies between latency, memory consumption, and energy usage on modern GPU platforms (Chiang et al., 7 Dec 2025). Its academic-friendly interface and open-source availability foster reproducible efficiency benchmarking, informing model compression, quantization, and hardware–software co-design efforts.

A plausible implication is that widespread adoption of ELANA in the research community can accelerate progress in energy-efficient model deployment and inform upstream decisions in algorithm, architecture, and hardware development. Its metric definitions and profiling protocols also provide a reference for standardized model evaluation and reporting in future publications.
