Concentric Incremental Rating Circles (CIRC)
- Concentric Incremental Rating Circles (CIRC) is a deterministic, Euclidean-based method that maps energy efficiency and accuracy onto a normalized 2D plane.
- It assigns 1–5 ratings by computing the Euclidean distance from an ideal point using fixed concentric circles, ensuring transparent and reproducible evaluations.
- CIRC enables practical trade-offs between energy sustainability and functional performance with applications in code generation and summarization.
Concentric Incremental Rating Circles (CIRC) is a deterministic, Euclidean-based scoring method for benchmarking code LLMs (CLMs) on the joint criteria of energy efficiency and functional accuracy. Designed to deliver robust, interpretable 1–5 ratings, CIRC situates models in a normalized two-dimensional plane and uses concentric circles centered at the ideal point—representing maximal efficiency and accuracy—to assign categorical scores that reward models closest to the optimum. Its key objective is to balance energy sustainability with functional correctness in a transparent, reproducible manner, providing an outlier-immune and computationally trivial alternative to adaptive rating frameworks (Mehditabar et al., 10 Nov 2025).
1. Foundational Principles and Conceptual Framework
CIRC addresses the challenge of simultaneously evaluating CLMs for both energy efficiency and accuracy. Each model is represented as a point on a 2D plane: the $x$-axis corresponds to normalized energy efficiency, while the $y$-axis represents normalized accuracy. The normalization scales both metrics to $[0, 1]$ across all evaluated models, with $(1, 1)$ denoting the ideal performer (highest efficiency, highest accuracy) and $(0, 0)$ the worst. The CIRC scheme overlays five concentric rings, spaced evenly in Euclidean distance, centered at this optimum.
This geometric construction interpolates between the extremes of arithmetic-mean (overly permissive) and harmonic-mean (overly unforgiving) objective fusion. By using the Euclidean norm, CIRC enables meaningful trade-offs between accuracy and efficiency, so models that slightly sacrifice one metric to significantly improve the other are not heavily penalized.
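For a concrete sense of where Euclidean fusion sits between these extremes, compare three points that share the same arithmetic mean; the coordinates are chosen purely for exposition and are not taken from the source, with $x$ denoting normalized efficiency and $y$ normalized accuracy:

$$
\begin{array}{c|ccc}
(x,\ y) & \text{arithmetic } \tfrac{x+y}{2} & \text{harmonic } \tfrac{2xy}{x+y} & \text{CIRC distance } \sqrt{(1-x)^2 + (1-y)^2} \\
\hline
(0.7,\ 0.7) & 0.70 & 0.70 & 0.42 \\
(0.9,\ 0.5) & 0.70 & 0.64 & 0.51 \\
(1.0,\ 0.4) & 0.70 & 0.57 & 0.60
\end{array}
$$

The arithmetic mean cannot separate the three points and the harmonic mean penalizes imbalance sharply, whereas the CIRC distance grows only moderately: under the five-ring scheme formalized below, the first two points share a rating of 4 while the most imbalanced one drops by a single ring to 3.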
2. Mathematical Formulation
CIRC’s formal operation comprises normalization, distance computation, and discrete score assignment as follows:
Normalization:
For model $m$ with raw energy $E_m$ and raw accuracy $A_m$, across all models,

$$
\mathrm{eff}_m = 1 - \frac{E_m - E_{\min}}{E_{\max} - E_{\min}}, \qquad \mathrm{acc}_m = \frac{A_m - A_{\min}}{A_{\max} - A_{\min}}.
$$

Each model thus maps to a point $(x_m, y_m) = (\mathrm{eff}_m, \mathrm{acc}_m) \in [0, 1]^2$.
Distance Calculation:
The Euclidean distance from the ideal point $(1, 1)$ is:

$$
d_m = \sqrt{(1 - x_m)^2 + (1 - y_m)^2},
$$

with $d_{\max} = \sqrt{2}$ for the extremal case $(x_m, y_m) = (0, 0)$.
Rating Assignment:
Divide $[0, \sqrt{2}]$ into five equal intervals of width $\Delta r = \sqrt{2}/5$. A model is assigned a score based on which ring it occupies:

$$
\mathrm{score}_m = 5 - j \quad \text{if } d_m \in \big(j\,\Delta r,\ (j+1)\,\Delta r\big], \quad j = 0, 1, \dots, 4,
$$

or equivalently, with the ring index $k_m = \max\!\big(1, \lceil d_m / \Delta r \rceil\big) \in \{1, \dots, 5\}$ (so that $d_m = 0$ receives the top score):

$$
\mathrm{score}_m = 6 - k_m.
$$
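As a brief worked example (the raw figures are hypothetical and serve only to exercise the formulas): suppose the evaluated pool spans $E_{\min} = 20\,\mathrm{J}$, $E_{\max} = 120\,\mathrm{J}$, $A_{\min} = 0.10$, $A_{\max} = 0.90$, and one model measures $E_m = 45\,\mathrm{J}$ with $A_m = 0.74$. Then

$$
x_m = 1 - \frac{45 - 20}{120 - 20} = 0.75, \qquad
y_m = \frac{0.74 - 0.10}{0.90 - 0.10} = 0.80,
$$

$$
d_m = \sqrt{(1 - 0.75)^2 + (1 - 0.80)^2} \approx 0.32, \qquad
k_m = \left\lceil \frac{0.32}{\sqrt{2}/5} \right\rceil = 2, \qquad
\mathrm{score}_m = 6 - 2 = 4,
$$

so this model lands in the second ring.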
3. Algorithmic Workflow and Computational Characteristics
The CIRC algorithm applies the above formulations in linear time relative to the number of models $|M|$; the listing below gives the complete procedure:
```python
import math

def circ_ratings(models):
    """CIRC: map raw (energy, accuracy) pairs to ratings in {1, ..., 5}.
    `models` maps each model name to (E_m, A_m); lower energy and higher accuracy are better.
    """
    # 1. Compute global minima/maxima across all models
    e_min = min(e for e, _ in models.values())
    e_max = max(e for e, _ in models.values())
    a_min = min(a for _, a in models.values())
    a_max = max(a for _, a in models.values())

    delta_r = math.sqrt(2) / 5  # uniform ring width
    ratings = {}
    for name, (e_raw, a_raw) in models.items():
        # 2.1 normalize: efficiency is the inverted, min-max-scaled energy
        x = 1 - (e_raw - e_min) / (e_max - e_min)
        y = (a_raw - a_min) / (a_max - a_min)
        # 2.2 map to the 2D plane and compute distance to the ideal point (1, 1)
        d = math.sqrt((1 - x) ** 2 + (1 - y) ** 2)
        # 2.3 assign rating: ring index in {1..5}; max() keeps d = 0 in ring 1
        k = max(1, math.ceil(d / delta_r))
        ratings[name] = 6 - k  # ring 1 -> 5, ..., ring 5 -> 1
    return ratings
```
This pipeline is computationally trivial, requiring only min–max normalization and basic arithmetic, with no data-driven hyperparameters or validation.
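As a usage sketch, the call below applies `circ_ratings` to four hypothetical models; the names, joule figures, and accuracies are invented for illustration:

```python
# Hypothetical raw measurements: (energy in joules, accuracy in [0, 1]).
models = {
    "clm-small":  (35.0, 0.42),
    "clm-medium": (60.0, 0.61),
    "clm-large":  (110.0, 0.78),
    "clm-xl":     (150.0, 0.80),
}
print(circ_ratings(models))
# -> {'clm-small': 2, 'clm-medium': 4, 'clm-large': 3, 'clm-xl': 2}
```

Note that the two extreme models (most efficient but least accurate, and most accurate but least efficient) both sit at $d = 1$ and share a rating of 2, illustrating the symmetric treatment of the two axes.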
4. Robustness to Outliers and Invariance Properties
CIRC is robust to outliers due to its geometric and parameter-independent definitions. Both the ideal reference point $(1, 1)$ and the outermost ring boundary at $d = \sqrt{2}$ are fixed for every candidate set. The boundaries between score classes (the concentric rings) are likewise fixed increments in Euclidean space, uninfluenced by the empirical distribution or the presence of extreme performers. As a result, adding a model whose metrics fall within the existing ranges leaves the relative standing of other models untouched, and no addition, however extreme, can stretch the ring boundaries; the absence of fitted parameters ensures invariance to small subpopulations of outlying models.
5. Illustrative Case Studies
The application of CIRC is exemplified in two code-related tasks:
a) Code Generation (LiveCodeBench):
A model with moderate normalized efficiency and accuracy (for instance around 0.5 on each axis) yields $d \approx 0.71$. With $\Delta r = \sqrt{2}/5 \approx 0.28$, this falls in the third concentric ring, so the score is 3.
b) Code Summarization (CodeXGLUE):
Models frequently map to high normalized accuracy (around 0.9) and moderate efficiency (around 0.6), placing them in the second ring; these receive a score of 4. Only models achieving both metrics above 0.9 fall within the innermost ring, meriting the top score of 5 (the sketch below re-checks these assignments).
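To make the arithmetic in these examples easy to re-check, the helper below works directly on already-normalized coordinates; the function name and the sample points are illustrative assumptions rather than benchmark outputs.

```python
import math

DELTA_R = math.sqrt(2) / 5  # ring width of the 1-5 scheme

def circ_rating_from_normalized(x, y):
    """Rating from already-normalized efficiency x and accuracy y."""
    d = math.sqrt((1 - x) ** 2 + (1 - y) ** 2)
    return 6 - max(1, math.ceil(d / DELTA_R))

print(circ_rating_from_normalized(0.5, 0.5))    # -> 3 (code-generation example)
print(circ_rating_from_normalized(0.6, 0.9))    # -> 4 (code-summarization example)
print(circ_rating_from_normalized(0.95, 0.95))  # -> 5 (innermost ring)
```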
6. Methodological Assumptions and Hyperparameterization
Core assumptions include:
- All normalization is performed using min–max scaling on the active model set. Adding out-of-range models requires re-normalization and re-scoring.
- The reference center is fixed at the “best case” point $(1, 1)$.
- Five rings correspond to a 1–5 rating, but $R$ rings and scores in $\{1, \dots, R\}$ are a straightforward generalization: $\Delta r = \sqrt{2}/R$ and $\mathrm{score}_m = R + 1 - \lceil d_m / \Delta r \rceil$ (see the sketch after this list).
- All increments are uniform in Euclidean space, embodying symmetric trade-off considerations.
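A minimal sketch of that $R$-ring generalization, assuming the same normalization as above; the function name, its `num_rings` parameter, and the sample point are illustrative, not taken from the source:

```python
import math

def circ_rating_generalized(x, y, num_rings=5):
    """CIRC score in {1, ..., num_rings} for normalized (x, y); higher is better."""
    delta_r = math.sqrt(2) / num_rings            # uniform ring width
    d = math.sqrt((1 - x) ** 2 + (1 - y) ** 2)    # distance to the ideal point (1, 1)
    ring = max(1, math.ceil(d / delta_r))         # 1 = innermost ring
    return num_rings + 1 - ring

# The default reproduces the 1-5 scheme; a larger num_rings refines the scale.
print(circ_rating_generalized(0.75, 0.80))                 # -> 4
print(circ_rating_generalized(0.75, 0.80, num_rings=10))   # -> 8
```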
7. Comparative Analysis: CIRC Versus Trend-Based Evaluation
CIRC and Observation to Expectation Rating (OTER) are proposed as complementary methods, each with distinct trade-offs:
| Feature | CIRC | OTER |
|---|---|---|
| Statistical Adaptivity | Static, geometric, data-independent | Trend-aware, curve-fitted |
| Outlier Sensitivity | Absolute, outlier-immune | Requires robust outlier removal |
| Rating Boundaries | Fixed concentric rings | Bins based on fitted expectation curve |
| Interpretability | Plug-and-play, transparent | Hyperparameter-dependent |
| Trade-off Handling | Symmetric, no frontier exploitation | Can reward frontier-breaking models |
CIRC’s core advantages are deterministic operation, transparency, and absolute reproducibility. Its limitations include insensitivity to the empirical energy–accuracy frontier and a lack of increased discrimination when models are highly clustered. OTER offers trend-sensitive rankings but requires additional machinery (robust curve-fitting, hyperparameter selection, and constrained optimization). Each method addresses combined energy–accuracy benchmarking from a distinct perspective, with CIRC prioritized when simplicity and reproducibility are paramount (Mehditabar et al., 10 Nov 2025).