Fair comparison and principled assessment of efficiency–effectiveness trade-offs

Establish fair comparison and principled assessment frameworks for the efficiency–effectiveness trade-off in efficient reasoning methods for large language models, ensuring reliable evaluation across paradigms, backbone scales, and reasoning domains.

Background

The EffiReason-Bench paper critiques existing metrics such as Accuracy per Computation Unit (ACU) and Accuracy–Efficiency Score (AES): they misrepresent improvements and rely on heuristic thresholds, which leads to unstable and unfair comparisons.

Motivated by these limitations, the authors call out the lack of a principled, smooth, and stable evaluation framework for the efficiency–effectiveness balance as an open challenge, which they aim to address with the proposed E^3-Score.

References

As a result, both fair comparison and principled assessment of efficiency–effectiveness balance for efficient reasoning remain open challenges.

— EffiReason-Bench: A Unified Benchmark for Evaluating and Advancing Efficient Reasoning in Large Language Models (2511.10201 - Huang et al., 13 Nov 2025) in Related Work, Benchmarks for Reasoning Efficiency
