carps: A Framework for Comparing N Hyperparameter Optimizers on M Benchmarks (2506.06143v1)

Published 6 Jun 2025 in cs.LG

Abstract: Hyperparameter Optimization (HPO) is crucial to develop well-performing machine learning models. In order to ease prototyping and benchmarking of HPO methods, we propose carps, a benchmark framework for Comprehensive Automated Research Performance Studies allowing to evaluate N optimizers on M benchmark tasks. In this first release of carps, we focus on the four most important types of HPO task types: blackbox, multi-fidelity, multi-objective and multi-fidelity-multi-objective. With 3 336 tasks from 5 community benchmark collections and 28 variants of 9 optimizer families, we offer the biggest go-to library to date to evaluate and compare HPO methods. The carps framework relies on a purpose-built, lightweight interface, gluing together optimizers and benchmark tasks. It also features an analysis pipeline, facilitating the evaluation of optimizers on benchmarks. However, navigating a huge number of tasks while developing and comparing methods can be computationally infeasible. To address this, we obtain a subset of representative tasks by minimizing the star discrepancy of the subset, in the space spanned by the full set. As a result, we propose an initial subset of 10 to 30 diverse tasks for each task type, and include functionality to re-compute subsets as more benchmarks become available, enabling efficient evaluations. We also establish a first set of baseline results on these tasks as a measure for future comparisons. With carps (https://www.github.com/automl/CARP-S), we make an important step in the standardization of HPO evaluation.

Authors (16)
  1. Carolin Benjamins (12 papers)
  2. Helena Graf (2 papers)
  3. Sarah Segel (2 papers)
  4. Difan Deng (10 papers)
  5. Tim Ruhkopf (3 papers)
  6. Leona Hennig (4 papers)
  7. Soham Basu (7 papers)
  8. Neeratyoy Mallik (12 papers)
  9. Edward Bergman (8 papers)
  10. Deyao Chen (2 papers)
  11. François Clément (24 papers)
  12. Matthias Feurer (19 papers)
  13. Katharina Eggensperger (18 papers)
  14. Frank Hutter (177 papers)
  15. Carola Doerr (117 papers)
  16. Marius Lindauer (71 papers)

Summary

  • The paper introduces CARPS, a standardized framework that simplifies the benchmarking of HPO methods to enhance usability and reproducibility.
  • It details methodological innovations such as subselection via star discrepancy and the integration of varied tasks including BB, MF, MO, and MOMF.
  • Empirical evaluations using the Friedman and Nemenyi tests validate the framework’s ability to reliably rank optimizer performance across task types.

Overview of the CARPS Framework for Hyperparameter Optimization Benchmarking

The paper "carps: A Framework for Comparing N Hyperparameter Optimizers on M Benchmarks" presents a framework designed to facilitate the evaluation and benchmarking of hyperparameter optimization (HPO) methods. The authors introduce CARPS, which addresses several challenges in optimizing the hyperparameters of machine learning models across diverse benchmarking tasks. The framework simplifies the integration of new optimizers and benchmarks through a standardized, lightweight interface, promoting ease of use, extensibility, and scalability.

Framework Design and Functionality

CARPS is structured to provide a streamlined process for prototyping, developing, and benchmarking HPO methods. The interface between optimizers and tasks is deliberately kept lean, utilizing the established ConfigSpace library for hyperparameter configuration spaces. This is supported by two core structures, TrialInfo and TrialValue, which encapsulate information for trial execution and results, respectively.
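
To make this concrete, the sketch below shows how such a lean ask-tell loop might look. The field names (`config`, `budget`, `cost`, `time`) and the `optimizer`/`task` objects are illustrative assumptions for this summary, not the exact carps API; only the use of ConfigSpace and the TrialInfo/TrialValue pairing are taken from the paper.

```python
# Illustrative sketch of a lean ask-tell interface between an optimizer and a
# task; field names and the optimizer/task objects are assumptions, not the
# exact carps API.
from dataclasses import dataclass
from ConfigSpace import Configuration


@dataclass
class TrialInfo:
    config: Configuration         # hyperparameter configuration to evaluate
    budget: float | None = None   # fidelity level for multi-fidelity (MF) tasks


@dataclass
class TrialValue:
    cost: float | list[float]     # scalar cost, or a list for multi-objective (MO) tasks
    time: float = 0.0             # wall-clock time of the trial


def run(optimizer, task, n_trials: int) -> None:
    """Minimal optimization loop: ask for a trial, evaluate it, tell the result."""
    for _ in range(n_trials):
        trial_info: TrialInfo = optimizer.ask()              # propose a configuration
        trial_value: TrialValue = task.evaluate(trial_info)  # run the benchmark
        optimizer.tell(trial_info, trial_value)              # feed the result back
```

Keeping the interface this small is what allows new optimizers and benchmark tasks to be wired together with minimal adapter code.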

The framework is equipped to handle four primary types of HPO tasks:

  1. Blackbox (BB) Tasks: Classic optimization problems where only inputs and outputs are observable, with no insight into the underlying process.
  2. Multi-Fidelity (MF) Tasks: Optimization tasks in which the objective can be queried at varying levels of computational budget, enabling cheaper approximate evaluations.
  3. Multi-Objective (MO) Tasks: Tasks involving multiple objectives, adding complexity to the optimization process.
  4. Multi-Fidelity-Multi-Objective (MOMF) Tasks: Combining elements of both multi-objective and multi-fidelity optimization.

CARPS integrates benchmarks from several prominent suites such as BBOB, HPOBench, YAHPO, MFPBench, and Pymoo-MO, forming a comprehensive library of tasks that accommodate various configurations in terms of dimensionalities and objectives.

Subselection Methodology for Representative Benchmarking

The authors address the computational infeasibility of evaluating optimizers across all tasks, given the sheer number available. They propose and implement a subselection process for each task type based on minimizing the star discrepancy, a measure of how far a point set deviates from a uniform covering of the space. This yields two disjoint subsets for development and testing, enabling efficient evaluation and unbiased reporting of optimizer performance.

Selection is informed by performance data from several optimizers, including RandomSearch, Bayesian Optimization, and CMA-ES, so that the chosen tasks reflect the behavior of typical HPO methods. Through this procedure, representative subsets of tasks are determined that effectively cover the objective function space.
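
As a rough illustration of the idea (not the paper's exact algorithm), the sketch below greedily selects a subset of tasks, embedded as points in a normalized performance space, so that the selected points cover that space uniformly. It uses SciPy's L2-star discrepancy as a computable stand-in for the star discrepancy, and the task embedding is random data purely for illustration.

```python
# Greedy subset selection by discrepancy: a simplified stand-in for the paper's
# star-discrepancy-based subselection. SciPy provides the L2-star discrepancy;
# the (L-infinity) star discrepancy used in the paper would need a dedicated solver.
import numpy as np
from scipy.stats import qmc


def select_subset(points: np.ndarray, k: int) -> list[int]:
    """Greedily pick k rows of `points` (assumed scaled to [0, 1]^d) so that the
    selected set has low L2-star discrepancy, i.e. covers the space uniformly."""
    selected: list[int] = []
    remaining = list(range(len(points)))
    for _ in range(k):
        best_idx, best_disc = None, np.inf
        for idx in remaining:
            candidate = points[selected + [idx]]
            disc = qmc.discrepancy(candidate, method="L2-star")
            if disc < best_disc:
                best_idx, best_disc = idx, disc
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected


# Example: tasks embedded as points in a normalized performance space
# (random data purely for illustration).
rng = np.random.default_rng(0)
task_embedding = rng.random((200, 4))   # 200 tasks, 4 normalized features
subset = select_subset(task_embedding, k=20)
```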

Benchmarking Experimentation and Analysis

CARPS provides a detailed analysis pipeline, employing non-parametric methods to evaluate optimizer performance across task types. The framework conducts extensive empirical evaluations, leveraging statistical tests like the Friedman test and Nemenyi test to assess performance rankings and critical differences among optimizers.
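
The snippet below illustrates this style of analysis: a Friedman test over per-task optimizer costs, followed by a Nemenyi post-hoc test. It relies on SciPy and the third-party scikit-posthocs package and is a sketch of the statistical procedure, not the carps analysis pipeline itself.

```python
# Sketch of the ranking analysis described above: a Friedman test across
# optimizers, followed by a Nemenyi post-hoc test. Random data purely for
# illustration; not the carps pipeline itself.
import numpy as np
import scikit_posthocs as sp
from scipy.stats import friedmanchisquare

# Rows: benchmark tasks, columns: optimizers (final costs, lower is better).
rng = np.random.default_rng(0)
costs = rng.random((30, 5))  # 30 tasks, 5 optimizers

# Friedman test: do the optimizers' rankings differ significantly across tasks?
stat, p_value = friedmanchisquare(*costs.T)
print(f"Friedman statistic={stat:.2f}, p={p_value:.3f}")

# Nemenyi post-hoc test: pairwise comparisons if the Friedman test rejects.
if p_value < 0.05:
    pairwise_p = sp.posthoc_nemenyi_friedman(costs)  # DataFrame of pairwise p-values
    print(pairwise_p)
```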

The experiments show that optimizer rankings are largely consistent between the development and test subsets, supporting the subselection procedure. Although optimizer strengths vary across tasks, CARPS offers a reliable methodology for identifying complementary optimizers, guiding users toward appropriate strategies for distinct HPO challenges.

Implications and Future Directions

CARPS represents a substantial advancement in standardizing HPO evaluation frameworks, reducing computational overhead and promoting reproducibility. The comprehensive integration of benchmarks and optimizers paves the way for more nuanced evaluations and developments in optimizer design.

Future adaptations of the framework could extend to broader AutoML challenges, incorporating parallel execution models and constraint-based optimization, further enriching the ecosystem of HPO tools. Additionally, CARPS could serve as foundational infrastructure for active benchmarking, dynamically selecting tasks to present a holistic view of optimizer capabilities.

By significantly lowering the barrier to entry for HPO benchmarking, CARPS is poised to enhance research rigor and accelerate progress in optimizing machine learning models.
