
COCO Platform for Bi-objective Optimization

Updated 20 February 2026
  • The paper introduces a rigorous benchmarking framework that pairs well-understood single-objective bbob functions to construct bi-objective problems for evaluating black-box optimization performance.
  • COCO leverages hypervolume-based metrics and empirical running time assessments to ensure reproducible, instance-averaged results.
  • Its test suites, bbob-biobj and bbob-biobj-ext, provide scalable, diverse problem instances critical for comparing both deterministic and stochastic optimizers.

COCO (COmparing Continuous Optimizers) provides a rigorous, reproducible benchmarking framework for black-box optimization, supporting both single- and multi-objective problems. Within COCO, the bbob-biobj and bbob-biobj-ext suites serve as standardized testbeds for bi-objective black-box optimization, leveraging combinations of well-understood single-objective functions to produce challenging, scalable, and diverse evaluation scenarios. The platform prescribes protocols for function instantiation, performance assessment (notably through hypervolume-based indicators and empirical running time), and data-driven, instance-averaged analysis. The overall aim is to enable fair and meaningful comparison of both deterministic and stochastic optimizers across a broad range of continuous bi-objective optimization problems (Brockhoff et al., 2016, Brockhoff et al., 2016, Hansen et al., 2016, Loshchilov et al., 2016).

1. Motivation and Design Principles

COCO’s bi-objective benchmarking methodology addresses critical weaknesses of traditional multi-objective test suites by constructing each bi-objective problem as a pair of single-objective bbob functions, inheriting their calibrated and well-studied properties. This construction avoids non-representative features such as excessively separable or boundary-aligned Pareto fronts, and artificially structured decision variables often found in classical Pareto-optimal test problems.

Rather than focusing on artificial Pareto front shapes, the bbob-biobj design is rooted in the observation that most real-world multi-objective optimization (MOO) problems combine scalar objectives arising from distinct sources or modeling phenomena. By leveraging 24 archetypal bbob functions organized into five difficulty groups (separable, moderate conditioning, ill-conditioned, multi-modal with global structure, and weakly structured multi-modal), and pairing them systematically, the test suites expose algorithmic strengths and weaknesses across relevant landscape features such as multimodality, ill-conditioning, non-separability, and smoothness (Brockhoff et al., 2016).

2. Test Suite Construction and Function Definition

The core test suite, bbob-biobj, comprises 55 unique bi-objective minimization problems, each defined as

F(x) = (f_\alpha(x), f_\beta(x)), \quad x \in \mathbb{R}^n

where each of f_\alpha and f_\beta is a bbob single-objective function subject to randomly generated, instance-specific transformations:

f^\theta(x) = H^\theta\big(f_\mathrm{raw}(T^\theta(x))\big)

Transformations T^\theta (search-space shifts, rotations, coordinate perturbations) and H^\theta (objective shifts, monotone distortions) ensure that each problem realization is statistically independent but structurally similar.

The 55 bbob-biobj problems enumerate all unordered pairs from 10 representative bbob functions (two per group where feasible), while the extended bbob-biobj-ext suite comprises 92 problems, adding within-group combinations to diversify the within-group challenge without redundancy. Each bi-objective test problem is, by design, scalable in dimension n and is instantiated with 15 pseudo-random instances (distinct transformation parameters), permitting statistically robust benchmarking (Brockhoff et al., 2016).
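The 55-problem count follows directly from this pairing rule: all unordered pairs of 10 functions, including each function paired with itself, give C(10, 2) + 10 = 55. A quick sketch (the ten function indices listed here are the commonly cited representatives and should be treated as an assumption):

```python
from itertools import combinations_with_replacement

# Indices of the 10 representative bbob functions (assumed here; check
# the suite documentation for the authoritative list).
REPRESENTATIVE_FUNCTIONS = [1, 2, 6, 8, 13, 14, 15, 17, 20, 21]

# All unordered pairs, a function paired with itself included:
# C(10, 2) + 10 = 45 + 10 = 55 bi-objective problems.
pairs = list(combinations_with_replacement(REPRESENTATIVE_FUNCTIONS, 2))
print(len(pairs))  # 55
```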

3. Objective-Space Normalization and Performance Metrics

Objective normalization is crucial for cross-function and cross-algorithm comparison. Raw outputs are mapped to [0,1]^2 via

\tilde{F}_i(x) = \frac{f_i(x) - z_i^*}{z_i^\mathrm{nadir} - z_i^*}, \quad i \in \{\alpha, \beta\}

where z_i^* is the ideal value (the global minimum of f_i) and z_i^\mathrm{nadir} is the nadir value (the largest f_i attained over the Pareto set; for two objectives, the value of f_i at the Pareto-optimal point minimizing the other objective f_j). For most functions, these can be efficiently computed from known minima (Brockhoff et al., 2016, Brockhoff et al., 2016).
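The normalization is a simple affine map; a minimal sketch with toy ideal and nadir values (not taken from any actual bbob function):

```python
def normalize(f_value, z_ideal, z_nadir):
    """Affine map of a raw objective value to [0, 1] via ideal/nadir points."""
    return (f_value - z_ideal) / (z_nadir - z_ideal)

# Toy values for illustration only:
print(normalize(3.0, z_ideal=1.0, z_nadir=5.0))  # 0.5
```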

The principal quality indicator is the dominated hypervolume (HV) of the non-dominated archive A_t with respect to a reference point, r = (1,1) in normalized space:

\mathrm{HV}(A_t) = \lambda\left(\bigcup_{a \in A_t} [\tilde{F}_\alpha(a), 1] \times [\tilde{F}_\beta(a), 1]\right)

where \lambda(\cdot) denotes the Lebesgue measure on \mathbb{R}^2.

Runtimes are measured as the minimal number of function evaluations required to reach prescribed target hypervolume precisions \Delta I (uniformly spaced log-scale steps, measured in the normalized space [0,1]^2), aggregated over instances and problems (Brockhoff et al., 2016, Hansen et al., 2016).
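For two objectives, the hypervolume union reduces to a left-to-right sweep over the sorted non-dominated front. A minimal, self-contained sketch (independent of COCO's actual implementation, assuming minimization in the normalized [0,1]^2 space):

```python
def hypervolume_2d(points, ref=(1.0, 1.0)):
    """Dominated hypervolume of a 2-D minimization archive w.r.t. ref."""
    # Keep only points that strictly dominate the reference point.
    pts = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    # Extract the non-dominated front (f2 strictly decreasing along f1).
    front, best_f2 = [], float("inf")
    for f1, f2 in pts:
        if f2 < best_f2:
            front.append((f1, f2))
            best_f2 = f2
    # Sweep left to right: each front point contributes one rectangle.
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in front:
        hv += (ref[0] - f1) * (prev_f2 - f2)
        prev_f2 = f2
    return hv

print(hypervolume_2d([(0.5, 0.5)]))  # 0.25
```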

4. Benchmarking Protocol and COCO Workflow

COCO’s benchmarking is organized in three architectural layers: (1) the suite generator (e.g., bbob-biobj, bbob-biobj-ext), which handles problem instantiation; (2) the experiment observer, which logs optimizer progress and archive states and triggers event recording when new targets are reached; and (3) the post-processing toolchain, which produces reproducible tables and ECDF plots.

Each experiment run follows this protocol:

  1. For each problem instance (specified by problem ID, dimension, and instance number), initialize optimizer and attach observer.
  2. Iterate: propose a candidate x, evaluate F(x) (a single function call returns a vector in \mathbb{R}^2), and update the non-dominated archive. The observer incrementally computes HV and checks for attainment of each target indicator I_{i,t}.
  3. On target attainment, the evaluation count is logged as the runtime for that target.
  4. Proceed to the next problem. When complete, summarized logs (evaluations per target, HV traces) are post-processed into ECDFs and ERT tables (Brockhoff et al., 2016, Hansen et al., 2016, Loshchilov et al., 2016).
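The steps above can be sketched as a plain random-search loop over a toy bi-objective problem. Everything here is illustrative: a simple best-f1 target list stands in for the hypervolume precision targets the actual observer tracks, and the problem is not a suite member.

```python
import random

def toy_biobjective(x):
    """Toy stand-in for a bi-objective problem: two shifted sphere functions."""
    f1 = sum(xi ** 2 for xi in x)
    f2 = sum((xi - 1.0) ** 2 for xi in x)
    return (f1, f2)

def update_archive(archive, point):
    """Keep only mutually non-dominated points (minimization)."""
    if any(a[0] <= point[0] and a[1] <= point[1] for a in archive):
        return archive  # point is (weakly) dominated: discard it
    archive = [a for a in archive
               if not (point[0] <= a[0] and point[1] <= a[1])]
    archive.append(point)
    return archive

random.seed(42)
archive, runtimes = [], {}
targets = [10.0, 5.0, 2.0]  # illustrative f1 targets, not HV precisions
for evals in range(1, 201):
    x = [random.uniform(-5, 5) for _ in range(2)]
    archive = update_archive(archive, toy_biobjective(x))
    best_f1 = min(a[0] for a in archive)
    for t in targets:
        if t not in runtimes and best_f1 <= t:
            runtimes[t] = evals  # evaluation count logged on first attainment
```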

COCO provides language-specific wrappers (notably Python/cocoex), a standardized observer interface, and command-line tools for automated analysis and visualization.

5. Problem Instances, Randomization, and Statistical Rigor

Each bi-objective problem instance is determined by a deterministic mapping from a single integer K_\mathrm{ID}^F to two single-objective instance IDs:

K_\mathrm{ID}^{f_\alpha} = 2 K_\mathrm{ID}^F + 1, \quad K_\mathrm{ID}^{f_\beta} = K_\mathrm{ID}^{f_\alpha} + 1

To avoid spurious similarity, instance re-generation skips cases where optima are within 10^{-4} in \ell_2 distance or where the objective-range difference is below 10^{-1}. Each instance is thus reliably independent, justifying performance aggregation over the set of 15 instances per problem (Brockhoff et al., 2016).
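The ID mapping is a one-line arithmetic rule; a sketch (the function name is illustrative):

```python
def biobj_instance_ids(K_F):
    """Derive the two single-objective instance IDs from a bi-objective
    instance ID, per the mapping stated above."""
    K_alpha = 2 * K_F + 1
    return K_alpha, K_alpha + 1

print(biobj_instance_ids(1))  # (3, 4)
```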

This instance-driven paradigm enables principled comparisons between deterministic and stochastic algorithms: each instance is an independent trial. Thus, runtime statistics—ERT (expected running time), ECDFs—are validly averaged across runs and algorithms, mitigating risks of overfitting or bias to a particular function shape or orientation (Hansen et al., 2016).

6. Aggregation, Visualization, and Analysis

Post-processing aggregates runtime data into empirical cumulative distribution functions (ECDF), which plot the fraction of (problem, instance, target) pairs solved within a given evaluation budget. These “data profiles” synthesize both anytime behavior and overall algorithmic robustness. Other aggregated views include ERT versus dimension, per-problem performance breakdowns, attainment surfaces, and indicator-specific histograms.
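Each ECDF point is simply the fraction of (problem, instance, target) triples solved within a given budget. A minimal sketch with hypothetical runtime data (evaluation counts, with None marking an unreached target):

```python
def ecdf(runtimes, budgets):
    """Fraction of triples solved within each budget; None = never solved."""
    n = len(runtimes)
    return [sum(1 for r in runtimes if r is not None and r <= b) / n
            for b in budgets]

# Hypothetical runtimes for five triples; one target was never reached.
rts = [10, 40, 40, 300, None]
print(ecdf(rts, [10, 50, 1000]))  # [0.2, 0.6, 0.8]
```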

COCO’s post-processing utilities (e.g., cocopp) offer standardized output, including hypervolume convergence plots, statistical tables, and LaTeX-ready figures, facilitating reproducible, transparent reporting suited for academic publications (Brockhoff et al., 2016, Hansen et al., 2016). Negative target precisions (allowing solutions to outperform the reference front) and tables of ERT ratios provide diagnostic detail.

7. Extensibility and Recommendations

The construction methodology of bbob-biobj generalizes to m-objective suites: all m-wise multicombinations of the five function groups determine problem classes, with an instance sampling protocol ensuring balanced within- and across-group coverage. For each m-tuple of groups, the suite samples a predefined number of problems, yielding a scalable, extensible testbed for arbitrary multi-objective black-box optimization benchmarking (Brockhoff et al., 2016).
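The number of problem classes for an m-objective suite is the count of m-wise multicombinations of the five groups, which itertools enumerates directly (group names abbreviated here for illustration):

```python
from itertools import combinations_with_replacement

# The five bbob function groups, names abbreviated:
GROUPS = ["separable", "moderate", "ill-conditioned",
          "multi-modal structured", "multi-modal weak"]

# Problem classes of an m-objective suite = m-wise multicombinations,
# i.e. C(5 + m - 1, m) classes.
for m in (2, 3, 4):
    classes = list(combinations_with_replacement(GROUPS, m))
    print(m, len(classes))
# prints:
# 2 15
# 3 35
# 4 70
```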

Best practices include: treating problem instances as independent trials (critical for assessing stochastic algorithms), reporting ERT and HV-based curves, and isolating dimension-specific conclusions rather than cross-dimensional aggregates. Summary recommendations also emphasize maintaining a mix of problem classes (for broad generalization claims) and exploiting COCO’s automated visualization and reporting for statistical transparency (Brockhoff et al., 2016, Hansen et al., 2016, Loshchilov et al., 2016).
