Preferential Amortized Black-Box Optimization
- PABBO is a framework that uses meta-learning and transformer-based neural processes to perform efficient black-box optimization via ordinal feedback.
- The approach features a meta-learned policy for pairwise preference selection, yielding significant speedups over traditional GP-based methods.
- PABBO is versatile for univariate, multivariate, and multi-objective tasks, with applications in simulation design, hyperparameter tuning, and human-in-the-loop optimization.
Preferential Amortized Black-Box Optimization (PABBO) denotes a class of methods for sample-efficient black-box optimization when evaluations are noisy, expensive, or only available through ordinal or preference feedback—rather than direct access to the underlying objective. PABBO frameworks amortize both surrogate modeling and acquisition function learning via meta-learning across a distribution of related tasks. The primary innovations include the meta-learned policy for pairwise preference (dueling) selection, transformer-based neural process architectures for high-throughput surrogate inference, and tailored auxiliary losses for stable and efficient reinforcement learning. PABBO has been instantiated for univariate, multivariate, and multi-objective optimization, with applications ranging from simulation-based design to hyperparameter tuning under strict budget constraints and scenarios requiring real-time human-in-the-loop optimization.
1. Formulation and Motivation
PABBO addresses optimization tasks where the black-box function is only accessible via ordinal (preference) comparisons over candidate pairs, not through direct scalar outputs. The goal is to identify $x^{\star} = \arg\max_{x \in \mathcal{X}} f(x)$, where for a queried pair $(x_1, x_2)$, feedback is limited to whether $x_1$ is preferred to $x_2$. The preference model reflects Thurstonian (probit) or logistic noise: writing $y = 1$ when $x_1 \succ x_2$, the likelihood is $p(y = 1) = \Phi\big(f(x_1) - f(x_2)\big)$ under probit noise, or analogously $p(y = 1) = \sigma\big(f(x_1) - f(x_2)\big)$, with $\sigma$ the sigmoid function. Classical preferential Bayesian optimization (PBO) typically employs a GP-based surrogate and acquisition optimization, but these entail high computational cost and non-conjugate likelihoods, prohibitive for interactive or real-time settings. PABBO methods amortize both surrogate inference and acquisition learning via meta-learning, enabling orders-of-magnitude faster inference and supporting strictly limited evaluation budgets (Zhang et al., 2 Mar 2025).
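As a concrete illustration of this feedback model, the following minimal sketch (toy utility, logistic or probit noise; all names are illustrative) simulates the kind of pairwise duels PABBO consumes:

```python
import math
import numpy as np

def preference_duel(f, x1, x2, rng, noise="logistic"):
    """Return 1 if x1 wins a noisy duel against x2, given a latent utility f."""
    delta = f(x1) - f(x2)
    if noise == "logistic":                      # Bradley-Terry / logistic likelihood
        p = 1.0 / (1.0 + np.exp(-delta))
    else:                                        # Thurstonian (probit) likelihood
        p = 0.5 * (1.0 + math.erf(delta / math.sqrt(2)))
    return int(rng.random() < p)

rng = np.random.default_rng(0)
f = lambda x: -np.sum((x - 0.3) ** 2)            # toy latent utility, maximized at x = 0.3
x1, x2 = rng.random(2), rng.random(2)            # two random 2-D candidate designs
print(preference_duel(f, x1, x2, rng))           # ordinal feedback: 1 means x1 preferred
```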
2. Model Architecture and Meta-Learning
The canonical PABBO architecture consists of a Transformer-Neural-Process model comprising the following:
- Point Embedding: Each design and associated label are embedded using a shared MLP, producing feature vectors for both context (historical duels) and query-set candidates.
- Transformer Encoder: Context and query embeddings are concatenated and processed through stacked multi-head masked self-attention layers, enabling permutation-invariant aggregation of pairwise preference data and context adaptation.
- Acquisition Head: Takes pairs of transformed embeddings (optionally with normalized time) and outputs a scalar “pairwise acquisition” score, which is transformed into a categorical policy via softmax over all query pairs.
- Prediction Head: Predicts Gaussian parameters $(\mu_i, \sigma_i^2)$ of the latent utility for each candidate $x_i$; the pairwise preference probability is estimated by $p(x_i \succ x_j) \approx \Phi\!\big((\mu_i - \mu_j)/\sqrt{\sigma_i^2 + \sigma_j^2}\big)$, with $\Phi$ the standard Gaussian CDF.
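A compressed PyTorch-style sketch of these four components follows; the layer sizes echo the configuration guidance in Section 7, but the names, dummy-label padding of queries, and omission of attention masking are illustrative simplifications rather than the published architecture.

```python
import torch
import torch.nn as nn

class PABBOSketch(nn.Module):
    """Illustrative transformer-neural-process skeleton: point embedding,
    shared encoder, pairwise acquisition head, and Gaussian prediction head."""
    def __init__(self, x_dim, d_model=64, n_heads=4, n_layers=6, ff=128):
        super().__init__()
        # Point embedding: design x plus (optional) preference label mapped to d_model.
        self.embed = nn.Sequential(nn.Linear(x_dim + 1, d_model), nn.ReLU(),
                                   nn.Linear(d_model, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, ff, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # Acquisition head scores a *pair* of encoded candidates.
        self.acq_head = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.ReLU(),
                                      nn.Linear(d_model, 1))
        # Prediction head outputs Gaussian (mean, log-variance) of the latent utility.
        self.pred_head = nn.Linear(d_model, 2)

    def forward(self, context, queries, pair_idx):
        # context: (B, Nc, x_dim+1) historical duels with labels
        # queries: (B, Nq, x_dim)   candidates, padded here with a dummy label of 0
        q_pad = torch.cat([queries, queries.new_zeros(*queries.shape[:2], 1)], dim=-1)
        h = self.encoder(self.embed(torch.cat([context, q_pad], dim=1)))
        hq = h[:, context.shape[1]:]                       # encoded query candidates
        left, right = hq[:, pair_idx[:, 0]], hq[:, pair_idx[:, 1]]
        acq_scores = self.acq_head(torch.cat([left, right], -1)).squeeze(-1)
        policy = torch.softmax(acq_scores, dim=-1)         # categorical policy over pairs
        mean, log_var = self.pred_head(hq).chunk(2, dim=-1)
        return policy, mean, log_var
```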
PABBO learns these components end-to-end by minimizing a composite loss comprising a REINFORCE-based policy gradient (for acquisition) and auxiliary binary cross-entropy loss (for predictive accuracy). Meta-training is performed on a large synthetic or logged dataset of preference duels sampled from a prior task distribution (Zhang et al., 2 Mar 2025).
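A hedged sketch of this composite objective is shown below; the reward definition, the mean baseline, and the weighting `alpha` are illustrative placeholders rather than the exact terms of the paper.

```python
import torch
import torch.nn.functional as F

def composite_loss(log_probs, rewards, pref_logits, pref_labels, alpha=1.0):
    """REINFORCE policy-gradient term for pair selection plus an auxiliary
    binary cross-entropy term on predicted preference probabilities."""
    # log_probs: (T,) log pi(a_t | history); rewards: (T,) e.g. improvement-based returns.
    baseline = rewards.mean()                              # simple variance-reduction baseline
    policy_loss = -((rewards - baseline).detach() * log_probs).mean()
    # Auxiliary head: predict the observed duel outcomes from the surrogate.
    aux_loss = F.binary_cross_entropy_with_logits(pref_logits, pref_labels.float())
    return policy_loss + alpha * aux_loss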
3. Algorithmic Workflow and Pseudocode
At test time, the procedure operates as follows:
- A candidate set of size $M$ is sampled (e.g., via a Sobol sequence); all or a subset of the candidate pairs are formed.
- At each optimization step $t$:
- Embed historical context and candidate queries.
- Compute acquisition scores via the acquisition head and generate a policy over pairs.
- Sample a pair for preference feedback, update the context with the observed label.
- Repeat for the allotted budget of duels, then output the design with the highest inferred utility.
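Schematically, and assuming a meta-trained model with the interface sketched in Section 2 plus a placeholder preference `oracle`, this test-time loop might look as follows (all names, the Sobol candidate count, and the budget are illustrative):

```python
import itertools
import torch
from scipy.stats import qmc

def pabbo_optimize(model, oracle, x_dim, budget=30, n_candidates=64, seed=0):
    """Sequential preferential optimization with a meta-trained PABBO-style model."""
    sampler = qmc.Sobol(d=x_dim, seed=seed)
    cands = torch.tensor(sampler.random(n_candidates), dtype=torch.float32)
    pair_idx = torch.tensor(list(itertools.combinations(range(n_candidates), 2)))
    context = torch.zeros(1, 0, x_dim + 1)               # empty history of labelled duels
    for t in range(budget):
        policy, _, _ = model(context, cands.unsqueeze(0), pair_idx)
        i, j = pair_idx[torch.multinomial(policy[0], 1).item()]
        label = oracle(cands[i], cands[j])                # 1 if x_i preferred, else 0
        winner, loser = (i, j) if label else (j, i)
        new = torch.cat([torch.stack([cands[winner], cands[loser]]),
                         torch.tensor([[1.0], [0.0]])], dim=-1).unsqueeze(0)
        context = torch.cat([context, new], dim=1)        # append the observed duel
    _, mean, _ = model(context, cands.unsqueeze(0), pair_idx)
    return cands[mean[0, :, 0].argmax()]                  # candidate with highest inferred utility
```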
Empirical runtime is $O(M^2)$ per step, dominated by evaluating the pairwise acquisition over the candidate set; this is a negligible cost relative to classic GP-based PBO, whose inference scales cubically, $O(n^3)$, in the number of observations (Zhang et al., 2 Mar 2025). Batch variants (e.g., “Batch-PABBO”) can further amortize candidate selection in parallel environments.
4. Application Domains and Benchmarking
PABBO has been applied in diverse settings:
- Few-Shot Black-Box Optimization: “HPFSO” applies an offline-configured preferential down-selection policy to portfolios of optimizers under extreme budget constraints (100 evaluations, batches of 8), achieving roughly a 50% reduction in mean normalized cost over HEBO/CMA-ES on BBOB/COCO benchmarks (Ansotegui et al., 2021).
- Discrete Hyperparameter Tuning: For topic-number selection in LDA, PABBO, trained with reinforcement learning and auxiliary losses, reaches near-optimal held-out perplexity after 3–5 queries, far outperforming genetic algorithms and evolution strategies in sample and wall-clock efficiency (Akramov et al., 18 Dec 2025).
- Human-in-the-Loop and Out-of-Distribution Tasks: On preference-driven human datasets (e.g., Sushi/Candy) and high-dimensional HPO, PABBO matches or exceeds GP-based PBO at a fraction of the computational cost (Zhang et al., 2 Mar 2025).
- Multi-Objective Optimization: The A-GPS framework generalizes PABBO to the generation of Pareto sets, using preference-aligned conditional generative models and class probability estimators for both non-dominance and subjective alignment, achieving state-of-the-art performance in many-objective benchmarks (Steinberg et al., 23 Oct 2025).
Key Quantitative Outcomes
| Benchmark | PABBO Result | Comparison | Speedup / Efficiency |
|---|---|---|---|
| HPO-B 6–16D | Best or 2nd-best simple regret | Beats GP-based PBO | Orders-of-magnitude faster inference |
| LDA topic tuning | 3–5 evals to plateau | GA/ES require roughly 20 or more evals | PABBO fastest wall-clock; GA slowest |
| BBOB/COCO (HPFSO) | ~50% lower mean normalized cost | HEBO/CMA-ES: 0.135/0.142 | n/a |
PABBO architectures also demonstrate robust out-of-distribution generalization with only mild performance drop under pure ranking-based reward pretraining.
5. Instantiations: Discrete, Batch, and Multi-Objective PABBO
- Discrete Domains: PABBO’s architecture accommodates categorical or integer spaces by embedding discrete candidate pairs and learning preference policies without numeric access to the objective, relying only on pairwise duels (Akramov et al., 18 Dec 2025).
- Batch/Parallel PABBO: For scenarios with multiple parallel evaluations per round, batch versions of PABBO select candidate sets via Monte Carlo policies and offline-tuned feature scoring, combining diverse single-objective optimizers (CMA, DE, GBM-LCB, GGA++, TuRBO) for high sample efficiency (Ansotegui et al., 2021).
- Multi-Objective PABBO (A-GPS): Learns a conditional generator $p(x \mid u)$ over solutions, where $u$ is a user-specified or learned preference direction in objective space. Class-probability estimators for non-dominance and direction alignment guide the generator toward any region of interest on the Pareto front without retraining (Steinberg et al., 23 Oct 2025).
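A minimal sketch of the two guiding signals named above, non-dominance and preference-direction alignment, is given below; the A-GPS conditional generator and its class-probability estimators are not reproduced, and maximization of all objectives plus cosine alignment are assumptions of this sketch.

```python
import numpy as np

def non_dominated_mask(Y):
    """Boolean mask of Pareto non-dominated rows of Y (all objectives maximized)."""
    n = Y.shape[0]
    mask = np.ones(n, dtype=bool)
    for i in range(n):
        # Row j dominates row i if it is >= everywhere and strictly > somewhere.
        dominated = np.any(np.all(Y >= Y[i], axis=1) & np.any(Y > Y[i], axis=1))
        mask[i] = not dominated
    return mask

def direction_alignment(Y, u):
    """Cosine alignment of each objective vector with a preference direction u."""
    u = u / np.linalg.norm(u)
    Yn = Y / (np.linalg.norm(Y, axis=1, keepdims=True) + 1e-12)
    return Yn @ u
```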
6. Theoretical Guarantees and Limitations
PABBO policies optimized by REINFORCE inherit asymptotic convergence guarantees under standard regularity conditions (sufficient exploration of the pair space, bounded gradients). However, no explicit finite-sample or Bayesian regret bounds are available, and performance depends critically on the representativeness of the meta-training data (Zhang et al., 2 Mar 2025, Akramov et al., 18 Dec 2025). The auxiliary surrogate head stabilizes optimization, analogous to “behavior cloning” regularization.
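For reference, the policy-gradient estimator implied above takes the standard REINFORCE form (the notation here is generic: $a_t$ is the selected duel at step $t$, $h_t$ the preference history, $r_t$ a reward, and $b$ a baseline; the specific reward used by PABBO follows the cited work):

$$
\nabla_{\phi}\, J(\phi) \;\approx\; \frac{1}{T}\sum_{t=1}^{T} \big(r_t - b\big)\, \nabla_{\phi} \log \pi_{\phi}\!\left(a_t \mid h_t\right),
$$

whose bounded-gradient and sufficient-exploration conditions are exactly the regularity assumptions referenced above.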
Identified limitations include:
- Quadratic runtime in the candidate-set size $M$, mitigated via moderate query-set sizes or batch strategies.
- Fixed input dimensionality per meta-trained model; separate models are needed for different input dimensions $d$.
- Heavy reliance on the quality and breadth of meta-data for generalization.
7. Practical Guidance and Configurations
Typical configuration guidelines are as follows:
- Query-set size $M$: task-dependent; the pairwise-acquisition cost grows quadratically in $M$, so moderate candidate pools are preferred.
- Transformer depth 6, hidden size 64, feed-forward width 128, 4 heads (or architecture-specific variants).
- REINFORCE discount factor and auxiliary loss weight, balancing the policy-gradient and binary cross-entropy terms of the composite loss.
- Warm-up phase with auxiliary loss only, 3,000 iterations.
- Meta-training on synthetic (e.g., GP draws), prior-sampled, or logged datasets.
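Collected as one illustrative configuration object (stated values are taken from the list above; entries marked task-dependent are placeholders, not published defaults):

```python
# Illustrative PABBO meta-training configuration; None entries are task-dependent
# and are not fixed by the guidelines above.
pabbo_config = {
    "encoder": {"layers": 6, "d_model": 64, "ff_width": 128, "heads": 4},
    "query_set_size": None,          # task-dependent; pairwise cost grows quadratically
    "loss": {"policy": "REINFORCE", "auxiliary": "binary_cross_entropy",
             "discount": None, "aux_weight": None},
    "warmup": {"auxiliary_only_iters": 3000},
    "meta_data": "synthetic GP draws, prior samples, or logged preference duels",
}
```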
In online deployment, real-time PABBO inference supports interactive, preference-based search or batched evaluations with latencies of about a second or less per duel, enabling applications in design personalization, hyperparameter fitting, and many-objective sequence design (Zhang et al., 2 Mar 2025, Akramov et al., 18 Dec 2025, Steinberg et al., 23 Oct 2025, Ansotegui et al., 2021).