
Preference-based Pure Exploration (PrePEx)

Updated 26 August 2025
  • Preference-based Pure Exploration (PrePEx) is a framework that identifies Pareto optimal arms in multi-objective bandits using preference cones to induce partial orders.
  • It leverages sample complexity lower bounds and structural reductions with algorithms like PreTS and FraPPE to enhance exploration efficiency.
  • The approach applies to domains such as clinical trials, engineering design, and autonomous systems, with potential to learn preference cones from data.

Preference-based Pure Exploration (PrePEx) is a framework for sample-efficient identification of Pareto optimal arms or policies in stochastic environments where feedback is provided in the form of preferences, and rewards are vector-valued and partially ordered by a given preference cone. PrePEx generalizes scalar best-arm identification to the multi-objective, preference-driven regime, requiring new algorithmic, statistical, and computational tools to match sample complexity lower bounds under arbitrary preference structures. Recent work (notably Shukla et al., 4 Dec 2024; Das et al., 22 Aug 2025) has advanced the theory, algorithm design, and implementation of PrePEx, enabling scalable and rigorous Pareto set discovery in vector-valued bandits.

1. Problem Definition and Preference Modeling

The PrePEx framework considers a $K$-armed bandit where each arm $k$ yields an $L$-dimensional reward vector $M_k \in \mathbb{R}^L$ upon pull. The arms are compared using a convex, closed, pointed preference cone $\mathcal{C} \subseteq \mathbb{R}^L$, which induces a partial ordering: vector $u$ is preferred to $v$ if $u - v \in \mathcal{C}$. The goal is to identify, with confidence at least $1-\delta$, the set of Pareto optimal arms, that is, the subset that is undominated with respect to $\mathcal{C}$.

Mathematically, the task is to solve:

$$\max_{\pi \in \Delta_K} M \pi \quad \text{(in the Pareto sense under } \mathcal{C}\text{)},$$

with $\Delta_K$ the $K$-simplex, and to output the empirical Pareto front

$$\Pi^*(M) = \{\pi \in \Delta_K : \nexists\, \pi',\; M\pi' \succ_{\mathcal{C}} M\pi \}.$$

The formulation generalizes scalar best-arm identification to settings with multiple objectives and nontrivial preference structures.
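When the cone is polyhedral, both the cone order and the resulting Pareto set over arms can be checked directly. Below is a minimal sketch, assuming the cone is given in half-space form $\mathcal{C} = \{x : Ax \geq 0\}$; the helper names and the toy instance are illustrative, not taken from the cited papers.

```python
import numpy as np

def in_cone(x, A, tol=1e-9):
    """Membership test for a polyhedral cone C = {x : A x >= 0}."""
    return np.all(A @ x >= -tol)

def pareto_arms(M, A):
    """Indices of arms whose mean vectors (columns of M) are undominated
    under the cone order: arm j dominates arm k if M[:, j] - M[:, k]
    lies in C and the two means differ."""
    L, K = M.shape
    pareto = []
    for k in range(K):
        dominated = any(
            j != k
            and in_cone(M[:, j] - M[:, k], A)
            and not np.allclose(M[:, j], M[:, k])
            for j in range(K)
        )
        if not dominated:
            pareto.append(k)
    return pareto

# With A = I the cone is the nonnegative orthant, i.e. componentwise dominance.
M = np.array([[1.0, 0.5, 0.9],
              [0.2, 0.8, 0.1]])   # L = 2 objectives, K = 3 arms
print(pareto_arms(M, np.eye(2)))  # → [0, 1]  (arm 2 is dominated by arm 0)
```

Changing `A` changes the order: a narrower cone dominates fewer arm pairs, so the Pareto set grows; a wider cone shrinks it.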

2. Sample Complexity Lower Bound and Cone Geometry

A core theoretical contribution of PrePEx is the information-theoretic lower bound on sample complexity for any $(1-\delta)$-correct algorithm (Shukla et al., 4 Dec 2024; Das et al., 22 Aug 2025). The bound is:

$$(T_{M,\mathcal{C}})^{-1} = \sup_{w \in \Delta_K} \inf_{\substack{\pi \notin \Pi^*(M) \\ \pi^* \in \Pi^*(M)}} \inf_{\widetilde{M} \in \mathcal{A}} \inf_{z \in \mathcal{C}} \sum_{k=1}^K w_k\, D(z^\top M_k,\; z^\top \widetilde{M}_k),$$

where $w$ is the allocation vector, $D(\cdot,\cdot)$ a suitable divergence (e.g., KL), and $\mathcal{A}$ is a set of "confusing" alternative instances.

For Gaussian rewards, the expression specializes to a bilinear projection form:

$$(T_{M,\mathcal{C}}^{\text{Gaussian}})^{-1} = \inf_{\pi \notin \Pi^*} \min_{z \in \mathcal{C} \setminus \{0\}} \frac{(z^\top M(\pi^* - \pi))^2}{2\, \mathrm{Tr}(\Sigma)\, \|\pi^* - \pi\|^2}.$$

The essential difficulty of PrePEx is controlled not only by the reward gaps but also by the geometry of $\mathcal{C}$ (e.g., width, orientation, facets), which determines which vector differences are hardest to distinguish.
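The Gaussian expression can be made concrete by estimating the inner minimisation over cone directions numerically for a fixed suboptimal policy. A sketch for the nonnegative-orthant cone, with the coordinate axes added to the sampled directions since the minimiser often sits on an extreme ray; the function and instance are illustrative assumptions, not code from the papers.

```python
import numpy as np

def gaussian_hardness(M, pi_star, pi, trace_sigma, n_dirs=20000, seed=0):
    """Estimate min over unit z in the nonnegative orthant of
    (z^T M (pi* - pi))^2 / (2 Tr(Sigma) ||pi* - pi||^2)
    by sampling random orthant directions plus the coordinate axes."""
    rng = np.random.default_rng(seed)
    L = M.shape[0]
    d = pi_star - pi
    v = M @ d                                   # projected mean gap
    Z = np.abs(rng.standard_normal((n_dirs, L)))
    Z /= np.linalg.norm(Z, axis=1, keepdims=True)
    Z = np.vstack([np.eye(L), Z])               # axes = extreme rays of the orthant
    return ((Z @ v) ** 2).min() / (2 * trace_sigma * d @ d)

# Arm 0 dominates arm 2 in both objectives by a gap of 0.1, so the hardest
# direction is a coordinate axis and the value is 0.1^2 / (2 * 1 * 2).
M = np.array([[1.0, 0.5, 0.9],
              [0.2, 0.8, 0.1]])
val = gaussian_hardness(M, np.array([1.0, 0, 0]), np.array([0, 0, 1.0]),
                        trace_sigma=1.0)
print(round(val, 4))  # → 0.0025
```

The smallest per-objective gap dominating the value illustrates the point above: the cone's extreme rays decide which coordinate of the gap vector is hardest to certify.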

3. Algorithmic Foundations: Track-and-Stop and Structural Reduction

Preference-based Track-and-Stop (PreTS) (Shukla et al., 4 Dec 2024) and its computationally efficient generalization FraPPE (Das et al., 22 Aug 2025) form the core algorithmic approaches.

Key methodological steps are:

  • Sequential Allocation: At each round, estimate means $\widehat{M}_t$ and compute a probability vector $w_t$ by (approximately) solving the lower-bound optimization. Pull the arm most under-sampled relative to $w_t$.
  • Structural Properties: FraPPE exploits three properties:

    1. The Pareto set is spanned by pure policies tied to Pareto arms.
    2. For each Pareto optimal arm ii, the most informative alternatives are its "neighbors" on the Pareto front (restricted minimization).
    3. The confusing set of alternatives (Alt-set) partitions into a union of convex sets, eliminating expensive convex hulls.
  • Frank-Wolfe Optimization: FraPPE employs a projection-free Frank-Wolfe method for the outer maximization over $w$, achieving low per-iteration cost, $\mathcal{O}(KL^2)$ for $K$ arms and $L$ objectives.

  • Stopping Rule: Based on Chernoff bounds, the algorithm halts when the cumulative empirical information (projected divergences along cone directions) certifies the empirical Pareto set with high confidence.

This approach ensures that the allocation follows the minimax optimal allocation dictated by the information constraints of the cone-ordered, multi-objective bandit.
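The reason a projection-free method suffices for the outer maximization over $w$ is that the feasible set is the simplex $\Delta_K$, whose linear maximisation oracle is simply the best coordinate vertex. A toy sketch of Frank-Wolfe over the simplex, using a stand-in smooth concave objective rather than the actual PrePEx lower-bound objective:

```python
import numpy as np

def frank_wolfe_simplex(grad, K, iters=5000):
    """Projection-free Frank-Wolfe over the K-simplex: the linear
    maximisation oracle is just the best coordinate vertex, so each
    iteration costs O(K) on top of one gradient evaluation."""
    w = np.full(K, 1.0 / K)
    for t in range(iters):
        g = grad(w)
        vertex = np.zeros(K)
        vertex[np.argmax(g)] = 1.0          # LMO over the simplex
        step = 2.0 / (t + 2.0)
        w = (1 - step) * w + step * vertex  # convex combo stays feasible
    return w

# Stand-in concave objective f(w) = sum_i c_i log w_i, whose maximiser
# over the simplex is w_i proportional to c_i.
c = np.array([3.0, 1.0, 1.0])
w = frank_wolfe_simplex(lambda w: c / np.clip(w, 1e-12, None), K=3)
print(np.round(w, 2))  # converges toward [0.6, 0.2, 0.2]
```

In FraPPE the gradient step would come from the inner minimisations over neighbors, alternatives, and cone directions; here the closed-form objective just makes the convergence behaviour easy to verify.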

4. Mathematical Formulation and Optimality

FraPPE operationalizes the lower-bound computation via:

$$(T_{M,\mathcal{C}})^{-1} = \max_{w \in \Delta_K} \min_{(i,j)\ \text{neighbors}} \min_{\tau \in \bar{\Lambda}_{ij}(M)} \min_{z \in \mathcal{C} \cap \mathcal{B}(1)} \sum_{k} w_k \left[z^\top M_k \cdot z^\top \tau_k\right],$$

where $\bar{\Lambda}_{ij}(M)$ is the restricted alternative set associated with Pareto neighbor pairs.

Stopping occurs if, for every candidate and neighbor, the empirical cumulative information is above a threshold:

$$\min_{i \in \Pi^*_t,\ j\ \text{neighbor}}\ \inf_{\tau \in \bar{\Lambda}_{ij}(\widehat{M}_t)}\ \min_{z \in \mathcal{C} \cap \mathcal{B}(1)} \sum_{k} N_{k,t} \left[z^\top \widehat{M}_{k,t} \cdot z^\top \tau_k\right] \geq c(t, \delta),$$

with $N_{k,t}$ being sample counts and $c(t,\delta)$ a logarithmic function of $\delta$ and $t$.

The algorithm achieves minimax optimal sample complexity matching the lower bound up to logarithmic factors, as required for $\delta$-correct PrePEx.
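The shape of the stopping test can be sketched for unit-variance Gaussian rewards, where the projected divergence along a direction $z$ is $(z^\top(M_k - \tau_k))^2/2$. The threshold form below, $c(t,\delta) = \log((1+\log t)/\delta)$, is a common choice in track-and-stop analyses; the exact constants in the PrePEx papers may differ, and the helper names are illustrative.

```python
import math
import numpy as np

def threshold(t, delta):
    """Chernoff-style threshold c(t, delta); a common track-and-stop
    choice, not necessarily the exact constant from the papers."""
    return math.log((1 + math.log(t)) / delta)

def projected_info(N, M_hat, tau, z):
    """Empirical information along cone direction z for unit-variance
    Gaussians: sum_k N_k * (z^T (M_hat_k - tau_k))^2 / 2."""
    gaps = z @ (M_hat - tau)        # one projected gap per arm
    return float(np.sum(N * gaps ** 2) / 2.0)

# Two arms, two objectives (columns are arms): the empirical mean of arm 0
# differs from the confusing alternative tau by 0.5 in the first objective.
N = np.array([10.0, 10.0])
M_hat = np.array([[1.0, 0.0],
                  [0.0, 1.0]])
tau = np.array([[0.5, 0.0],
                [0.0, 1.0]])
z = np.array([1.0, 0.0])
stat = projected_info(N, M_hat, tau, z)
print(stat, stat >= threshold(t=20, delta=0.05))  # → 1.25 False
```

Here the statistic (1.25) is still below the threshold at $t = 20$, so sampling would continue; more pulls of arm 0 raise $N_{0,t}$ and hence the statistic linearly.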

5. Computational and Statistical Efficiency

FraPPE exhibits a computational complexity per round of $\mathcal{O}(KL^2)$, representing a practical improvement over previous convex hull–based methods whose costs could be prohibitive for large $K$ or $L$ (Das et al., 22 Aug 2025). The main computational steps are sparse due to the localized structure of the Pareto set and the geometric decomposition of the alternative sets.

Empirical results demonstrate:

  • Reductions in sample complexity by factors of $5\times$ to $6\times$ over gradient- and sampling-based baselines for Pareto set identification on both synthetic and real datasets.
  • Asymptotically faster identification of the exact Pareto front, even as dimensionality increases.
  • Superior empirical error rates at earlier stopping times.

The dominant computational cost for large $L$ is the Pareto set computation, which scales as $O(K \log^{\max\{1, L-2\}}(K))$ for moderate $L$.
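For $L = 2$ the scaling is $O(K \log K)$, achieved by a single sort-and-sweep pass. A minimal sketch (maximising both objectives; the function name and toy data are illustrative):

```python
def pareto_front_2d(points):
    """Componentwise-maximal points among K 2-D vectors in O(K log K):
    sort by the first objective (descending, ties broken by the second)
    and sweep, keeping a point iff its second objective beats every
    point already seen."""
    order = sorted(range(len(points)),
                   key=lambda i: (-points[i][0], -points[i][1]))
    front, best_y = [], float("-inf")
    for i in order:
        if points[i][1] > best_y:
            front.append(i)
            best_y = points[i][1]
    return sorted(front)

pts = [(1.0, 0.2), (0.5, 0.8), (0.9, 0.1), (0.9, 0.9)]
print(pareto_front_2d(pts))  # → [0, 3]
```

Points 1 and 2 are both dominated by point 3, which the sweep discards without any pairwise $O(K^2)$ comparison.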

6. Comparison with Prior Pure Exploration Algorithms

The FraPPE and PreTS advances derive from and extend earlier track-and-stop, game-theoretic pure exploration, and instance-optimal algorithms (Shukla et al., 4 Dec 2024; Degenne et al., 2019):

  • Scalar BAI and Linear Bandits: Methods such as standard Track-and-Stop cannot immediately accommodate preference cones and multi-objective reward structures.
  • Convexification Approaches: Prior efforts convexify nonconvex alternative sets at the cost of scalability.
  • Posterior Sampling and Oracle-based Methods: Oracle-based methods (Degenne et al., 2019) address finite-confidence guarantees in exponential families but may become computationally infeasible for generic PrePEx.
  • FraPPE Generalization: By leveraging new structural reductions and the Frank–Wolfe optimizer, FraPPE achieves both statistical optimality and practical scalability in generic, arbitrary cone settings.

This positions FraPPE as the first algorithm to achieve both theoretical optimality and empirical efficiency for PrePEx with arbitrary cones.

7. Applications and Future Directions

Applications of PrePEx and FraPPE include:

  • Multi-objective clinical trials (selecting optimal treatments balancing efficacy, safety, cost).
  • Engineering design optimization with conflicting criteria (e.g., safety vs. efficiency).
  • Multi-objective autonomous agent evaluation where explicit scalarization is not viable.
  • Preference-driven AI tasks where specification is via partial orders, expert demonstration, or learned cones.

Future research avenues include:

  • Learning the preference cone $\mathcal{C}$ from preference data rather than taking it as given.
  • Extending to contextual or linear bandits, or settings with additional structure.
  • Integration into preference-based reinforcement learning for efficient exploration in complex MDPs.

The FraPPE methodology demonstrates that even in challenging multi-objective, preference-specified bandit environments, it is possible to simultaneously achieve sharp statistical efficiency (matching information-theoretic lower bounds) and practical computational scalability (Das et al., 22 Aug 2025).
