Preference-based Pure Exploration (PrePEx)
- Preference-based Pure Exploration (PrePEx) is a framework that identifies Pareto optimal arms in multi-objective bandits using preference cones to induce partial orders.
- It leverages sample complexity lower bounds and structural reductions with algorithms like PreTS and FraPPE to enhance exploration efficiency.
- The approach applies to domains such as clinical trials, engineering design, and autonomous systems, with potential to learn preference cones from data.
Preference-based Pure Exploration (PrePEx) is a framework for sample-efficient identification of Pareto optimal arms or policies in stochastic environments where feedback is provided in the form of preferences, and rewards are vector-valued and partially ordered by a given preference cone. PrePEx generalizes scalar best-arm identification to the multi-objective, preference-driven regime, requiring new algorithmic, statistical, and computational tools to match sample complexity lower bounds under arbitrary preference structures. Recent work (notably (Shukla et al., 4 Dec 2024, Das et al., 22 Aug 2025)) has advanced the theory, algorithm design, and implementation of PrePEx, enabling scalable and rigorous Pareto set discovery in vector-valued bandits.
1. Problem Definition and Preference Modeling
The PrePEx framework considers a $K$-armed bandit where each arm yields an $L$-dimensional reward vector upon each pull. The arms are compared using a convex, closed, pointed preference cone $\mathcal{C} \subseteq \mathbb{R}^L$, which induces a partial ordering: vector $v$ is preferred to $u$ if $v - u \in \mathcal{C}$. The goal is to identify, with confidence at least $1 - \delta$, the set of Pareto optimal arms, that is, the subset of arms whose mean vectors are undominated with respect to $\mathcal{C}$.
Mathematically, the task is to solve:
$\max_{\pi \in \Delta_K} M \pi \text{ (Pareto sense under $\mathcal{C}$)},$
with $\Delta_K$ the probability simplex over the $K$ arms, and to output the empirical Pareto front with the prescribed confidence.
The formulation generalizes scalar best-arm identification to settings with multiple objectives and nontrivial preference structures.
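As a concrete illustration of cone-induced dominance (a minimal sketch, not the papers' implementation), consider a polyhedral cone $\mathcal{C} = \{x : A x \ge 0\}$ for some assumed matrix $A$; the names `dominates` and `pareto_set` below are hypothetical helpers:

```python
import numpy as np

def dominates(a, b, A, tol=1e-9):
    """True iff a - b lies in the cone C = {x : A x >= 0} and a != b."""
    d = a - b
    return bool(np.all(A @ d >= -tol) and np.linalg.norm(d) > tol)

def pareto_set(M, A):
    """Indices of arms whose mean vectors (rows of M) are undominated under C."""
    K = M.shape[0]
    return [i for i in range(K)
            if not any(dominates(M[j], M[i], A) for j in range(K) if j != i)]

# With A = identity, C is the nonnegative orthant and we recover classic
# componentwise Pareto dominance.
M = np.array([[1.0, 0.2], [0.5, 0.8], [0.3, 0.1]])  # K=3 arms, L=2 objectives
A = np.eye(2)
print(pareto_set(M, A))  # arm 2 is dominated by both others
```

Changing `A` narrows or widens the cone, which directly changes which arms count as Pareto optimal.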
2. Sample Complexity Lower Bound and Cone Geometry
A core theoretical contribution of PrePEx is the information-theoretic lower bound on sample complexity for any $\delta$-correct algorithm (Shukla et al., 4 Dec 2024, Das et al., 22 Aug 2025). The bound takes the form:
$\mathbb{E}[\tau_\delta] \ge T^*(M)\,\log\left(\tfrac{1}{2.4\,\delta}\right), \qquad T^*(M)^{-1} = \max_{w \in \Delta_K}\, \inf_{\lambda \in \mathrm{Alt}(M)}\, \sum_{k=1}^{K} w_k\, d(\mu_k, \lambda_k),$
where $w \in \Delta_K$ is the allocation vector, $d$ a suitable divergence (e.g., KL), and $\mathrm{Alt}(M)$ is a set of "confusing" alternative instances.
For Gaussian rewards, the expression specializes to a bilinear projection form:
$\max_{w \in \Delta_K}\, \inf_{\lambda \in \mathrm{Alt}(M)}\, \sum_{k=1}^{K} \frac{w_k}{2\sigma^2}\,\|\mu_k - \lambda_k\|_2^2.$
The essential difficulty of PrePEx is controlled not only by the reward gaps but also by the geometry of $\mathcal{C}$ (e.g., width, orientation, facets), which determines which vector differences are hardest to distinguish.
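The Gaussian transportation cost above is simple to evaluate numerically. The sketch below (an illustrative toy, with `gaussian_cost` a hypothetical helper) shows that alternatives lying close to the true instance contribute a small allocation-weighted cost, i.e., they are the hard-to-rule-out instances that dominate the infimum:

```python
import numpy as np

def gaussian_cost(w, M, Lam, sigma=1.0):
    """Allocation-weighted transportation cost between instance M and an
    alternative Lam for isotropic Gaussian rewards (rows = arm mean vectors)."""
    return float(np.sum(w * np.sum((M - Lam) ** 2, axis=1)) / (2 * sigma ** 2))

M = np.array([[1.0, 0.0], [0.0, 1.0]])
w = np.array([0.5, 0.5])
# An alternative that swaps the two arms' roles is costly to confuse with M...
Lam_far = np.array([[0.0, 1.0], [1.0, 0.0]])
# ...while one that only nudges arm 0 slightly is cheap, hence "confusing".
Lam_near = np.array([[0.9, 0.1], [0.0, 1.0]])
print(gaussian_cost(w, M, Lam_far), gaussian_cost(w, M, Lam_near))
```

The infimum over $\mathrm{Alt}(M)$ is attained at such nearby instances, and the cone geometry decides which perturbation directions stay inside the alternative set.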
3. Algorithmic Foundations: Track-and-Stop and Structural Reduction
Preference-based Track-and-Stop (PreTS) (Shukla et al., 4 Dec 2024) and its computationally efficient generalization FraPPE (Das et al., 22 Aug 2025) form the core algorithmic approaches.
Key methodological steps are:
- Sequential Allocation: At each round, estimate the arm means and compute an allocation vector $w$ by (approximately) solving the lower-bound optimization. Pull the arm most under-sampled relative to $w$.
- Structural Properties: FraPPE exploits three properties:
- The Pareto set is spanned by pure policies tied to Pareto arms.
- For each Pareto optimal arm, the most informative alternatives are its "neighbors" on the Pareto front (restricted minimization).
- The confusing set of alternatives (Alt-set) partitions into a union of convex sets, eliminating expensive convex hulls.
- Frank–Wolfe Optimization: FraPPE employs a projection-free Frank–Wolfe method for the outer maximization over $\Delta_K$, keeping the per-iteration cost low in both the number of arms $K$ and the number of objectives $L$.
- Stopping Rule: Based on Chernoff bounds, the algorithm halts when the cumulative empirical information (projected divergences along cone directions) certifies the empirical Pareto set with high confidence.
This approach ensures that the allocation follows the minimax optimal allocation dictated by the information constraints of the cone-ordered, multi-objective bandit.
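The projection-free step is what makes Frank–Wolfe attractive here: over the simplex, the linear maximization oracle is just a coordinate argmax. The sketch below shows the generic pattern on a toy concave objective (it is a minimal illustration of the optimizer, not FraPPE's actual objective; `frank_wolfe_simplex` is a hypothetical name):

```python
import numpy as np

def frank_wolfe_simplex(grad, K, iters=200):
    """Projection-free Frank-Wolfe ascent over the probability simplex.
    The linear maximization oracle over the simplex is a coordinate argmax,
    so each iteration costs O(K) beyond the gradient evaluation."""
    w = np.full(K, 1.0 / K)
    for t in range(1, iters + 1):
        g = grad(w)
        vertex = np.zeros(K)
        vertex[np.argmax(g)] = 1.0        # LMO: best simplex vertex
        gamma = 2.0 / (t + 2)             # standard step-size schedule
        w = (1 - gamma) * w + gamma * vertex
    return w

# Toy concave objective f(w) = -||w - target||^2, maximized at w = target.
target = np.array([0.5, 0.3, 0.2])
w_star = frank_wolfe_simplex(lambda w: -2 * (w - target), K=3)
print(np.round(w_star, 2))
```

Because every iterate is a convex combination of simplex vertices, the allocation stays feasible without any projection step.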
4. Mathematical Formulation and Optimality
FraPPE operationalizes the lower-bound computation via:
$\max_{w \in \Delta_K}\, \min_{(a,b)}\, \inf_{\lambda \in \mathrm{Alt}_{a,b}(M)}\, \sum_{k=1}^{K} w_k\, d(\mu_k, \lambda_k),$
where $\mathrm{Alt}_{a,b}(M)$ is the restricted alternative set associated with the Pareto neighbor pair $(a,b)$.
Stopping occurs if, for every candidate and neighbor pair, the empirical cumulative information is above a threshold:
$\inf_{\lambda \in \mathrm{Alt}_{a,b}(\widehat{M}_t)}\, \sum_{k=1}^{K} N_k(t)\, d(\widehat{\mu}_k(t), \lambda_k) \ge \beta(t, \delta),$
with $N_k(t)$ being the sample counts and $\beta(t, \delta)$ a threshold that is logarithmic in $t$ and $1/\delta$.
The algorithm achieves minimax optimal sample complexity matching the lower bound up to logarithmic factors, as required for $\delta$-correct PrePEx.
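For Gaussian rewards, the stopping test reduces to comparing an allocation-weighted squared distance against the threshold. The sketch below is a simplified illustration under assumed names (`beta`, `should_stop`) and an explicitly enumerated alternative set; the exact threshold constant is paper-specific:

```python
import numpy as np

def beta(t, delta, K):
    """Heuristic stopping threshold, logarithmic in t and 1/delta (a common
    shape in track-and-stop analyses; exact constants are paper-specific)."""
    return np.log((1 + np.log(t)) / delta) + K

def should_stop(N, mu_hat, alternatives, delta, sigma=1.0):
    """Stop when the smallest empirical cumulative information over all
    confusing alternatives exceeds the threshold."""
    t = int(np.sum(N))
    glr = min(
        np.sum(N * np.sum((mu_hat - lam) ** 2, axis=1)) / (2 * sigma ** 2)
        for lam in alternatives
    )
    return bool(glr >= beta(t, delta, K=len(N)))

N = np.array([400, 400])                       # samples per arm
mu_hat = np.array([[1.0, 0.0], [0.0, 1.0]])    # empirical means
alts = [np.array([[0.8, 0.2], [0.0, 1.0]])]    # one nearby confusing instance
print(should_stop(N, mu_hat, alts, delta=0.05))
```

With few samples the same test fails, so the algorithm keeps sampling until the closest confusing alternative is statistically ruled out.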
5. Computational and Statistical Efficiency
FraPPE's per-round computational cost scales polynomially in $K$ and $L$, representing a practical improvement over previous convex hull–based methods whose costs could be prohibitive for large $K$ or $L$ (Das et al., 22 Aug 2025). The main computational steps are sparse due to the localized structure of the Pareto set and the geometric decomposition of the alternative sets.
Empirical results demonstrate:
- Marked reductions in sample complexity over gradient- and sampling-based baselines for Pareto set identification on both synthetic and real datasets.
- Asymptotically faster identification of the exact Pareto front, even as dimensionality increases.
- Superior empirical error rates at earlier stopping times.
The dominant computational cost for large $K$ is the Pareto set computation, which remains tractable for moderate $L$.
6. Comparison with Prior Pure Exploration Algorithms
The FraPPE and PreTS advances derive from and extend earlier track-and-stop, game-theoretic pure exploration, and instance-optimal algorithms (Shukla et al., 4 Dec 2024, Degenne et al., 2019):
- Scalar BAI and Linear Bandits: Methods such as standard Track-and-Stop cannot immediately accommodate preference cones and multi-objective reward structures.
- Convexification Approaches: Prior efforts convexify nonconvex alternative sets at the cost of scalability.
- Posterior Sampling and Oracle-based Methods: Oracle-based methods (Degenne et al., 2019) address finite-confidence guarantees in exponential families but may become computationally infeasible for generic PrePEx.
- FraPPE Generalization: By leveraging new structural reductions and the Frank–Wolfe optimizer, FraPPE achieves both statistical optimality and practical scalability in generic, arbitrary cone settings.
This positions FraPPE as the first algorithm to achieve both theoretical optimality and empirical efficiency for PrePEx with arbitrary cones.
7. Applications and Future Directions
Applications of PrePEx and FraPPE include:
- Multi-objective clinical trials (selecting optimal treatments balancing efficacy, safety, cost).
- Engineering design optimization with conflicting criteria (e.g., safety vs. efficiency).
- Multi-objective autonomous agent evaluation where explicit scalarization is not viable.
- Preference-driven AI tasks where specification is via partial orders, expert demonstration, or learned cones.
Future research avenues include:
- Learning the preference cone from preference data rather than taking it as given.
- Extending to contextual or linear bandits, or settings with additional structure.
- Integration into preference-based reinforcement learning for efficient exploration in complex MDPs.
The FraPPE methodology demonstrates that even in challenging multi-objective, preference-specified bandit environments, it is possible to simultaneously achieve sharp statistical efficiency (matching information-theoretic lower bounds) and practical computational scalability (Das et al., 22 Aug 2025).