Explore-Then-Commit (ETC) Methods

Updated 5 July 2026
  • ETC is a class of policies that separates an initial pure exploration phase from an irrevocable commitment phase to balance information gathering with exploitation.
  • It is applied in settings like stochastic bandits, control systems, and decentralized matching markets to manage the trade-off between exploration costs and the risks of early commitment.
  • Variants such as SPRT-ETC and Double ETC demonstrate that refining the phase structure can achieve near-optimal regret performance while addressing the limitations of single-stage commitment.

Explore-Then-Commit (ETC) denotes a family of policies that allocate an initial segment of the horizon to pure exploration and then irrevocably commit to a decision rule for the remainder. In the literature considered here, the committed object ranges from a single arm in stochastic bandits to a controller in partially observable linear systems, an open-loop action sequence in nonstationary linear bandits with latent dynamics, a stable matching in decentralized two-sided markets, and an arm selected under a finite-time risk-return criterion (Garivier et al., 2016, Jin et al., 2020, Lale et al., 2020, Choi et al., 17 Oct 2025, Pagare et al., 2024, Yekkehkhany et al., 2019). The defining feature is the same across domains: ETC separates information acquisition from exploitation, so its performance is governed by the tension between the cost of exploration and the consequences of committing with imperfect information.

1. Canonical formulation

In the classical stochastic bandit setting, ETC is a two-stage strategy. For a $K$-armed bandit with unknown means $\mu_1,\dots,\mu_K$, one picks an exploration budget $n$, pulls each arm $n$ times, computes empirical means $\hat\mu_i$, and then commits to

$$i^* = \arg\max_i \hat\mu_i$$

for the remaining $T-Kn$ rounds (Jin et al., 2020). In the two-armed Gaussian formulation, ETC can be stated more generally through an exploration stopping time $\tau\le T$, an $\mathcal F_\tau$-measurable decision $\hat a\in\{1,2\}$, alternating uniform sampling during rounds $t\le\tau$, and commitment to $\hat a$ for $t>\tau$ (Garivier et al., 2016).
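
The fixed-budget procedure can be sketched in a few lines of Python. This is a minimal illustration, assuming unit-variance Gaussian rewards; the function name `explore_then_commit`, its arguments, and the example means are hypothetical and do not come from the cited papers.

```python
import random

def explore_then_commit(means, n, horizon, rng):
    """Fixed-budget ETC on a Gaussian bandit (unit variance is an assumption).

    Returns the committed arm and the pseudo-regret of the whole run.
    """
    k = len(means)
    if k * n > horizon:
        raise ValueError("exploration budget K*n exceeds the horizon T")
    # Exploration: pull every arm exactly n times and average the rewards.
    estimates = []
    for mu in means:
        rewards = [rng.gauss(mu, 1.0) for _ in range(n)]
        estimates.append(sum(rewards) / n)
    # Commitment: play the empirical best arm for the remaining T - K*n rounds.
    committed = max(range(k), key=lambda i: estimates[i])
    best = max(means)
    explore_regret = n * sum(best - mu for mu in means)
    commit_regret = (horizon - k * n) * (best - means[committed])
    return committed, explore_regret + commit_regret

rng = random.Random(0)
arm, regret = explore_then_commit([0.0, 0.5, 1.0], n=200, horizon=10_000, rng=rng)
print(arm, regret)
```

The returned regret splits into a deterministic exploration cost and a commitment cost that is zero whenever the empirical best arm is the true best arm.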

The standard regret expressions already expose the ETC structure. In the $K$-armed case,

μ1,,μK\mu_1,\dots,\mu_K4

whereas in the two-armed Gaussian case with μ1,,μK\mu_1,\dots,\mu_K5 and μ1,,μK\mu_1,\dots,\mu_K6,

μ1,,μK\mu_1,\dots,\mu_K7

with

μ1,,μK\mu_1,\dots,\mu_K8

This yields the basic decomposition

μ1,,μK\mu_1,\dots,\mu_K9

so ETC regret is the sum of an exploration term and a post-commitment misidentification term (Garivier et al., 2016).

A related finite-horizon formulation appears in risk-averse bandits. There, an ETC policy is a pair nn0: the learner performs pure experimentation for nn1 rounds and then commits to one fixed arm for nn2 exploitations (Yekkehkhany et al., 2019). This preserves the stagewise architecture of ETC while changing the criterion used to select the committed arm.

2. Regret structure and the classical suboptimality of single-stage ETC

The main negative result attached to ETC is not that it fails to achieve sublinear regret, but that a single exploration phase followed by a single commitment phase is generally asymptotically suboptimal relative to fully sequential policies. In the two-armed Gaussian setting, fixed-design ETC with known gap nn3 and deterministic nn4 has

nn5

and the optimal fixed budget satisfies

nn6

A matching lower bound gives nn7 (Garivier et al., 2016).
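
The trade-off behind the fixed budget can be checked numerically. The sketch below assumes two unit-variance Gaussian arms and $n$ pulls per arm, so that the difference of empirical means is distributed as $N(\Delta, 2/n)$ and the wrong arm is chosen with probability $\Phi(-\Delta\sqrt{n/2})$. The helper names and the brute-force search over $n$ are illustrative assumptions, not the closed-form budget from the paper.

```python
import math

def gaussian_cdf(x):
    # Standard normal CDF expressed through the complementary error function.
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def fixed_budget_regret(n, gap, horizon):
    """Expected regret of two-armed ETC with n pulls per arm (unit variance assumed)."""
    # Each exploration pair plays the worse arm once, costing `gap`.
    exploration = n * gap
    # The difference of empirical means is N(gap, 2/n), so the worse arm
    # looks better with probability Phi(-gap * sqrt(n / 2)).
    p_wrong = gaussian_cdf(-gap * math.sqrt(n / 2.0))
    commitment = (horizon - 2 * n) * gap * p_wrong
    return exploration + commitment

def best_budget(gap, horizon):
    # Exhaustive search over admissible budgets; fine for moderate horizons.
    return min(range(1, horizon // 2 + 1),
               key=lambda n: fixed_budget_regret(n, gap, horizon))

gap, horizon = 0.5, 10_000
n_star = best_budget(gap, horizon)
print(n_star, fixed_budget_regret(n_star, gap, horizon))
```

Too small a budget leaves a large misidentification term, while too large a budget pays linearly for exploration; the minimiser sits strictly between the two extremes.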

Allowing a data-dependent stopping time improves the constant but not the basic limitation. The SPRT-based ETC rule, which alternates arms until

nn8

achieves

nn9

hence $\limsup_{T\to\infty} R_T/\log T \le 1/|\Delta|$, but no ETC can achieve an asymptotic constant smaller than $1/|\Delta|$ in the known-gap case (Garivier et al., 2016).

When nn2 is unknown, the gap is larger. A fixed-confidence best-arm identification subroutine with risk nn3 yields BAI-ETC with asymptotic behavior nn4, and any ETC with unknown nn5 has asymptotic cost at least nn6 (Garivier et al., 2016). By contrast, fully sequential algorithms in the same paper attain smaller constants: nn7-UCB satisfies nn8 in the known-gap case, and UCBnn9 has asymptotic regret μ^i\hat\mu_i0 in the unknown-gap case. The standard conclusion is therefore precise: ETC is not uniformly optimal because its forced separation of exploration and exploitation creates a factor-two asymptotic penalty in the two-armed Gaussian model (Garivier et al., 2016).

This corrects a common misconception. ETC is not merely a crude fixed-design heuristic; stopping-time variants such as SPRT-ETC are statistically refined. The limitation is structural: once commitment occurs, no further information can correct an early error.
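
The stopping-time formulation can be sketched as follows. This is an illustration under assumptions: rewards are unit-variance Gaussian, `should_stop` is a hypothetical hook, and the SPRT-style threshold `log(T * gap**2) / (n * gap)` is chosen for illustration and may differ from the exact rule analysed by Garivier et al.

```python
import math
import random

def stopping_time_etc(means, horizon, should_stop, rng):
    """Two-armed ETC with a data-dependent exploration length.

    `should_stop(n, diff)` receives the pulls per arm and the difference of
    empirical means; it is a hypothetical hook, not a fixed rule.
    Returns (committed_arm, tau, pseudo_regret).
    """
    sums = [0.0, 0.0]
    n = 0
    # Exploration: alternate the two arms until the rule fires or time runs out.
    while 2 * (n + 1) <= horizon:
        for arm in (0, 1):
            sums[arm] += rng.gauss(means[arm], 1.0)
        n += 1
        if should_stop(n, (sums[0] - sums[1]) / n):
            break
    tau = 2 * n
    # Commitment: the decision uses only data observed up to time tau.
    committed = 0 if sums[0] >= sums[1] else 1
    gap = abs(means[0] - means[1])
    worse = 0 if means[0] < means[1] else 1
    regret = n * gap + (horizon - tau) * gap * (committed == worse)
    return committed, tau, regret

def sprt_style_rule(gap, horizon):
    # Assumed threshold; requires horizon * gap**2 > 1 so the level is positive.
    level = math.log(horizon * gap * gap)
    return lambda n, diff: abs(diff) >= level / (n * gap)

result = stopping_time_etc([0.0, 0.5], 10_000,
                           sprt_style_rule(0.5, 10_000), random.Random(0))
print(result)
```

Whatever rule is plugged in, the structure is the same: nothing observed after `tau` can change `committed`, which is the structural limitation discussed above.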

3. Double ETC and batched asymptotic optimality

"Double Explore-then-Commit: Asymptotic Optimality and Beyond" shows that the preceding negative conclusion is specific to single-stage ETC rather than to stagewise learning itself (Jin et al., 2020). The proposed Double ETC (DETC) introduces two exploration phases and two commitment phases.

In the known-gap two-armed version, DETC uses

μ^i\hat\mu_i1

Stage I samples both arms until each has been pulled μ^i\hat\mu_i2 times. Stage II commits temporarily to the empirical best arm μ^i\hat\mu_i3 until it has been pulled μ^i\hat\mu_i4 times. Stage III re-opens exploration of the other arm μ^i\hat\mu_i5 and continues until

μ^i\hat\mu_i6

fails, or μ^i\hat\mu_i7. Stage IV commits to the arm with larger empirical mean (Jin et al., 2020). The unknown-gap version replaces μ^i\hat\mu_i8 with data-dependent confidence tests and uses μ^i\hat\mu_i9.
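
The four-stage structure can be sketched as a skeleton. The stage lengths `t1`, `tau1`, `max_fresh` and the test `undecided` are hypothetical inputs standing in for the paper's specific choices, and unit-variance Gaussian rewards are assumed.

```python
import random

def double_etc(means, horizon, t1, tau1, max_fresh, undecided, rng):
    """Four-stage explore/commit skeleton for two unit-variance Gaussian arms.

    t1, tau1, max_fresh and undecided(n, diff) are hypothetical tuning inputs.
    Returns (final_arm, pulls), where pulls is the full sequence of arms played.
    """
    if not (t1 >= 1 and 2 * t1 <= horizon and tau1 >= t1):
        raise ValueError("need 1 <= t1 <= tau1 and 2 * t1 <= horizon")
    sums, counts, pulls = [0.0, 0.0], [0, 0], []

    def pull(arm):
        reward = rng.gauss(means[arm], 1.0)
        sums[arm] += reward
        counts[arm] += 1
        pulls.append(arm)
        return reward

    # Stage I: sample both arms until each has t1 pulls.
    for _ in range(t1):
        pull(0)
        pull(1)
    lead = 0 if sums[0] >= sums[1] else 1
    other = 1 - lead

    # Stage II: temporary commitment to the leader until it has tau1 pulls.
    while counts[lead] < tau1 and len(pulls) < horizon:
        pull(lead)
    lead_mean = sums[lead] / counts[lead]

    # Stage III: fresh pulls of the other arm, compared against the leader.
    fresh_sum, fresh_n = 0.0, 0
    while fresh_n < max_fresh and len(pulls) < horizon:
        fresh_sum += pull(other)
        fresh_n += 1
        if not undecided(fresh_n, lead_mean - fresh_sum / fresh_n):
            break

    # Stage IV: final commitment; remaining rounds are recorded, not sampled.
    final = other if fresh_n and fresh_sum / fresh_n > lead_mean else lead
    pulls.extend([final] * (horizon - len(pulls)))
    return final, pulls

def example_test(n, diff):
    # Hypothetical rule: keep sampling while the gap is within sqrt(4 / n).
    return abs(diff) < (4.0 / n) ** 0.5

final_arm, history = double_etc([0.0, 3.0], 1_000, t1=5, tau1=50,
                                max_fresh=50, undecided=example_test,
                                rng=random.Random(1))
print(final_arm, len(history))
```

In Stage III only the fresh samples of the other arm enter the comparison, which mirrors the idea that the decisive noise after Stage II comes from a single arm.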

The paper’s interpretation is that the second exploration phase compares the temporarily chosen arm against fresh pulls of the unchosen arm, so the decisive sampling noise after Stage II comes from only one arm. This yields asymptotic optimality:

i=argmaxiμ^ii^*=\arg\max_i \hat\mu_i0

in the unknown-gap two-arm case, and for i=argmaxiμ^ii^*=\arg\max_i \hat\mu_i1 arms

i=argmaxiμ^ii^*=\arg\max_i \hat\mu_i2

The same paper extends DETC to batched bandits and proves that a batched version attains the same asymptotic regret with only $O(1)$ rounds (Jin et al., 2020).

A plausible implication is that ETC should be understood less as a single algorithm than as a design principle. Once the phase structure is allowed to include re-entry into exploration, ETC-type methods can recover asymptotic optimality while retaining low round complexity.

4. ETC in partially observable linear quadratic control

In control, ETC appears in a more structured form. "Regret Minimization in Partially Observable Linear Quadratic Control" studies an unknown discrete-time partially observable linear system

i=argmaxiμ^ii^*=\arg\max_i \hat\mu_i4

with Gaussian process and measurement noise, per-step cost

i=argmaxiμ^ii^*=\arg\max_i \hat\mu_i5

and comparison against the optimal average cost under the information pattern i=argmaxiμ^ii^*=\arg\max_i \hat\mu_i6 (Lale et al., 2020).

The proposed algorithm, ExpCommit, dedicates the first i=argmaxiμ^ii^*=\arg\max_i \hat\mu_i7 steps to pure exploration by drawing

i=argmaxiμ^ii^*=\arg\max_i \hat\mu_i8

From the resulting input-output data, it forms overlapping i=argmaxiμ^ii^*=\arg\max_i \hat\mu_i9-length regressors and estimates the truncated Markov-parameter matrix

TKnT-Kn0

through least squares. Under TKnT-Kn1 and sufficiently large TKnT-Kn2, the paper proves

TKnT-Kn3

with high probability. Ho-Kalman is then applied to recover an order-TKnT-Kn4 realization TKnT-Kn5 up to a similarity transform, and high-probability confidence sets for TKnT-Kn6 are constructed (Lale et al., 2020).

The commit phase is not a static arm choice. ExpCommit selects an optimistic model TKnT-Kn7 inside the confidence set, computes the corresponding steady-state gains TKnT-Kn8, and applies the certainty-equivalent controller

TKnT-Kn9

with the state estimate updated by the Kalman filter associated with τT\tau\le T0 (Lale et al., 2020). The regret analysis relies on a Bellman-optimality identity and a decomposition of the commit-phase regret into

τT\tau\le T1

where different terms capture mismatches in τT\tau\le T2 and in noise cross-moments. Choosing τT\tau\le T3 yields

$$\mathrm{Regret}(T) = \tilde{\mathcal O}\big(T^{2/3}\big).$$

A central technical point is stability. Under the assumptions τT\tau\le T5, controllability, observability, and uniform contractibility, the paper shows that the optimistic covariance and filter gain satisfy

τT\tau\le T6

and that latent-state estimates and observations remain uniformly bounded for all τT\tau\le T7, preventing blow-up during commitment (Lale et al., 2020).

5. ETC for nonstationary linear bandits with latent dynamics

"Explore-then-Commit for Nonstationary Linear Bandits with Latent Dynamics" specializes ETC to a bandit problem in which rewards depend on both actions and latent states, and the latent dynamics also depend on actions (Choi et al., 17 Oct 2025). The model is

τT\tau\le T8

with unknown stable τT\tau\le T9, action set Fτ\mathcal F_\tau0 or Fτ\mathcal F_\tau1, and bilinear reward

Fτ\mathcal F_\tau2

The benchmark is the best open-loop sequence

Fτ\mathcal F_\tau3

where Fτ\mathcal F_\tau4 and Fτ\mathcal F_\tau5 is a block-Toeplitz matrix with blocks Fτ\mathcal F_\tau6 for Fτ\mathcal F_\tau7 (Choi et al., 17 Oct 2025).

The exploration phase fixes a length Fτ\mathcal F_\tau8 and plays random Rademacher actions for Fτ\mathcal F_\tau9. For Markov parameters

a^{1,2}\hat a\in\{1,2\}0

the reward can be unrolled for a^{1,2}\hat a\in\{1,2\}1 as

a^{1,2}\hat a\in\{1,2\}2

with a^{1,2}\hat a\in\{1,2\}3. Stacking covariates a^{1,2}\hat a\in\{1,2\}4 yields the least-squares estimator

a^{1,2}\hat a\in\{1,2\}5

If a^{1,2}\hat a\in\{1,2\}6, then with probability at least a^{1,2}\hat a\in\{1,2\}7,

a^{1,2}\hat a\in\{1,2\}8

so choosing a^{1,2}\hat a\in\{1,2\}9 makes truncation negligible and leaves a statistical error of order μ1,,μK\mu_1,\dots,\mu_K00 (Choi et al., 17 Oct 2025).

In the commit phase, the algorithm forms an estimated block-Toeplitz matrix μ1,,μK\mu_1,\dots,\mu_K01 for μ1,,μK\mu_1,\dots,\mu_K02, sets μ1,,μK\mu_1,\dots,\mu_K03, and solves

μ1,,μK\mu_1,\dots,\mu_K04

This is an indefinite quadratic form over the hypercube, i.e. a QUBO, and is NP-hard. The practical method proposed in the paper is an SDP relaxation with Goemans-Williamson rounding:

μ1,,μK\mu_1,\dots,\mu_K05

followed by μ1,,μK\mu_1,\dots,\mu_K06, a Gaussian random vector μ1,,μK\mu_1,\dots,\mu_K07, and the rounded decision μ1,,μK\mu_1,\dots,\mu_K08. The guarantee is

μ1,,μK\mu_1,\dots,\mu_K09

and the rounded μ1,,μK\mu_1,\dots,\mu_K10 satisfies

μ1,,μK\mu_1,\dots,\mu_K11

for some constant μ1,,μK\mu_1,\dots,\mu_K12 (Choi et al., 17 Oct 2025).
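
The rounding step can be illustrated on a toy instance. The sketch below omits the SDP solve (which needs an external solver) and covers only two pieces: exact maximisation of a quadratic form over $\{-1,+1\}^n$ by enumeration, and random-hyperplane rounding of unit vectors. The matrix `q` and the input `vectors` are hypothetical stand-ins, not quantities from the paper.

```python
import itertools
import random

def quadratic_form(q, x):
    # Value of x^T q x for a square matrix given as nested lists.
    n = len(x)
    return sum(q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

def brute_force_qubo(q):
    """Exact maximiser over {-1, +1}^n; the cost grows as 2^n."""
    n = len(q)
    return max(itertools.product((-1, 1), repeat=n),
               key=lambda x: quadratic_form(q, x))

def hyperplane_round(vectors, rng):
    """Round unit vectors v_i to signs x_i = sign(<v_i, g>) with g ~ N(0, I)."""
    dim = len(vectors[0])
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return tuple(1 if sum(a * b for a, b in zip(v, g)) >= 0 else -1
                 for v in vectors)

# Toy indefinite matrix (hypothetical); enumeration is only viable for tiny n.
q = [[0, 2, -1],
     [2, 0, 1],
     [-1, 1, 0]]
x_exact = brute_force_qubo(q)

# Stand-in for the factor of an SDP solution: a rank-one set of unit vectors.
vectors = [(1.0,), (1.0,), (1.0,)]
x_rounded = hyperplane_round(vectors, random.Random(0))
print(quadratic_form(q, x_exact), quadratic_form(q, x_rounded))
```

Enumeration makes the exponential cost of the exact commit problem visible, while rounding costs only one Gaussian draw and one inner product per coordinate once the vectors are available.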

The regret decomposition is

μ1,,μK\mu_1,\dots,\mu_K13

where μ1,,μK\mu_1,\dots,\mu_K14 is the cost of exploration, μ1,,μK\mu_1,\dots,\mu_K15 is the optimization error induced by model estimation, and μ1,,μK\mu_1,\dots,\mu_K16 is the QUBO-rounding loss. With μ1,,μK\mu_1,\dots,\mu_K17, μ1,,μK\mu_1,\dots,\mu_K18, and μ1,,μK\mu_1,\dots,\mu_K19, the paper obtains

$$\mathrm{Regret}(T) = \tilde{\mathcal O}\big(T^{2/3}\big)$$

with high probability (Choi et al., 17 Oct 2025).

6. Alternative commitment objectives: matching markets and risk-averse finite-time exploitation

ETC has also been adapted to decentralized matching markets. In "Explore-then-Commit Algorithms for Decentralized Two-Sided Matching Markets," the epoch-based collision-avoidance ETC algorithm, CA-ETC, addresses a setting with μ1,,μK\mu_1,\dots,\mu_K21 players, μ1,,μK\mu_1,\dots,\mu_K22 arms, unknown strict preference orderings on both sides, and collision resolution in which an arm accepts its most preferred proposer (Pagare et al., 2024). CA-ETC first performs an index-estimation procedure of length μ1,,μK\mu_1,\dots,\mu_K23, then runs epochs with total length

μ1,,μK\mu_1,\dots,\mu_K24

Exploration is collision-free because player μ1,,μK\mu_1,\dots,\mu_K25 pulls

μ1,,μK\mu_1,\dots,\mu_K26

so distinct indices imply distinct arms in each round. Ranking estimation uses empirical means and confidence intervals

μ1,,μK\mu_1,\dots,\mu_K27

and the commit sub-phase runs deferred acceptance on the estimated rankings. The main guarantee is

μ1,,μK\mu_1,\dots,\mu_K28

up to lower-order terms, whereas a blackboard baseline with communication achieves logarithmic regret μ1,,μK\mu_1,\dots,\mu_K29 (Pagare et al., 2024). The trade-off is explicit: decentralization removes communication but worsens the horizon dependence.
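
The two building blocks of such a scheme can be sketched in isolation. The round-robin schedule `(index + t) % num_arms` is an assumed form of a collision-free assignment (it requires at most as many players as arms), and the player-proposing deferred acceptance below is the textbook centralized procedure run on given rankings, not the decentralized protocol of the paper.

```python
def round_robin_arm(index, t, num_arms):
    """Arm pulled in round t by the player with this index (assumed schedule)."""
    return (index + t) % num_arms

def deferred_acceptance(player_prefs, arm_prefs):
    """Player-proposing deferred acceptance; returns {player: arm}.

    player_prefs[p] lists arms from most to least preferred;
    arm_prefs[a] lists players from most to least preferred.
    """
    rank = {a: {p: r for r, p in enumerate(prefs)}
            for a, prefs in arm_prefs.items()}
    next_choice = {p: 0 for p in player_prefs}
    holder = {}  # arm -> player it currently holds
    free = list(player_prefs)
    while free:
        p = free.pop()
        if next_choice[p] >= len(player_prefs[p]):
            continue  # p has been rejected by every arm on its list
        a = player_prefs[p][next_choice[p]]
        next_choice[p] += 1
        current = holder.get(a)
        if current is None:
            holder[a] = p
        elif rank[a][p] < rank[a][current]:
            # The arm prefers the new proposer and releases its current match.
            holder[a] = p
            free.append(current)
        else:
            free.append(p)
    return {p: a for a, p in holder.items()}

# Estimated rankings (hypothetical): both players prefer arm "a".
player_prefs = {0: ["a", "b"], 1: ["a", "b"]}
arm_prefs = {"a": [1, 0], "b": [0, 1]}
matching = deferred_acceptance(player_prefs, arm_prefs)
print(matching)
```

With distinct player indices the schedule never sends two players to the same arm in a round, so exploration samples are not lost to collisions; the quality of the final matching then depends on how accurate the estimated rankings are.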

A different modification concerns the criterion used at commitment. "Risk-Averse Explore-Then-Commit Algorithms for Finite-Time Bandits" does not define the best arm by expected reward alone (Yekkehkhany et al., 2019). For exploitation horizon μ1,,μK\mu_1,\dots,\mu_K30, it considers

μ1,,μK\mu_1,\dots,\mu_K31

and commits to

μ1,,μK\mu_1,\dots,\mu_K32

The OTE-MAB algorithm treats the case μ1,,μK\mu_1,\dots,\mu_K33, and FTE-MAB handles general μ1,,μK\mu_1,\dots,\mu_K34 by estimating μ1,,μK\mu_1,\dots,\mu_K35 from exploration samples and then committing to the empirical maximizer. The finite-time guarantees are explicit: if

μ1,,μK\mu_1,\dots,\mu_K36

then OTE-MAB ensures μ1,,μK\mu_1,\dots,\mu_K37, and if

μ1,,μK\mu_1,\dots,\mu_K38

then the same holds for FTE-MAB (Yekkehkhany et al., 2019). Here ETC is retained, but the committed object is optimized for the probability of outperforming alternatives over a finite exploitation window rather than for asymptotic mean reward.

Taken together, these variants show that ETC is a broad architectural template rather than a single regret profile. Depending on the domain, commitment may target an arm, a ranking, a controller, or an open-loop sequence; the central technical question is always how much structure can be learned during the explore phase, and how costly it is to stop learning thereafter.
