Explore-Then-Commit (ETC) Methods

Updated 5 July 2026

ETC is a class of policies that separates an initial pure exploration phase from an irrevocable commitment phase to balance information gathering with exploitation.
It is applied in settings like stochastic bandits, control systems, and decentralized matching markets to manage the trade-off between exploration costs and the risks of early commitment.
Variants such as SPRT-ETC and Double ETC demonstrate that refining the phase structure can achieve near-optimal regret performance while addressing the limitations of single-stage commitment.

Explore-Then-Commit (ETC) denotes a family of policies that allocate an initial segment of the horizon to pure exploration and then irrevocably commit to a decision rule for the remainder. In the literature considered here, the committed object ranges from a single arm in stochastic bandits to a controller in partially observable linear systems, an open-loop action sequence in nonstationary linear bandits with latent dynamics, a stable matching in decentralized two-sided markets, and an arm selected under a finite-time risk-return criterion (Garivier et al., 2016, Jin et al., 2020, Lale et al., 2020, Choi et al., 17 Oct 2025, Pagare et al., 2024, Yekkehkhany et al., 2019). The defining feature is the same across domains: ETC separates information acquisition from exploitation, so its performance is governed by the tension between the cost of exploration and the consequences of committing with imperfect information.

1. Canonical formulation

In the classical stochastic bandit setting, ETC is a two-stage strategy. For a $K$-armed bandit with unknown means $\mu_1,\dots,\mu_K$, one picks an exploration budget $n$, pulls each arm $n$ times, computes empirical means $\hat\mu_i$, and then commits to

$i^*=\arg\max_i \hat\mu_i$

for the remaining $T-Kn$ rounds (Jin et al., 2020). In the two-armed Gaussian formulation, ETC can be stated more generally through an exploration stopping time $\tau\le T$, an $\mathcal F_\tau$-measurable decision $\hat a\in\{1,2\}$, alternating uniform sampling during rounds $t\le\tau$, and commitment to $\hat a$ for $t>\tau$ (Garivier et al., 2016).
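The fixed-budget procedure described above can be sketched in a few lines. This is a minimal illustration rather than code from the cited papers: it assumes unit-variance Gaussian rewards, and the function name `explore_then_commit` and its parameters are hypothetical.

```python
import random


def explore_then_commit(means, n, horizon, seed=0):
    """Fixed-budget ETC on a Gaussian bandit (unit variance is an assumption)."""
    rng = random.Random(seed)
    k = len(means)
    if n < 1 or k * n > horizon:
        raise ValueError("need 1 <= n and K*n <= T")
    sums = [0.0] * k
    pulls = []
    # Exploration: pull each arm exactly n times in round-robin order.
    for t in range(k * n):
        arm = t % k
        sums[arm] += rng.gauss(means[arm], 1.0)
        pulls.append(arm)
    # Commitment: play the empirical best arm for the remaining T - K*n rounds.
    empirical_means = [s / n for s in sums]
    committed = max(range(k), key=lambda i: empirical_means[i])
    pulls.extend([committed] * (horizon - k * n))
    # Pseudo-regret: sum of the gaps of the arms actually played.
    best = max(means)
    regret = sum(best - means[a] for a in pulls)
    return committed, regret


committed_arm, regret = explore_then_commit([0.0, 1.0, 0.2], n=200, horizon=5000)
```

With a large budget the commitment is almost always correct and the regret is essentially the exploration cost; shrinking `n` lowers that cost but raises the chance of committing to a suboptimal arm for the rest of the horizon.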

The standard regret expressions already expose the ETC structure. In the $K$-armed case,

$\mu_1,\dots,\mu_K$ 4

whereas in the two-armed Gaussian case with $\mu_1,\dots,\mu_K$ 5 and $\mu_1,\dots,\mu_K$ 6,

$\mu_1,\dots,\mu_K$ 7

with

$\mu_1,\dots,\mu_K$ 8

This yields the basic decomposition

$\mu_1,\dots,\mu_K$ 9

so ETC regret is the sum of an exploration term and a post-commitment misidentification term (Garivier et al., 2016).
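As a worked check of this decomposition, the following sketch assumes notation not fixed by the text above: $\Delta$ for the gap between the two means, $a^*$ for the better arm, $A_t$ for the arm played in round $t$, and an even stopping time $\tau$, so that alternating sampling pulls each arm exactly $\tau/2$ times during exploration.

```latex
\begin{align*}
R_T &= \Delta\,\mathbb{E}\Big[\sum_{t=1}^{T}\mathbf{1}\{A_t\neq a^*\}\Big] \\
    &= \underbrace{\frac{\Delta}{2}\,\mathbb{E}[\tau]}_{\text{exploration}}
     \;+\; \underbrace{\Delta\,\mathbb{E}\big[(T-\tau)\,\mathbf{1}\{\hat a\neq a^*\}\big]}_{\text{misidentification}} .
\end{align*}
```

The first term grows with the length of exploration, while the second is nonzero only on the event that the committed arm is wrong, in which case every one of the remaining $T-\tau$ rounds costs $\Delta$.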

A related finite-horizon formulation appears in risk-averse bandits. There, an ETC policy is a pair $n$ 0: the learner performs pure experimentation for $n$ 1 rounds and then commits to one fixed arm for $n$ 2 exploitations (Yekkehkhany et al., 2019). This preserves the stagewise architecture of ETC while changing the criterion used to select the committed arm.

2. Regret structure and the classical suboptimality of single-stage ETC

The main negative result attached to ETC is not that it fails to achieve sublinear regret, but that a single exploration phase followed by a single commitment phase is generally asymptotically suboptimal relative to fully sequential policies. In the two-armed Gaussian setting, fixed-design ETC with known gap $\Delta$ and deterministic $\tau=2n$ has

$n$ 5

and the optimal fixed budget satisfies

$n$ 6

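A matching lower bound gives $\lim_{T\to\infty} R_T/\log T = 4/\Delta$ for the optimally tuned fixed design, where $R_T$ denotes the regret over horizon $T$ (Garivier et al., 2016).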

Allowing a data-dependent stopping time improves the constant but not the basic limitation. The SPRT-based ETC rule, which alternates arms until

$n$ 8

achieves

$n$ 9

hence $\limsup_{T\to\infty} R_T/\log T\le 1/\Delta$, but no ETC can achieve an asymptotic constant smaller than $1/\Delta$ in the known-gap case (Garivier et al., 2016).

When $\Delta$ is unknown, the gap is larger. A fixed-confidence best-arm identification subroutine with risk $\delta=1/T$ yields BAI-ETC with asymptotic behavior $R_T\sim 4\log T/\Delta$, and any ETC with unknown $\Delta$ has asymptotic cost at least $4\log T/\Delta$ (Garivier et al., 2016). By contrast, fully sequential algorithms in the same paper attain smaller constants: $\Delta$-UCB satisfies $\limsup_{T\to\infty} R_T/\log T\le 1/(2\Delta)$ in the known-gap case, and a UCB-type policy has asymptotic regret $2\log T/\Delta$ in the unknown-gap case. The standard conclusion is therefore precise: ETC is not uniformly optimal because its forced separation of exploration and exploitation creates a factor-two asymptotic penalty in the two-armed Gaussian model (Garivier et al., 2016).

This corrects a common misconception. ETC is not merely a crude fixed-design heuristic; stopping-time variants such as SPRT-ETC are statistically refined. The limitation is structural: once commitment occurs, no further information can correct an early error.
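To make the stopping-time idea concrete, here is a hedged sketch of an SPRT-style ETC rule for two unit-variance Gaussian arms with known gap. The stopping threshold used here, `gap * |S_n| >= log(T * gap**2)` with `S_n` the running sum of paired reward differences, is an assumption chosen for illustration and may differ from the rule analyzed in the cited paper; the function name `sprt_etc` is hypothetical.

```python
import math
import random


def sprt_etc(mu1, mu2, gap, horizon, seed=0):
    """SPRT-style ETC sketch for two unit-variance Gaussian arms, known gap.

    Assumed stopping rule: stop once gap * |S_n| >= log(horizon * gap**2),
    where S_n is the running sum of paired differences (arm 1 minus arm 2).
    """
    if horizon * gap ** 2 <= 1:
        raise ValueError("threshold log(T * gap^2) must be positive")
    rng = random.Random(seed)
    threshold = math.log(horizon * gap ** 2)
    diff_sum = 0.0
    tau = 0
    # Exploration: alternate the two arms, i.e. sample them in pairs.
    while tau + 2 <= horizon:
        diff_sum += rng.gauss(mu1, 1.0) - rng.gauss(mu2, 1.0)
        tau += 2
        if gap * abs(diff_sum) >= threshold:
            break
    # Commitment: the sign of the accumulated difference picks the arm.
    committed = 1 if diff_sum >= 0 else 2
    best = 1 if mu1 >= mu2 else 2
    true_gap = abs(mu1 - mu2)
    # Exploration pulls the worse arm tau/2 times; a wrong commitment
    # costs true_gap on each of the remaining horizon - tau rounds.
    regret = true_gap * (tau / 2)
    if committed != best:
        regret += true_gap * (horizon - tau)
    return tau, committed, regret


tau, committed, regret = sprt_etc(1.0, 0.0, gap=1.0, horizon=10_000)
```

The exploration length is now random: easy instances stop early and hard ones keep sampling, but once the rule fires no later observation can revise the decision.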

3. Double ETC and batched asymptotic optimality

"Double Explore-then-Commit: Asymptotic Optimality and Beyond" shows that the preceding negative conclusion is specific to single-stage ETC rather than to stagewise learning itself (Jin et al., 2020). The proposed Double ETC (DETC) introduces two exploration phases and two commitment phases.

In the known-gap two-armed version, DETC uses

$\hat\mu_i$ 1

Stage I samples both arms until each has been pulled $\hat\mu_i$ 2 times. Stage II commits temporarily to the empirical best arm $\hat\mu_i$ 3 until it has been pulled $\hat\mu_i$ 4 times. Stage III re-opens exploration of the other arm $\hat\mu_i$ 5 and continues until

$\hat\mu_i$ 6

fails, or $\hat\mu_i$ 7. Stage IV commits to the arm with larger empirical mean (Jin et al., 2020). The unknown-gap version replaces $\hat\mu_i$ 8 with data-dependent confidence tests and uses $\hat\mu_i$ 9.

The paper’s interpretation is that the second exploration phase compares the temporarily chosen arm against fresh pulls of the unchosen arm, so the decisive sampling noise after Stage II comes from only one arm. This yields asymptotic optimality:

$\lim_{T\to\infty} \frac{R_T}{\log T} = \frac{2}{\Delta}$

in the unknown-gap two-arm case, and for $K$ arms with gaps $\Delta_i$

$\lim_{T\to\infty} \frac{R_T}{\log T} = \sum_{i:\Delta_i>0} \frac{2}{\Delta_i}.$

The same paper extends DETC to batched bandits and proves that a batched version attains the same asymptotic regret with only $O(1)$ rounds (Jin et al., 2020).

A plausible implication is that ETC should be understood less as a single algorithm than as a design principle. Once the phase structure is allowed to include re-entry into exploration, ETC-type methods can recover asymptotic optimality while retaining low round complexity.

4. ETC in partially observable linear quadratic control

In control, ETC appears in a more structured form. "Regret Minimization in Partially Observable Linear Quadratic Control" studies an unknown discrete-time partially observable linear system

$x_{t+1}=Ax_t+Bu_t+w_t,\qquad y_t=Cx_t+z_t$

with Gaussian process and measurement noise, per-step cost

$c_t=y_t^\top Q\,y_t+u_t^\top R\,u_t$

and comparison against the optimal average cost under the information pattern $i^*=\arg\max_i \hat\mu_i$ 6 (Lale et al., 2020).

The proposed algorithm, ExpCommit, dedicates the first $i^*=\arg\max_i \hat\mu_i$ 7 steps to pure exploration by drawing

$i^*=\arg\max_i \hat\mu_i$ 8

From the resulting input-output data, it forms overlapping $i^*=\arg\max_i \hat\mu_i$ 9-length regressors and estimates the truncated Markov-parameter matrix

$T-Kn$ 0

through least squares. Under $T-Kn$ 1 and sufficiently large $T-Kn$ 2, the paper proves

$T-Kn$ 3

with high probability. Ho-Kalman is then applied to recover an order- $T-Kn$ 4 realization $T-Kn$ 5 up to a similarity transform, and high-probability confidence sets for $T-Kn$ 6 are constructed (Lale et al., 2020).
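The least-squares step in this pipeline can be illustrated on a toy problem. The sketch below is not the ExpCommit implementation: it assumes a scalar system $x_{t+1}=ax_t+bu_t+w_t$, $y_t=cx_t+z_t$ with standard normal exploration inputs and noise standard deviation 0.1, regresses each output on the previous `h` inputs, and recovers the first `h` Markov parameters $c\,a^k b$. All function names and constants are hypothetical.

```python
import random


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / m[r][r]
    return x


def estimate_markov_parameters(a, b, c, h, steps, seed=0):
    """Estimate [cb, cab, ..., c a^(h-1) b] from open-loop exploration data."""
    rng = random.Random(seed)
    x = 0.0
    inputs, outputs = [], []
    for _ in range(steps):
        u = rng.gauss(0.0, 1.0)                       # exploration input
        outputs.append(c * x + rng.gauss(0.0, 0.1))   # noisy observation y_t
        inputs.append(u)
        x = a * x + b * u + rng.gauss(0.0, 0.1)       # latent state update
    # Normal equations for regressing y_t on (u_{t-1}, ..., u_{t-h}).
    gram = [[0.0] * h for _ in range(h)]
    cross = [0.0] * h
    for t in range(h, steps):
        phi = [inputs[t - 1 - k] for k in range(h)]
        for i in range(h):
            cross[i] += phi[i] * outputs[t]
            for j in range(h):
                gram[i][j] += phi[i] * phi[j]
    return solve_linear(gram, cross)


g_hat = estimate_markov_parameters(a=0.5, b=1.0, c=1.0, h=4, steps=5000)
g_true = [1.0 * 0.5 ** k * 1.0 for k in range(4)]
```

Because the system is stable, the contribution of inputs older than `h` steps decays geometrically, which is why a truncated regression can recover the leading Markov parameters.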

The commit phase is not a static arm choice. ExpCommit selects an optimistic model $T-Kn$ 7 inside the confidence set, computes the corresponding steady-state gains $T-Kn$ 8, and applies the certainty-equivalent controller

$T-Kn$ 9

with the state estimate updated by the Kalman filter associated with $\tau\le T$ 0 (Lale et al., 2020). The regret analysis relies on a Bellman-optimality identity and a decomposition of the commit-phase regret into

$\tau\le T$ 1

where different terms capture mismatches in $\tau\le T$ 2 and in noise cross-moments. Choosing $\tau\le T$ 3 yields

$\mathrm{Regret}(T)=\tilde O\big(T^{2/3}\big)$

A central technical point is stability. Under the assumptions $\tau\le T$ 5, controllability, observability, and uniform contractibility, the paper shows that the optimistic covariance and filter gain satisfy

$\tau\le T$ 6

and that latent-state estimates and observations remain uniformly bounded for all $\tau\le T$ 7, preventing blow-up during commitment (Lale et al., 2020).

5. ETC for nonstationary linear bandits with latent dynamics

"Explore-then-Commit for Nonstationary Linear Bandits with Latent Dynamics" specializes ETC to a bandit problem in which rewards depend on both actions and latent states, and the latent dynamics also depend on actions (Choi et al., 17 Oct 2025). The model is

$\tau\le T$ 8

with unknown stable $\tau\le T$ 9, action set $\mathcal F_\tau$ 0 or $\mathcal F_\tau$ 1, and bilinear reward

$\mathcal F_\tau$ 2

The benchmark is the best open-loop sequence

$\mathcal F_\tau$ 3

where $\mathcal F_\tau$ 4 and $\mathcal F_\tau$ 5 is a block-Toeplitz matrix with blocks $\mathcal F_\tau$ 6 for $\mathcal F_\tau$ 7 (Choi et al., 17 Oct 2025).

The exploration phase fixes a length $\mathcal F_\tau$ 8 and plays random Rademacher actions for $\mathcal F_\tau$ 9. For Markov parameters

$\hat a\in\{1,2\}$ 0

the reward can be unrolled for $\hat a\in\{1,2\}$ 1 as

$\hat a\in\{1,2\}$ 2

with $\hat a\in\{1,2\}$ 3. Stacking covariates $\hat a\in\{1,2\}$ 4 yields the least-squares estimator

$\hat a\in\{1,2\}$ 5

If $\hat a\in\{1,2\}$ 6, then with probability at least $\hat a\in\{1,2\}$ 7,

$\hat a\in\{1,2\}$ 8

so choosing $\hat a\in\{1,2\}$ 9 makes truncation negligible and leaves a statistical error of order $\mu_1,\dots,\mu_K$ 00 (Choi et al., 17 Oct 2025).

In the commit phase, the algorithm forms an estimated block-Toeplitz matrix $\mu_1,\dots,\mu_K$ 01 for $\mu_1,\dots,\mu_K$ 02, sets $\mu_1,\dots,\mu_K$ 03, and solves

$\mu_1,\dots,\mu_K$ 04

This is an indefinite quadratic form over the hypercube, i.e. a QUBO, and is NP-hard. The practical method proposed in the paper is an SDP relaxation with Goemans-Williamson rounding:

$\mu_1,\dots,\mu_K$ 05

followed by $\mu_1,\dots,\mu_K$ 06, a Gaussian random vector $\mu_1,\dots,\mu_K$ 07, and the rounded decision $\mu_1,\dots,\mu_K$ 08. The guarantee is

$\mu_1,\dots,\mu_K$ 09

and the rounded $\mu_1,\dots,\mu_K$ 10 satisfies

$\mu_1,\dots,\mu_K$ 11

for some constant $\mu_1,\dots,\mu_K$ 12 (Choi et al., 17 Oct 2025).

The regret decomposition is

$\mu_1,\dots,\mu_K$ 13

where $\mu_1,\dots,\mu_K$ 14 is the cost of exploration, $\mu_1,\dots,\mu_K$ 15 is the optimization error induced by model estimation, and $\mu_1,\dots,\mu_K$ 16 is the QUBO-rounding loss. With $\mu_1,\dots,\mu_K$ 17, $\mu_1,\dots,\mu_K$ 18, and $\mu_1,\dots,\mu_K$ 19, the paper obtains

$\mu_1,\dots,\mu_K$ 20

with high probability (Choi et al., 17 Oct 2025).

6. Alternative commitment objectives: matching markets and risk-averse finite-time exploitation

ETC has also been adapted to decentralized matching markets. In "Explore-then-Commit Algorithms for Decentralized Two-Sided Matching Markets," the epoch-based collision-avoidance ETC algorithm, CA-ETC, addresses a setting with $\mu_1,\dots,\mu_K$ 21 players, $\mu_1,\dots,\mu_K$ 22 arms, unknown strict preference orderings on both sides, and collision resolution in which an arm accepts its most preferred proposer (Pagare et al., 2024). CA-ETC first performs an index-estimation procedure of length $\mu_1,\dots,\mu_K$ 23, then runs epochs with total length

$\mu_1,\dots,\mu_K$ 24

Exploration is collision-free because player $\mu_1,\dots,\mu_K$ 25 pulls

$\mu_1,\dots,\mu_K$ 26

so distinct indices imply distinct arms in each round. Ranking estimation uses empirical means and confidence intervals

$\mu_1,\dots,\mu_K$ 27

and the commit sub-phase runs deferred acceptance on the estimated rankings. The main guarantee is

$\mu_1,\dots,\mu_K$ 28

up to lower-order terms, whereas a blackboard baseline with communication achieves logarithmic regret $\mu_1,\dots,\mu_K$ 29 (Pagare et al., 2024). The trade-off is explicit: decentralization removes communication but worsens the horizon dependence.
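Two ingredients of this scheme are easy to sketch. The round-robin rule below is an assumed form of a collision-free schedule (player index plus round number, modulo the number of arms), and the commit step is shown as textbook player-proposing deferred acceptance; neither is taken verbatim from the cited paper, and all names are hypothetical.

```python
def round_robin_arm(player_index, t, num_arms):
    """Assumed collision-free schedule: distinct indices in
    0..num_arms-1 map to distinct arms in every round t."""
    return (player_index + t) % num_arms


def deferred_acceptance(player_prefs, arm_prefs):
    """Player-proposing Gale-Shapley on (estimated) rankings.

    player_prefs[p]: list of arms, most preferred first.
    arm_prefs[a]: list of players, most preferred first.
    Returns a dict mapping each matched player to its arm.
    """
    rank = {a: {p: r for r, p in enumerate(prefs)}
            for a, prefs in arm_prefs.items()}
    next_choice = {p: 0 for p in player_prefs}
    holder = {}                      # arm -> player it currently holds
    free = list(player_prefs)
    while free:
        p = free.pop()
        if next_choice[p] >= len(player_prefs[p]):
            continue                 # p exhausted its list, stays unmatched
        a = player_prefs[p][next_choice[p]]
        next_choice[p] += 1
        current = holder.get(a)
        if current is None:
            holder[a] = p            # arm was free: tentatively accept
        elif rank[a][p] < rank[a][current]:
            holder[a] = p            # arm prefers the new proposer
            free.append(current)
        else:
            free.append(p)           # arm rejects the new proposer
    return {p: a for a, p in holder.items()}


player_prefs = {0: ["x", "y"], 1: ["x", "y"]}
arm_prefs = {"x": [1, 0], "y": [0, 1]}
matching = deferred_acceptance(player_prefs, arm_prefs)
```

In this example both players prefer arm `x`, but `x` prefers player 1, so player 0 is rejected there and ends up matched to `y`. If the estimated rankings fed to this step are wrong, the committed matching can be unstable for the true preferences, which is the misidentification risk of ETC in this setting.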

A different modification concerns the criterion used at commitment. "Risk-Averse Explore-Then-Commit Algorithms for Finite-Time Bandits" does not define the best arm by expected reward alone (Yekkehkhany et al., 2019). For exploitation horizon $\mu_1,\dots,\mu_K$ 30, it considers

$\mu_1,\dots,\mu_K$ 31

and commits to

$\mu_1,\dots,\mu_K$ 32

The OTE-MAB algorithm treats the case $\mu_1,\dots,\mu_K$ 33, and FTE-MAB handles general $\mu_1,\dots,\mu_K$ 34 by estimating $\mu_1,\dots,\mu_K$ 35 from exploration samples and then committing to the empirical maximizer. The finite-time guarantees are explicit: if

$\mu_1,\dots,\mu_K$ 36

then OTE-MAB ensures $\mu_1,\dots,\mu_K$ 37, and if

$\mu_1,\dots,\mu_K$ 38

then the same holds for FTE-MAB (Yekkehkhany et al., 2019). Here ETC is retained, but the committed object is optimized for the probability of outperforming alternatives over a finite exploitation window rather than for asymptotic mean reward.
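A toy illustration of why this criterion can disagree with the mean: the sketch below assumes a simplified one-shot version in which each exploration round samples every arm once and an arm "wins" a round if its reward is the largest (ties count for every tied arm). The win-rate estimator and the function name are assumptions for illustration, not the estimator defined in the cited paper.

```python
def empirical_win_rates(samples):
    """samples[i][j] is the reward of arm i in exploration round j.

    Returns, for each arm, the fraction of rounds in which its reward
    was at least as large as every other arm's reward in that round.
    """
    k = len(samples)
    n = len(samples[0])
    wins = [0] * k
    for j in range(n):
        column = [samples[i][j] for i in range(k)]
        top = max(column)
        for i in range(k):
            if column[i] == top:
                wins[i] += 1
    return [w / n for w in wins]


# Arm 0 pays a steady 0.6; arm 1 pays 5.0 once in five rounds, else 0.
samples = [
    [0.6, 0.6, 0.6, 0.6, 0.6],
    [5.0, 0.0, 0.0, 0.0, 0.0],
]
rates = empirical_win_rates(samples)
means = [sum(row) / len(row) for row in samples]
risk_averse_choice = max(range(len(rates)), key=lambda i: rates[i])
mean_choice = max(range(len(means)), key=lambda i: means[i])
```

Here the mean criterion selects arm 1 (empirical mean 1.0 against 0.6), while the win-rate criterion selects arm 0, which comes out ahead in four rounds out of five.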

Taken together, these variants show that ETC is a broad architectural template rather than a single regret profile. Depending on the domain, commitment may target an arm, a ranking, a controller, or an open-loop sequence; the central technical question is always how much structure can be learned during the explore phase, and how costly it is to stop learning thereafter.