Double Explore-Then-Commit (DETC) Frameworks

Updated 5 July 2026

Double Explore-Then-Commit (DETC) is a framework with two exploration phases and two commitment phases, designed to decouple estimation from exploitation.
It achieves asymptotically optimal regret in batched stochastic bandits by refining traditional ETC methods with a four-stage design.
DETC also generalizes to decentralized matching and multi-firm pricing, enabling synchronized, two-sided learning and coordinated market outcomes.

Double Explore-Then-Commit (DETC) denotes a family of sequential decision procedures built around two exploration periods separated by an intermediate commitment, or, more broadly, around multiple agents or market sides that explore and then commit in a synchronized way. In the stochastic bandit literature, DETC was introduced as a four-stage refinement of standard explore-then-commit that attains the asymptotically optimal logarithmic regret constants while retaining a non-fully-sequential structure (Jin et al., 2020). In related literatures, the label is used more broadly: in decentralized two-sided matching markets it describes a double-sided explore-and-commit protocol in which both players and arms learn and then commit through Gale–Shapley (Pagare et al., 2024), while in multi-firm algorithmic pricing it names a simultaneous multi-firm explore-then-commit pipeline whose misspecified exploitation phase can produce sustained supra-competitive prices (Baek et al., 15 May 2026). The term therefore refers less to a single invariant algorithm than to a structural pattern: exploration is separated from durable exploitation or commitment, and the “double” feature arises either from two exploration–commit pairs or from two interacting decision-making entities entering commitment together.

1. Conceptual origin and relation to explore-then-commit

Standard explore-then-commit (ETC) consists of an exploration phase followed by an exploitation phase. In the two-armed Gaussian bandit setting, ETC is formalized by alternating samples from the two arms up to a stopping time $\tau$, then committing to a single arm $\hat a$ for all remaining rounds (Garivier et al., 2016). Its regret decomposes through the number of exploratory pulls and the probability of committing to the wrong arm, which makes explicit the statistical cost of finite exploration before permanent exploitation (Garivier et al., 2016).
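The alternate-then-commit structure can be illustrated with a minimal Python sketch. It is not the construction analyzed by Garivier et al. (2016): the data-dependent stopping time $\tau$ is replaced by a fixed per-arm budget `n_explore`, which is an assumption made for illustration, as are the function name and the unit-variance Gaussian rewards.

```python
import random

def etc_two_armed(means, n_explore, horizon, seed=0):
    """Fixed-budget explore-then-commit on a two-armed, unit-variance Gaussian bandit.

    n_explore is a hypothetical per-arm exploration budget standing in for the
    stopping time tau. Assumes 2 * n_explore <= horizon and n_explore >= 1.
    """
    rng = random.Random(seed)
    sums = [0.0, 0.0]
    pulls = [0, 0]
    # Exploration: alternate between the two arms.
    for t in range(2 * n_explore):
        arm = t % 2
        sums[arm] += rng.gauss(means[arm], 1.0)
        pulls[arm] += 1
    # Commitment: play the empirically better arm for all remaining rounds.
    committed = 0 if sums[0] / pulls[0] >= sums[1] / pulls[1] else 1
    pulls[committed] += horizon - 2 * n_explore
    # Pseudo-regret depends only on how often each arm was pulled.
    best = max(means)
    regret = sum((best - means[a]) * pulls[a] for a in range(2))
    return committed, pulls, regret
```

With a large gap the committed arm is almost always the better one, so the regret of a run is just the gap times the exploratory pulls of the worse arm; a wrong commitment would instead cost the gap on every remaining round, which is the trade-off the decomposition above makes explicit.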

DETC emerged from the observation that a single exploration phase imposes a statistical bottleneck. In the subgaussian multi-armed bandit setting, the 2020 DETC paper replaces the single explore-then-commit pattern with two exploration stages and two commitment stages, producing a four-stage architecture that preserves batchability while matching the asymptotic constants achieved by fully sequential policies (Jin et al., 2020). The core intuition stated in that work is that the extra exploration–commit pair decouples the estimation burden: one commitment stage is used to estimate the apparent winner accurately, and the later re-exploration focuses on distinguishing the winner from the remaining arms (Jin et al., 2020).

A broader interpretation of DETC appears in later work. In decentralized matching, the “double” refers to both sides of the market learning and then committing in sync (Pagare et al., 2024). In multi-firm pricing, DETC is “exactly this multi-firm pipeline with the additional emphasis that two or multiple firms explore and then commit simultaneously to myopic updates based on their misspecified estimates,” so the term identifies a joint institutional timing rather than a particular bandit stopping rule (Baek et al., 15 May 2026).

2. Formal bandit DETC and asymptotic optimality

In stochastic $K$-armed bandits with 1-subgaussian rewards, horizon $T$ , unique best arm $i^*$ , means $\mu_i$ , and gaps $\Delta_i=\mu^*-\mu_i$ , regret is

$R_T=\mathbb{E}\!\left[\sum_{t=1}^T(\mu^*-\mu_{A_t})\right] =\sum_{i\neq i^*}\Delta_i\,\mathbb{E}[N_i(T)].$
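The two sides of this identity can be checked on a single realized action sequence. The sketch below computes the per-round sum and the gap-weighted pull counts for one run; the expectation in $R_T$ would be an average of such quantities over many runs. The function names are illustrative.

```python
from collections import Counter

def regret_per_round(means, actions):
    """Sum of (mu* - mu_{A_t}) over the played actions."""
    best = max(means)
    return sum(best - means[a] for a in actions)

def regret_from_counts(means, actions):
    """Sum of Delta_i * N_i(T), with N_i(T) the number of pulls of arm i."""
    best = max(means)
    counts = Counter(actions)
    return sum((best - means[i]) * n for i, n in counts.items())
```

For example, with means `[1.0, 0.5, 0.25]` and actions `[0, 1, 2, 0, 0, 1]`, both functions return 1.75: arm 1 contributes 2 pulls at gap 0.5 and arm 2 contributes 1 pull at gap 0.75.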

The asymptotic lower bounds targeted by DETC are $\sum_{i\neq i^*} 2/\Delta_i$ in the unknown-gap case and $\sum_{i\neq i^*} 1/(2\Delta_i)$ when the gaps are known (Jin et al., 2020).

For two arms with known gap $\Delta$, DETC uses four stages. Stage I performs uniform exploration for

$\hat a$ 1

Stage II commits to the empirical best arm until its total number of pulls reaches

$\hat a$ 2

with $\hat a$ 3. Stage III re-explores only the other arm until

$\hat a$ 4

Stage IV then commits permanently to the empirically better of the two estimates (Jin et al., 2020).
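The control flow of the four stages can be sketched in Python as below. This is a structural illustration only: `tau1`, `tau2`, and `stage3_stop` are caller-supplied placeholders rather than the thresholds of Jin et al. (2020), and the choice to compare the leader's running mean against fresh Stage III samples of the other arm is an assumption of this sketch.

```python
import random

def detc_two_armed(means, horizon, tau1, tau2, stage3_stop, seed=0):
    """Four-stage skeleton on a two-armed, unit-variance Gaussian bandit.

    tau1, tau2 and stage3_stop are hypothetical inputs, not the paper's
    thresholds. Assumes tau1 >= 1 and 2 * tau1 <= horizon.
    """
    rng = random.Random(seed)
    sums, counts, t = [0.0, 0.0], [0, 0], 0

    def pull(arm):
        nonlocal t
        reward = rng.gauss(means[arm], 1.0)
        sums[arm] += reward
        counts[arm] += 1
        t += 1
        return reward

    # Stage I: uniform exploration, tau1 pulls per arm.
    for arm in (0, 1):
        for _ in range(tau1):
            pull(arm)
    leader = 0 if sums[0] / counts[0] >= sums[1] / counts[1] else 1
    other = 1 - leader

    # Stage II: commit to the empirical leader until its pull count reaches tau2.
    while counts[leader] < tau2 and t < horizon:
        pull(leader)
    leader_mean = sums[leader] / counts[leader]

    # Stage III: re-explore only the other arm, using fresh samples.
    fresh_sum, fresh_n = 0.0, 0
    while t < horizon:
        fresh_sum += pull(other)
        fresh_n += 1
        if stage3_stop(leader_mean, fresh_sum / fresh_n, fresh_n):
            break

    # Stage IV: commit permanently to the empirically better estimate.
    if fresh_n > 0 and fresh_sum / fresh_n > leader_mean:
        final = other
    else:
        final = leader
    counts[final] += horizon - t
    return final, counts
```

A call such as `detc_two_armed([1.0, 0.0], 2000, 50, 200, lambda lead, oth, n: n >= 50)` uses a fixed Stage III budget purely as a stand-in for a real stopping rule. The point of the skeleton is that Stage II sharpens the estimate of the apparent winner cheaply, so Stage III only has to spend samples on the other arm.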

For two arms with unknown gap, the same four-stage pattern is retained, but the thresholds become self-normalized. Stage I stops at the smallest $\hat a$ 5 such that

$\hat a$ 6

and Stage III stops at the first $\hat a$ 7 such that

$\hat a$ 8

A further variant adds a small-gap detector by imposing $\hat a$ 9 and re-running a short uniform exploration if this cap is exceeded (Jin et al., 2020).

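The resulting asymptotic guarantees are sharp. For two arms, DETC satisfies

$\lim_{T\to\infty} \frac{R_T}{\log T} = \frac{1}{2\Delta}$

when $\Delta$ is known, and

$\lim_{T\to\infty} \frac{R_T}{\log T} = \frac{2}{\Delta}$

when $\Delta$ is unknown (Jin et al., 2020). For $K$ arms with unknown gaps, the $K$-armed extension achieves

$\lim_{T\to\infty} \frac{R_T}{\log T} = \sum_{i\neq i^*} \frac{2}{\Delta_i},$

which matches the targeted asymptotic lower bound (Jin et al., 2020).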

A central significance of these results is that DETC is presented as “the first non-fully-sequential algorithm that achieves such asymptotic optimality” and, in batched settings, the first such algorithm to obtain optimal asymptotic regret and constant round complexity simultaneously (Jin et al., 2020). This suggests that DETC occupies a specific niche between fully sequential index policies and classical fixed-stage batching.

3. DETC versus ETC-type lower bounds

The most important cautionary comparison comes from the earlier analysis of ETC-type strategies in two-armed Gaussian bandits. There, ETC means any policy with a finite exploration phase, possibly data-dependent, followed by permanent commitment to one arm (Garivier et al., 2016). The lower-bound argument extends to any strategy that “performs a finite number of exploration phases and then commits for the remainder of the horizon,” because one can take $\tau$ to be the final stopping time at which exploration ends and permanent exploitation begins (Garivier et al., 2016).

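For known gap $\Delta$, any uniformly efficient ETC strategy must satisfy

$\liminf_{T\to\infty} \frac{R_T}{\log T} \ge \frac{1}{\Delta},$

while the optimal fully sequential constant is

$\limsup_{T\to\infty} \frac{R_T}{\log T} \le \frac{1}{2\Delta}$

for $\Delta$-UCB, matching the lower bound for general strategies (Garivier et al., 2016). For unknown $\Delta$, the corresponding ETC lower bound is

$\liminf_{T\to\infty} \frac{R_T}{\log T} \ge \frac{4}{\Delta},$

whereas UCB* achieves

$\limsup_{T\to\infty} \frac{R_T}{\log T} \le \frac{2}{\Delta},$

again matching the general lower bound (Garivier et al., 2016).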

These results matter for DETC because they identify a semantic boundary. A policy called DETC is not automatically asymptotically optimal. If it is merely an ETC-type strategy with a finite final commitment time, then the factor-2 penalty proven in the Gaussian two-arm analysis still applies (Garivier et al., 2016). The 2020 bandit DETC escapes that barrier not by abandoning the broad explore-then-commit template, but by restructuring the information acquisition problem through a second exploration–commit pair and by avoiding the single-stage statistical coupling identified as the source of ETC’s constant loss (Jin et al., 2020).

A common misconception is therefore that “double” alone removes ETC suboptimality. The literature does not support that formulation. The relevant distinction is whether the second exploration phase changes the error allocation sufficiently to attain the Lai–Robbins-type constants, as in the specific DETC constructions of the 2020 bandit paper (Jin et al., 2020), rather than merely inserting an additional finite phase before a final irreversible commit (Garivier et al., 2016).
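A small worked calculation makes the size of the penalty concrete. The sketch below takes the general lower-bound constants from Section 2, $1/(2\Delta)$ for known gap and $2/\Delta$ for unknown gap, and applies the factor-2 penalty described above to obtain the ETC-type constants. It covers the leading $\log T$ term only; finite-horizon regret also includes lower-order terms that are ignored here. The function name and dictionary keys are illustrative.

```python
import math

def leading_constants(gap):
    """Leading log-T regret constants for a two-armed problem with the given gap."""
    optimal = {"known_gap": 1 / (2 * gap), "unknown_gap": 2 / gap}
    # ETC-type strategies pay a factor of 2 in the leading constant.
    etc_type = {case: 2 * c for case, c in optimal.items()}
    return optimal, etc_type

optimal, etc_type = leading_constants(0.5)
log_T = math.log(10 ** 6)
for case in optimal:
    print(case, optimal[case] * log_T, etc_type[case] * log_T)
```

For a gap of 0.5 the optimal constants are 1 and 4, while the ETC-type constants are 2 and 8, so at any large horizon the leading term of the regret is doubled.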

4. Double-sided DETC in decentralized matching markets

In decentralized two-sided matching markets, DETC refers to a structurally different object. The setting contains $T$ 5 players and $T$ 6 arms, time $T$ 7, unknown player-side means $T$ 8, unknown arm-side means $T$ 9, and stable matchings defined through the absence of blocking pairs (Pagare et al., 2024). Regret is measured against the player-optimal stable matching $i^*$ 0: $i^*$ 1

The proposed algorithm, epoch-based CA-ETC, is described as a “multi-phase explore-then-commit type algorithm” that is decentralized and communication-free (Pagare et al., 2024). In this setting, Double Explore-Then-Commit means that “both sides explore, learn preference rankings with confidence-separation, and then both sides commit (defer acceptance) using Gale–Shapley” (Pagare et al., 2024). The “double” therefore refers to symmetric learning and synchronized commitment by both market sides rather than to the four-stage bandit DETC architecture.

Each epoch has three main components. First, players perform an index-estimation subroutine by repeatedly proposing to a designated arm until accepted, thereby obtaining distinct indices without communication (Pagare et al., 2024). Second, the exploration phase lasts $i^*$ 2 rounds in epoch $i^*$ 3, during which round-robin scheduling avoids collisions and both players and arms update empirical means and confidence bounds (Pagare et al., 2024). Third, the commit phase lasts

$i^*$ 4

and runs decentralized Gale–Shapley using learned rankings if confidence intervals are separated, or fixed arbitrary rankings otherwise (Pagare et al., 2024).
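The commit phase relies on deferred acceptance. The sketch below is a centralized, player-proposing Gale–Shapley routine, given only to show the proposal and rejection logic; the protocol in Pagare et al. (2024) is decentralized and communication-free, which this simplified version does not attempt to reproduce. The data layout (dictionaries of preference lists) is an assumption of the sketch.

```python
def gale_shapley(player_prefs, arm_prefs):
    """Player-proposing deferred acceptance.

    player_prefs[p] lists arms from most to least preferred;
    arm_prefs[a] lists players from most to least preferred.
    Returns a dict mapping each matched player to its arm.
    """
    rank = {a: {p: r for r, p in enumerate(prefs)} for a, prefs in arm_prefs.items()}
    next_choice = {p: 0 for p in player_prefs}
    held = {}  # arm -> player whose proposal it currently holds
    free = list(player_prefs)
    while free:
        p = free.pop()
        if next_choice[p] >= len(player_prefs[p]):
            continue  # p has been rejected by every arm and stays unmatched
        a = player_prefs[p][next_choice[p]]
        next_choice[p] += 1
        if a not in held:
            held[a] = p
        elif rank[a][p] < rank[a][held[a]]:
            free.append(held[a])  # arm a trades up and releases its previous player
            held[a] = p
        else:
            free.append(p)  # arm a rejects p, who will propose to its next choice
    return {p: a for a, p in held.items()}
```

When both players rank arm `"a"` first but `"a"` prefers player 1, player 0 is rejected and settles for `"b"`. Because players propose, the outcome is the player-optimal stable matching for the rankings supplied, which is why the quality of the learned rankings drives regret in the commit phase.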

The confidence-separation condition is explicit. A player $i^*$ 5 checks whether there exists a permutation $i^*$ 6 such that

$i^*$ 7

with analogous conditions for arms (Pagare et al., 2024). The universal gap is

$i^*$ 8

where $i^*$ 9 and $\mu_i$ 0 are the minimal within-player and within-arm preference gaps (Pagare et al., 2024).
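The confidence-separation test for a single agent can be sketched as follows. Sorting by empirical mean and checking adjacent pairs is enough, because disjointness of neighbouring intervals implies disjointness of all pairs. The Hoeffding-style radius and the parameter `delta` are assumptions of this sketch, not the confidence bounds used by Pagare et al. (2024).

```python
import math

def separated_ranking(means, counts, delta=0.05):
    """Return a ranking (best first) if all confidence intervals are disjoint, else None.

    The radius sqrt(2 * log(1 / delta) / n) is a hypothetical choice made
    for illustration.
    """
    radius = [math.sqrt(2 * math.log(1 / delta) / n) for n in counts]
    order = sorted(range(len(means)), key=lambda k: means[k], reverse=True)
    for better, worse in zip(order, order[1:]):
        # Lower bound of the better option must exceed the upper bound of the worse one.
        if means[better] - radius[better] <= means[worse] + radius[worse]:
            return None
    return order
```

With empirical means `[0.9, 0.1, 0.5]` and 1000 samples each, the intervals separate and the ranking `[0, 2, 1]` is returned; with only 5 samples each the intervals overlap and the function returns `None`, corresponding to the fallback to fixed arbitrary rankings.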

Under the condition

$\mu_i$ 1

Theorem 2 gives the exact bound

$\mu_i$ 2

and the simplified order bound

$\mu_i$ 3

(Pagare et al., 2024).

A separate blackboard baseline, ETGS, assumes a global Boolean array and achieves

$\mu_i$ 4

(Pagare et al., 2024). This contrast clarifies that in matching markets DETC is associated with decentralization and two-sided learning, not with asymptotic optimality in the narrow Lai–Robbins sense used in bandits.

5. DETC as a multi-firm pricing pipeline

In the algorithmic pricing literature, DETC is mapped onto a multi-firm explore-then-exploit pipeline rather than a bandit arm-identification algorithm. Firms $\mu_i$ 5 face linear multi-product demand

$\mu_i$ 6

with $\mu_i$ 7, $\mu_i$ 8, and prices constrained to $\mu_i$ 9 (Baek et al., 15 May 2026). During the exploration phase of length $\Delta_i=\mu^*-\mu_i$ 0, firms randomize prices independently according to a distribution with mean vector $\Delta_i=\mu^*-\mu_i$ 1 and diagonal covariance $\Delta_i=\mu^*-\mu_i$ 2 (Baek et al., 15 May 2026).

After exploration, each firm fits the misspecified monopoly-style demand model

$\Delta_i=\mu^*-\mu_i$ 3

using only its own historical data and sets the next price myopically by solving

$\Delta_i=\mu^*-\mu_i$ 4

or, with costs,

$\Delta_i=\mu^*-\mu_i$ 5

When $\Delta_i=\mu^*-\mu_i$ 6, this reduces to

$\Delta_i=\mu^*-\mu_i$ 7

The paper states that DETC is “exactly this multi-firm pipeline with the additional emphasis that two or multiple firms explore and then commit simultaneously to myopic updates based on their misspecified estimates” (Baek et al., 15 May 2026).
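The fit-then-price step of a single firm can be sketched in Python under a simple assumed form: a one-dimensional linear demand `d = intercept + slope * p` fitted by least squares on the firm's own data, followed by maximization of `(p - cost) * d` over a price interval. The symbols, function names, and clipping bounds are assumptions of this sketch rather than the notation of Baek et al.

```python
def fit_own_price_demand(prices, demands):
    """Least-squares fit of demand on the firm's own price only."""
    n = len(prices)
    p_bar = sum(prices) / n
    d_bar = sum(demands) / n
    var_p = sum((p - p_bar) ** 2 for p in prices)
    cov_pd = sum((p - p_bar) * (d - d_bar) for p, d in zip(prices, demands))
    slope = cov_pd / var_p
    intercept = d_bar - slope * p_bar
    return intercept, slope

def myopic_price(intercept, slope, cost=0.0, p_min=0.0, p_max=float("inf")):
    """Maximize (p - cost) * (intercept + slope * p) over [p_min, p_max]."""
    if slope >= 0:
        raise ValueError("estimated demand must be downward sloping")
    b = -slope
    # First-order condition: intercept - 2 * b * p + b * cost = 0.
    p = (intercept + b * cost) / (2 * b)
    return min(max(p, p_min), p_max)
```

With prices `[1, 2, 3]` and demands `[8, 6, 4]` the fit gives intercept 10 and slope -2, so the zero-cost myopic price is 2.5 and a unit cost of 1 raises it to 3.0. A flatter estimated slope raises the price, which is the channel through which the estimation bias discussed next acts.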

The central mechanism is omitted-variable bias. True demand depends on rivals’ prices,

$\Delta_i=\mu^*-\mu_i$ 8

but firms estimate a one-dimensional own-price demand curve instead (Baek et al., 15 May 2026). Positive cross-firm price covariance makes $\Delta_i=\mu^*-\mu_i$ 9 less negative, so perceived demand is less elastic and the resulting myopic markup rises (Baek et al., 15 May 2026).
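The bias can be reproduced in a small simulation. In the sketch below, demand has a true own-price slope `own` and a positive cross-price effect `cross`, and the rival's price is given correlation `rho` with the firm's own price. Regressing demand on own price alone then yields a slope close to `own + cross * rho` when both prices have the same variance. All parameter names and values are hypothetical, and the correlation is imposed directly rather than generated by the learning dynamics described in the paper.

```python
import random

def own_price_slope(n=20000, own=-2.0, cross=1.0, rho=0.6, seed=0):
    """OLS slope of demand on own price when the rival's price is omitted."""
    rng = random.Random(seed)
    prices, demands = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        p_own = 5.0 + z1
        # Rival price with unit variance and correlation rho with p_own.
        p_rival = 5.0 + rho * z1 + (1.0 - rho ** 2) ** 0.5 * z2
        demand = 20.0 + own * p_own + cross * p_rival + rng.gauss(0.0, 0.1)
        prices.append(p_own)
        demands.append(demand)
    p_bar = sum(prices) / n
    d_bar = sum(demands) / n
    cov = sum((p - p_bar) * (d - d_bar) for p, d in zip(prices, demands))
    var = sum((p - p_bar) ** 2 for p in prices)
    return cov / var
```

With the default values the estimated slope is near -1.4 rather than the true -2.0, so the firm perceives demand as less elastic than it is; setting `rho=0.0` removes the bias.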

The fluid-limit ODE tracks running means $R_T=\mathbb{E}\!\left[\sum_{t=1}^T(\mu^*-\mu_{A_t})\right] =\sum_{i\neq i^*}\Delta_i\,\mathbb{E}[N_i(T)].$ 0 and accumulated centered covariances $R_T=\mathbb{E}\!\left[\sum_{t=1}^T(\mu^*-\mu_{A_t})\right] =\sum_{i\neq i^*}\Delta_i\,\mathbb{E}[N_i(T)].$ 1: $R_T=\mathbb{E}\!\left[\sum_{t=1}^T(\mu^*-\mu_{A_t})\right] =\sum_{i\neq i^*}\Delta_i\,\mathbb{E}[N_i(T)].$ 2

$R_T=\mathbb{E}\!\left[\sum_{t=1}^T(\mu^*-\mu_{A_t})\right] =\sum_{i\neq i^*}\Delta_i\,\mathbb{E}[N_i(T)].$ 3

The induced posted-price map is

$R_T=\mathbb{E}\!\left[\sum_{t=1}^T(\mu^*-\mu_{A_t})\right] =\sum_{i\neq i^*}\Delta_i\,\mathbb{E}[N_i(T)].$ 4

with

$R_T=\mathbb{E}\!\left[\sum_{t=1}^T(\mu^*-\mu_{A_t})\right] =\sum_{i\neq i^*}\Delta_i\,\mathbb{E}[N_i(T)].$ 5

The paper states that this map decomposes into the true best response

$R_T=\mathbb{E}\!\left[\sum_{t=1}^T(\mu^*-\mu_{A_t})\right] =\sum_{i\neq i^*}\Delta_i\,\mathbb{E}[N_i(T)].$ 6

plus an upward bias term driven by $R_T=\mathbb{E}\!\left[\sum_{t=1}^T(\mu^*-\mu_{A_t})\right] =\sum_{i\neq i^*}\Delta_i\,\mathbb{E}[N_i(T)].$ 7 (Baek et al., 15 May 2026).

The equilibrium benchmarks are

$R_T=\mathbb{E}\!\left[\sum_{t=1}^T(\mu^*-\mu_{A_t})\right] =\sum_{i\neq i^*}\Delta_i\,\mathbb{E}[N_i(T)].$ 8

If the exploration means lie in the best-response cones

$R_T=\mathbb{E}\!\left[\sum_{t=1}^T(\mu^*-\mu_{A_t})\right] =\sum_{i\neq i^*}\Delta_i\,\mathbb{E}[N_i(T)].$ 9

then the ODE yields supra-competitive terminal prices above $\sum_{i\neq i^*} 2/\Delta_i$ 0 under the conditions stated in the paper (Baek et al., 15 May 2026). Under symmetric exploration $\sum_{i\neq i^*} 2/\Delta_i$ 1 and $\sum_{i\neq i^*} 2/\Delta_i$ 2, the limit result implies that exploring clustered below Nash or clustered above monopoly pushes the committed dynamic to monopoly-level prices, while exploring in the band between Nash and monopoly locks in that band (Baek et al., 15 May 2026).

Simulations calibrated to a Boston multifamily rental market further show that supra-competitive convergence is frequent across a wide range of exploration means, dispersions, and horizons; prices are closest to Nash when exploration is clustered around the Nash vector, while exploration clustered away from Nash produces a U-shaped pattern in terminal rents, with upper tails approaching monopoly levels (Baek et al., 15 May 2026). A plausible implication is that DETC in this institutional sense is less a performance-optimizing learning design than a market-structure hazard when misspecification and synchronized commitment interact.

6. Terminological scope, misconceptions, and research directions

The literature uses the same label for distinct but structurally related ideas. In bandits, DETC is a specific non-fully-sequential algorithmic family with two exploration and two commitment stages and asymptotically optimal regret guarantees (Jin et al., 2020). In decentralized matching, it names a double-sided exploration-and-commit architecture implemented by CA-ETC (Pagare et al., 2024). In algorithmic pricing, it is the multi-firm institutional analogue of explore-then-exploit, where multiple firms enter the commit phase simultaneously under misspecified demand estimation (Baek et al., 15 May 2026). The shared kernel is separation between exploratory data collection and a subsequent commitment regime; the operational meaning of “double” changes across domains.

One common misunderstanding is to treat DETC as universally superior to fully sequential methods. The evidence is domain-specific. In bandits, the specialized 2020 DETC construction matches asymptotically optimal constants while keeping constant round complexity in batched variants (Jin et al., 2020). But the Gaussian ETC lower-bound analysis shows that any policy with a finite final commitment time and no corrective structure beyond that remains fundamentally suboptimal by a factor of 2 in the leading logarithmic constant (Garivier et al., 2016). The positive result is therefore architectural, not purely terminological.

A second misconception is that DETC is intrinsically benign because it lacks explicit coordination. The pricing results show the opposite: no explicit punishment, communication, or synchronization beyond common exploration and simultaneous commitment is needed for misspecified learning dynamics to generate sustained supra-competitive prices (Baek et al., 15 May 2026). The mechanism is covariance-driven omitted-variable bias, not overt collusion (Baek et al., 15 May 2026).

Across these literatures, open directions are explicit. In pricing, the stated directions include multi-period strategic interactions beyond myopic commitment, alternative learning algorithms such as policy-gradient and actor-critic under misspecification, information sharing and transparency policies, and broader supermodular games beyond pricing (Baek et al., 15 May 2026). In matching, the results emphasize the cost of achieving decentralization and two-sided learning without communication or structural assumptions, suggesting further work on sharper lower bounds and alternative communication primitives (Pagare et al., 2024). In bandits, the remaining limitation identified in the 2020 paper is that simultaneous instance-optimal and asymptotically optimal DETC for general $K$-armed problems is left open, even though asymptotic optimality alone is established (Jin et al., 2020).

Taken together, DETC is best understood as a design pattern with sharply different consequences across settings. In stochastic bandits it can recover optimal asymptotic efficiency without full sequential adaptivity (Jin et al., 2020). In decentralized matching it organizes symmetric two-sided learning and stable commitment under stringent communication constraints (Pagare et al., 2024). In multi-firm pricing, the same structural separation between exploration and commitment can transform estimation misspecification into systematic price elevation above Nash and, under symmetric exploration, toward monopoly levels (Baek et al., 15 May 2026).