
Symmetric Bandits

Updated 25 December 2025
  • Symmetric bandits are decision frameworks that exploit balanced noise, invariant reward functions, or inter-arm symmetries to optimize sequential decision-making.
  • They utilize specialized methods such as clipped-SGD-UCB, α-Thompson Sampling, and group-invariant approaches to handle heavy-tailed and structurally complex scenarios.
  • By harnessing symmetry, these models achieve near-sub-Gaussian regret bounds and scalability, making them effective in high-dimensional and resource-constrained environments.

Symmetric bandits are classes of sequential decision problems in which the reward structure, the observation process, or the constraints exhibit symmetry under specified transformations or relationships between arms, noise, or parameters. Symmetry can manifest in the reward distributions (as in symmetric heavy-tailed or stable laws), in the reward function (as invariance to group actions), in the structure of inter-arm influences, or in resource constraints (as in knapsacks). Leveraging such symmetry can dramatically affect regret optimality, algorithmic design, problem complexity, and theoretical rates.

1. Symmetric Noise Models in Stochastic Multi-Armed Bandits

Symmetric noise models constitute a principal setting in bandit theory where the reward process for each arm admits i.i.d. additive symmetric noise. Specifically, given $K$ arms, pulling arm $i$ at time $t$ yields

$$X_{i}^{t} = \mu_{i} + \xi_{i}^{t},$$

where $\mu_{i} \in \mathbb{R}$ is the mean reward, and the noise variables $\xi_{i}^{t}$ are i.i.d. with symmetric density $\rho(u) = \rho(-u)$ and prescribed heavy-tailedness, e.g., $\mathbb{E}[|\xi_{i}^{t}|^{1+\alpha}] \leq \sigma^{1+\alpha}$ for some $\alpha > 0$ or, in extreme regimes, even with $\alpha < 0$ so that no positive moments are finite.
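As a minimal numerical illustration of why symmetry matters (not an algorithm from the cited work): standard Cauchy noise is symmetric but has no finite mean, so the empirical mean fails to concentrate, while the sample median, which symmetry pins to $\mu_i$, concentrates sharply.

```python
import numpy as np

rng = np.random.default_rng(0)

# Symmetric super-heavy-tailed noise: the standard Cauchy density satisfies
# rho(u) = rho(-u) but has no finite mean.
mu = 1.5                       # illustrative true mean reward of one arm
rewards = mu + rng.standard_cauchy(10_000)

# The empirical mean does not concentrate (no first moment exists)...
print("mean estimate:  ", rewards.mean())
# ...but symmetry makes mu the median, so the sample median does concentrate.
print("median estimate:", np.median(rewards))
```

The median is only one way to cash in on symmetry; the clipped-SGD estimators discussed below achieve the same effect while supporting regret analysis.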

A central insight is that the symmetry of $\xi_{i}^{t}$ can be exploited for sharp concentration, even when the noise is super-heavy-tailed. In particular, by pairing this structure with carefully designed estimators (e.g., clipped stochastic gradient descent, or clipped SGD), one can achieve instance-wise and worst-case regret bounds that scale as

$$O\left(\ln T \sqrt{KT\ln T}\right),$$

instead of the slower, heavy-tail-limited rates $O\left(T^{1/(1+\alpha)} K^{\alpha/(1+\alpha)}\right)$. Notably, these rates match sub-Gaussian bounds up to logarithmic factors and remain valid even when the noise law lacks an expectation, provided only symmetry remains (Dorn et al., 2024).

Clipped-SGD-UCB algorithms, accordingly, combine first-order convex-optimization-based UCB with robust gradient estimation under symmetric noise. They update arm estimates through clipped SGD steps, using adaptive step-sizes and exponentially decaying clipping thresholds, ultimately producing UCB indices based on the inexact optimization suboptimality. Theoretical bounds then follow from high-probability guarantees on SGD convergence, which are enabled by the symmetry but do not require higher moments.
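The estimation step underlying such methods can be sketched as follows. This is a schematic clipped-SGD location estimator under symmetric noise; the step sizes and exponential clipping schedule are illustrative choices, not the tuned parameters of the actual algorithm.

```python
import numpy as np

def clipped_sgd_mean(samples, clip0=10.0, decay=0.999):
    """Estimate a location parameter by clipped SGD on the quadratic loss
    (theta - x)^2 / 2. Clipping the gradient tames heavy tails, and symmetry
    of the noise keeps the clipped update unbiased at theta = mu.

    Schematic sketch: step-size and clipping schedules are illustrative.
    """
    theta = 0.0
    lam = clip0
    for k, x in enumerate(samples, start=1):
        grad = np.clip(theta - x, -lam, lam)  # clipped gradient of the loss
        theta -= grad / k                     # Robbins-Monro step size 1/k
        lam *= decay                          # exponentially decaying threshold
    return theta

rng = np.random.default_rng(0)
# Cauchy noise: symmetric, no finite mean; the clipped estimate still
# recovers the location parameter 1.5.
print(clipped_sgd_mean(1.5 + rng.standard_cauchy(2_000)))
```

In the full algorithm, such an estimate feeds a UCB index via the high-probability suboptimality of the SGD iterate rather than a raw empirical mean.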

2. Symmetric α-Stable and Other Heavy-Tailed Bandit Models

Symmetry plays a crucial role when arms' rewards are governed by symmetric $\alpha$-stable distributions $S_\alpha(0, \sigma, \mu)$, with characteristic exponent $\alpha \in (0,2]$, particularly notable for $1 < \alpha < 2$, where means exist but variances do not. Such distributions model extreme-event-dominated environments, including financial returns, human behavioral data, and network traffic (Dubey et al., 2019).

In the symmetric $\alpha$-stable bandit, the reward process is stable under aggregation:

  • For $n$ i.i.d. pulls, the empirical mean remains $\alpha$-stable with rescaled dispersion,
  • Only moments of order $p < \alpha$ are defined, with heavier tails as $\alpha \to 1$.

Algorithmic frameworks such as $\alpha$-Thompson Sampling ($\alpha$-TS) adapt Bayesian posterior sampling to such settings. By leveraging the symmetric scale-mixture-of-normals (SMiN) representation, posterior updates can be sampled efficiently and admit regret analysis. Regret bounds for $\alpha$-TS scale polynomially in time: $\mathrm{BR}(T) = O\left(K^{1/(1+\epsilon)} T^{(1+\epsilon)/(1+2\epsilon)}\right)$ for $\epsilon \in (0, \alpha-1)$, while a robustified version using truncated means matches the minimax rate up to logarithmic factors,

$$\widetilde{O}\left((KT)^{1/(1+\epsilon)}\right).$$

Empirical comparisons confirm superior robustness to heavy tails relative to UCB or naive Gaussian-based methods (Dubey et al., 2019).
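The ingredients can be sketched as follows, assuming the standard Chambers–Mallows–Stuck sampler for symmetric $\alpha$-stable variates and a truncation threshold chosen purely for illustration (the robustified estimator in the paper sets its truncation level differently):

```python
import numpy as np

def sym_alpha_stable(alpha, size, rng):
    """Symmetric alpha-stable variates (beta = 0, unit scale) via the
    standard Chambers-Mallows-Stuck method."""
    V = rng.uniform(-np.pi / 2, np.pi / 2, size)
    W = rng.exponential(1.0, size)
    return (np.sin(alpha * V) / np.cos(V) ** (1 / alpha)
            * (np.cos(V - alpha * V) / W) ** ((1 - alpha) / alpha))

def truncated_mean(x, b):
    """One simple robust estimator: drop samples with |x| > b, then average.
    The threshold b is an illustrative choice, not the paper's schedule."""
    kept = x[np.abs(x) <= b]
    return kept.mean() if kept.size else 0.0

rng = np.random.default_rng(1)
alpha, mu = 1.5, 2.0          # 1 < alpha < 2: mean exists, variance does not
x = mu + sym_alpha_stable(alpha, 50_000, rng)
print(truncated_mean(x, b=25.0))   # close to mu despite the heavy tails
```

Within α-TS, rewards like these drive SMiN posterior updates; the truncated mean is the robustification that recovers the minimax rate.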

3. Symmetry Through Invariance, Group Actions, or Duality

Symmetry in bandits also arises from reward invariance under known or unknown group actions. In high-dimensional linear bandits, "hidden symmetry" describes settings where the reward is invariant to unspecified transformations: formally, for some subgroup $\mathcal{G}$ of the permutation group $S_d$, the reward $f(x) = \mathbb{E}[y \mid x] = \langle x, \theta_\star \rangle$ is invariant under $g \cdot x$ for all $g \in \mathcal{G}$, and the unknown parameter $\theta_\star$ lies in the fixed-point subspace

$$\mathrm{Fix}_{\mathcal{G}} = \{ \theta \in \mathbb{R}^{d} : A_{g} \theta = \theta, \ \forall g \in \mathcal{G} \}.$$

Model-selection-based algorithms can learn the correct symmetry online, achieving $O\left(d_0^{2/3} T^{2/3} \log d\right)$ regret and, under well-separated group partitions, $O\left(d_0 \sqrt{T \log d}\right)$, where $d_0 \ll d$ (Tran et al., 2024).
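The fixed-point subspace has a concrete computational handle: for a group acting by permutation matrices, averaging over the group action, $P = \frac{1}{|\mathcal{G}|}\sum_{g} A_g$, is the orthogonal projector onto $\mathrm{Fix}_{\mathcal{G}}$. A minimal sketch, using a hypothetical symmetry group (cyclic shifts of the first three coordinates) chosen only for illustration:

```python
import numpy as np

def project_fixed_point(theta, group_perms):
    """Orthogonal projection of theta onto Fix_G: for a group G acting by
    permutation matrices, P = (1/|G|) * sum_g A_g averages over the orbit."""
    return np.mean([theta[p] for p in group_perms], axis=0)

# Hypothetical example: rewards invariant under cyclic shifts of the first
# three coordinates (a 3-element cyclic subgroup of S_5).
perms = [np.array([0, 1, 2, 3, 4]),
         np.array([1, 2, 0, 3, 4]),
         np.array([2, 0, 1, 3, 4])]
theta = np.array([3.0, 6.0, 0.0, 1.0, -2.0])
print(project_fixed_point(theta, perms))  # first three coords averaged to 3.0
```

Estimating within the projected subspace is what reduces the effective dimension from $d$ to $d_0$.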

Similarly, in continuum-armed bandits over metric spaces, if $f(x)$ is invariant under a finite group $G$ of isometries ($f(g \cdot x) = f(x)$ for all $g \in G$), side-observation algorithms such as UniformMesh-N exploit orbits to achieve minimax regret scaled as $|G|^{-1/(d+2)} N^{(d+1)/(d+2)}$, a power improvement over non-symmetric Lipschitz bandits (Tran et al., 2022).

Symmetry by duality appears in constrained bandit problems such as bandits with knapsacks (BwK), where resource allocation is modeled via coupled linear programs. Primal and dual LPs exhibit a symmetry between arms and knapsacks (resources): removing an arm or imposing a penalty on a knapsack’s leftover can be analyzed through leave-one-out/penalized LPs that admit analogous slackness measures. This duality enables a two-phase algorithm to achieve the first problem-dependent logarithmic regret bound for general BwK (Li et al., 2021).

4. Influence, Interaction, and Structured Symmetry

In advanced bandit settings, symmetry may govern the structure of temporal or inter-arm dependencies. The influential bandit framework assumes that the environment evolves through a symmetric positive semidefinite (PSD) interaction matrix $A = A^{\top} \succeq 0$: pulling arm $i^{(t)}$ at time $t$ updates every arm $j$'s loss via $l^{(t+1)}_{j} = l^{(t)}_{j} + A_{i^{(t)}, j}$. The expected cumulative loss is then

$$\mathcal{L}(n) = n^{\top} l^{(1)} + \frac{1}{2} n^{\top} A n,$$

where $n$ is the vector of arm pull counts. The symmetry ensures that only the pull counts matter, not the order of pulls. Standard UCB or LCB algorithms incur superlinear regret ($\Omega(T^{2}/\log^{2} T)$) because they fail to exploit this structure, whereas specialized lower-confidence algorithms that account for the convexity and symmetry of $A$ achieve nearly optimal $O(KT \log T)$ regret, provided $A$ remains symmetric (Sato et al., 2025).

Such symmetry enables the use of convex-analytic tools (Frank–Wolfe analysis) for regret decomposition and guarantees, highlighting the critical leverage provided by structural symmetry.
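The order-invariance claim can be checked directly. The sketch below adopts the convention that a pull incurs the arm's current loss before the update; under this bookkeeping the sequential total equals $n^{\top} l^{(1)} + \frac{1}{2}\left(n^{\top} A n - \sum_i n_i A_{ii}\right)$, which matches the quadratic form above up to the diagonal self-influence term.

```python
import numpy as np

def total_loss(A, l1, pulls):
    """Accumulate loss sequentially: pulling arm i incurs that arm's current
    loss, then every arm j's loss shifts by A[i, j]."""
    l = l1.astype(float).copy()
    total = 0.0
    for i in pulls:
        total += l[i]
        l += A[i]
    return total

rng = np.random.default_rng(2)
K = 4
B = rng.normal(size=(K, K))
A = B @ B.T                            # symmetric PSD interaction matrix
l1 = rng.normal(size=K)

pulls = rng.integers(0, K, size=30)
shuffled = rng.permutation(pulls)      # same pull counts, different order

# With symmetric A, only the pull counts matter, not the order.
print(np.isclose(total_loss(A, l1, pulls), total_loss(A, l1, shuffled)))  # True
```

This count-only dependence is exactly what lets the regret analysis work on $n$ directly with convex (Frank–Wolfe-style) tools.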

5. Symmetric Two-Armed Bandits: Minimax and Information-Theoretic Analysis

The simplest instance of symmetric bandits, the two-armed symmetric Bernoulli bandit with $\mu_1 + \mu_2 = 1$, allows for explicit characterization of optimal regret. In this scenario, the myopic policy that selects the arm appearing better "so far" is provably minimax-optimal. The regret analysis reduces to a parabolic PDE (a linear heat equation) with a discontinuous source, which delivers sharp asymptotics and non-asymptotic bounds:

  • For vanishing gap ($\Delta \ll T^{-1/2}$): $R_T^* \sim 0.564 \sqrt{T}$
  • For large gap ($\Delta \gg T^{-1/2}$): $R_T^* \sim 1/\Delta$

This collapse of the exploration-exploitation tradeoff is a direct consequence of the symmetry constraint on the reward structure, which permits a collective state-variable reduction (Kobzar et al., 2022).
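A simulation sketch of the myopic rule (the tie-breaking and bookkeeping are illustrative choices, not taken from the paper): under $\mu_1 + \mu_2 = 1$, a success on one arm and a failure on the other carry identical evidence, so a single scalar statistic suffices.

```python
import numpy as np

def myopic_symmetric(mu1, T, rng):
    """Greedy/myopic play in the symmetric two-armed Bernoulli bandit with
    mu1 + mu2 = 1. Symmetry makes a success on arm 0 and a failure on arm 1
    identical evidence that arm 0 is better, so one scalar S summarizes the
    ordering. Ties are broken uniformly at random (illustrative choice)."""
    mu = np.array([mu1, 1.0 - mu1])
    best = int(np.argmax(mu))
    S = 0                       # net evidence that arm 0 is the better arm
    pulls_best = 0
    for _ in range(T):
        arm = int(rng.integers(2)) if S == 0 else (0 if S > 0 else 1)
        reward = rng.random() < mu[arm]
        S += (1 if reward else -1) * (1 if arm == 0 else -1)
        pulls_best += int(arm == best)
    return pulls_best / T

rng = np.random.default_rng(3)
print(myopic_symmetric(0.7, 5_000, rng))  # fraction of pulls on the better arm
```

Because $S$ drifts toward the better arm at rate $\Delta$ regardless of which arm is pulled, the policy quickly locks on, consistent with the $O(1/\Delta)$ regret in the large-gap regime.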

Information-directed sampling (IDS) policies in symmetric two-armed bandits also coincide with this myopic/greedy rule: due to exchangeability of information gain across arms, IDS always selects the optimal arm with respect to the current posterior. The total discounted regret for IDS in the symmetric setting is uniformly bounded, $\mathcal{R}^* = 1/(2\delta)$ for bias $\delta = 2\theta - 1$, while UCB and Thompson sampling without symmetry-awareness incur logarithmic regret unless they are explicitly modified to encode the structure (Hirling et al., 2025).

6. Broader Impact and Applied Domains

Symmetries in bandit problems—whether in noise, reward function, model structure, or constraints—enable algorithms to:

  • Achieve regret rates otherwise possible only under stronger tail conditions or model assumptions
  • Reduce the effective dimension or complexity of inference
  • Exploit structured feedback, side observations, or duality for resource allocation

Applications include heavy-tailed finance and economics modeling, robust machine learning under outliers or heavy tails, resource-constrained decision processes, high-dimensional inference under permutation or group invariance, and online recommendation systems with structured interactions. In all such problems, explicit or hidden symmetry, when recognized and exploited, affords substantial theoretical and practical advantages.


Table: Key Symmetric Bandit Models and Guarantees

| Symmetry Type | Core Problem Structure | Regret Bound (Representative) |
|---|---|---|
| Symmetric heavy-tailed noise | i.i.d. noise, $\rho(u) = \rho(-u)$ | $O(\ln T \sqrt{KT\ln T})$ (Dorn et al., 2024) |
| Symmetric $\alpha$-stable | $S_\alpha(0,\sigma,\mu)$, $1<\alpha<2$ | $O(K^{1/(1+\epsilon)}T^{(1+\epsilon)/(1+2\epsilon)})$ |
| Linear bandits w/ hidden symmetry | $f(x)$ invariant under unknown group | $O(d_0^{2/3}T^{2/3}\log d)$; $O(d_0\sqrt{T\log d})$ |
| Invariant Lipschitz bandits | $f(g \cdot x) = f(x)$ under $G$ | $\Theta(|G|^{-1/(d+2)}N^{(d+1)/(d+2)})$ |
| Influential bandits | Loss update via symmetric PSD $A$ | $O(KT\log T)$ (Sato et al., 2025) |
| Symmetric 2-armed Bernoulli | $\mu_1+\mu_2=1$ constraint | $R_T^* \sim 0.564\sqrt{T}$ (small gap) (Kobzar et al., 2022) |

7. Theoretical and Practical Significance

Symmetric bandits underscore the advantage of tailoring algorithmic strategies to underlying structure. Theoretical analysis reveals that symmetry can serve as a weaker, yet sufficient, substitute for higher-order moment bounds, yielding near-sub-Gaussian performance in settings where naive algorithms would otherwise fail or be suboptimal. Practical algorithms, including clipped-SGD-UCB, $\alpha$-Thompson Sampling, group-invariant mesh methods, dual-formulation resource allocators, and symmetry-exploiting LCB algorithms, demonstrate the translation of symmetry from abstract model property to tangible statistical leverage across a broad range of domains.
