AdamCB: Combinatorial Bandit Sampling in Adam
- The paper introduces AdamCB, integrating Adam’s adaptive learning rates with combinatorial bandit sampling to focus on high-gradient samples for faster convergence.
- It employs DepRound for selecting K distinct samples per iteration, thereby reducing gradient estimation variance and achieving robust theoretical regret bounds.
- Empirical evaluations on MNIST, Fashion-MNIST, and CIFAR-10 validate AdamCB’s superior stability and accelerated convergence compared to uniform-sampling optimizers.
Adam with Combinatorial Bandit Sampling (AdamCB) is an advanced stochastic optimization algorithm integrating the adaptive learning-rate strategy of Adam with a combinatorial bandit-based batch selection mechanism. By adaptively focusing on data points with large gradient norms, AdamCB achieves improved convergence rates and empirical performance in deep learning training compared to standard uniform-sampling optimizers or previous bandit-based Adam variants. Combinatorial bandit mechanisms process feedback from multiple samples simultaneously, adjusting the sampling distribution to favor informative examples and reducing the variance of gradient estimates (Kim et al., 7 Dec 2025).
1. Algorithmic Structure and Workflow
AdamCB augments the classical Adam update by introducing an adaptive non-uniform mini-batch selection scheme governed by combinatorial bandit feedback. At each iteration t, the mechanism proceeds as follows:
- Weight Maintenance: Each training sample i ∈ [n] is assigned a weight w_{i,t}, initialized to w_{i,0} = 1.
- Probability Assignment: Weights are transformed into probabilities p_{i,t} so that their sum equals the batch size K; weights exceeding a cap are treated via a null-set and thresholding.
- Combinatorial Batch Sampling: Using DepRound, K distinct indices J_t are selected without replacement, honoring the target marginal probabilities p_{i,t}.
- Gradient Estimation: For each sampled i ∈ J_t, the per-example gradient g_{i,t} is computed. The unbiased batch gradient is constructed as g_t = (1/K) Σ_{i∈J_t} g_{i,t} / p_{i,t}.
- Adam Moment Updates: Standard Adam steps with adaptive first/second moment estimation, bias correction, and parameter updates are performed.
- Weight Update: Observed gradient magnitudes produce a “loss” for each sample: ℓ_{i,t} = ‖g_{i,t}‖₂² / p_{i,t}² if i was sampled, 0 otherwise. Weights are exponentially reweighted via w_{i,t} = w_{i,t−1} · exp(−(ηK/n) ℓ_{i,t}), driving future sampling preference toward samples with high gradients.
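The probability-assignment step can be sketched as follows. This is a simplified stand-in for the paper's null-set/threshold treatment: `weights_to_probs`, the fixed `cap`, and the iterative redistribution loop are illustrative choices, not the paper's exact procedure.

```python
import numpy as np

def weights_to_probs(w, K, cap=0.9):
    """Map positive sample weights to marginal probabilities p_i with
    sum(p) == K and each p_i <= cap. Probabilities that would exceed the
    cap are pinned there and the remaining mass is redistributed over
    the free indices (assumes K <= cap * n so a valid solution exists)."""
    w = np.asarray(w, dtype=float)
    p = K * w / w.sum()
    capped = np.zeros(len(w), dtype=bool)
    while p[~capped].size and (p[~capped] > cap).any():
        capped |= p > cap
        free = ~capped
        p[capped] = cap
        p[free] = (K - cap * capped.sum()) * w[free] / w[free].sum()
    return p
```

For uniform weights this reduces to p_i = K/n; for a heavily weighted sample it pins that sample at the cap and spreads the rest of the mass proportionally.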
The entire process is summarized in the following pseudocode, utilizing established notation for Adam parameters (see section details for explicit forms):
```
Input: n, K, η, {α_t}, {β_{1,t}}, β_2, ε > 0
Initialize θ_0 ∈ ℝ^d, m_0 ← 0, v_0 ← 0, v̂_0 ← 0, w_{i,0} ← 1 ∀ i
for t = 1…T:
    # Batch Selection
    adjust weights, enforce thresholds, and form probabilities p_{i,t}
    J_t ← DepRound({p_{i,t}}, K)
    # Gradient Estimate
    for each i ∈ J_t: g_{i,t} ← ∇_θ ℓ(θ_{t-1}; x_i, y_i)
    g_t ← (1/K) Σ_{i∈J_t} g_{i,t} / p_{i,t}
    # Adam Moment Updates
    m_t ← β_{1,t} m_{t-1} + (1 − β_{1,t}) g_t
    v_t ← β_2 v_{t-1} + (1 − β_2) g_t ⊙ g_t
    m̂_t ← m_t / (1 − β_{1,t}^t);  v̂_t ← max(v̂_{t-1}, v_t / (1 − β_2^t))
    # Parameter Update
    θ_t ← θ_{t-1} − α_t m̂_t / (√v̂_t + ε)
    # Weight Update
    for each i ∈ [n]:
        ℓ_{i,t} ← ‖g_{i,t}‖₂² / p_{i,t}²  if i ∈ J_t,  else 0
        w_{i,t} ← w_{i,t-1} · exp(−(ηK/n) ℓ_{i,t})
Return θ_T
```
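The Adam moment and parameter updates in the pseudocode (including the AMSGrad-style running max on the bias-corrected second moment) translate directly to NumPy. `adam_update` below is a generic sketch of those lines, not the authors' implementation:

```python
import numpy as np

def adam_update(theta, g, m, v, vhat, t, alpha=1e-3, beta1=0.9,
                beta2=0.999, eps=1e-8):
    """One moment/parameter update matching the pseudocode: EMA moments,
    bias correction, AMSGrad-style max on v-hat, then the step."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    vhat = np.maximum(vhat, v / (1 - beta2 ** t))
    theta = theta - alpha * m_hat / (np.sqrt(vhat) + eps)
    return theta, m, v, vhat
```

Driving this with the importance-weighted batch gradient g_t in place of `g` yields the AdamCB step; with uniform sampling it reduces to AMSGrad.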
2. Combinatorial Bandit Framework
AdamCB generalizes multi-armed bandit (MAB) sampling paradigms to a combinatorial regime. Rather than pulling a single arm, the algorithm chooses a subset of arms (samples) per round. Bandit feedback is collected for each sampled element, and the batch-selection mechanism is designed to minimize cumulative regret relative to the best static subset of examples.
Key technical ingredients:
- Fractional Allocation: Probabilities p_{i,t} are computed from the weights w_{i,t−1}, capped by a threshold, and adjusted so that Σ_{i=1}^n p_{i,t} = K.
- Dependent Rounding (DepRound): Guarantees that the selection satisfies the marginal constraints P(i ∈ J_t) = p_{i,t}, with all selected indices distinct, and runs in O(n) time.
- Semi-bandit Feedback: The loss observed for each sampled arm i ∈ J_t is ℓ_{i,t} = ‖g_{i,t}‖₂² / p_{i,t}².
- EXP3-Style Update: Weights are updated multiplicatively, w_{i,t} = w_{i,t−1} · exp(−(ηK/n) ℓ_{i,t}), aligning future probabilities with informative samples.
The framework enables AdamCB to efficiently process feedback from all batch elements, not just single samples, and thereby capitalize on the variance reduction and adaptive sampling advantages of the combinatorial setting.
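A minimal DepRound sketch, following the standard dependent-rounding scheme: repeatedly pick two fractional coordinates, shift probability mass between them so that at least one becomes integral, with branch probabilities chosen to preserve the marginals. This simplified version rebuilds its index list each round, so it is not the linear-time variant; it illustrates the mechanism only.

```python
import numpy as np

def depround(p, rng=None):
    """Round marginals p (with integer sum K) to a subset of exactly K
    distinct indices while preserving each marginal P(i selected) = p_i."""
    rng = np.random.default_rng() if rng is None else rng
    p = np.array(p, dtype=float)
    tol = 1e-12
    frac = [i for i, x in enumerate(p) if tol < x < 1 - tol]
    while len(frac) >= 2:
        i, j = frac[0], frac[1]
        a = min(1 - p[i], p[j])   # mass movable from j to i
        b = min(p[i], 1 - p[j])   # mass movable from i to j
        if rng.random() < b / (a + b):
            p[i] += a; p[j] -= a  # one of p_i, p_j hits 1 or 0
        else:
            p[i] -= b; p[j] += b
        frac = [k for k in frac if tol < p[k] < 1 - tol]
    return sorted(i for i, x in enumerate(p) if x > 1 - tol)
```

Because each branch leaves E[p_i] unchanged and makes at least one coordinate integral, the returned subset always has size K and matches the requested marginals.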
3. Regret Bounds and Convergence Analysis
The theoretical foundation for AdamCB centers on analysis of cumulative regret with respect to the best parameter choice for the whole dataset. AdamCB achieves improved convergence rates over both uniform-sampling Adam and the single-arm bandit variant AdamBS (Liu et al., 2020).
- Assumptions: (A1) Bounded gradient norms, ‖g_t‖ ≤ G for all t; (A2) bounded parameter drift, ‖θ_t − θ_s‖ ≤ D for all t, s.
- Main Regret Bound: The explicit bound (see Kim et al., 7 Dec 2025) depends on the parameter dimension d, the gradient bound G, the learning rate α, the number of rounds T, the dataset size n, and the batch size K.
- Interpretation: One term of the bound aligns with uniform Adam's O(√T) regret. The other embodies the benefit of combinatorial bandit feedback, yielding an improved exponent in the dependence on n over conventional or non-combinatorial bandit schemes, with further enhancement as K increases.
Supporting lemmas (see (Kim et al., 7 Dec 2025)) delineate online regret decomposition and combinatorial semi-bandit bounds, ensuring that AdamCB's unbiased gradient estimator and adaptive sampling substantially bolster convergence, especially as dataset size and batch size scale.
A plausible implication is that for heterogeneous data, AdamCB can outperform fixed-importance or single-arm bandit sampling by leveraging sample selection diversity and variance reduction.
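The unbiasedness of the importance-weighted estimator can be checked numerically. The sketch below assumes uniform marginals p_i = K/n (so that ordinary sampling without replacement realizes them exactly) and verifies only the estimator's form from the pseudocode: its mean matches (1/K) Σ_i g_i regardless of which subset is drawn.

```python
import numpy as np

# Monte Carlo check that g_t = (1/K) Σ_{i∈J_t} g_{i,t}/p_{i,t} has
# expectation (1/K) Σ_i g_i when the subset marginals are p_i = K/n.
rng = np.random.default_rng(0)
n, K, trials = 10, 3, 200_000
g = rng.normal(size=n)              # stand-in per-example gradients (scalars)
p = K / n                           # uniform marginal for every index
# Each row of J is a uniform random K-subset of [n] (first K of a shuffle).
J = np.argsort(rng.random((trials, n)), axis=1)[:, :K]
est = (g[J] / p).sum(axis=1) / K    # one estimate per trial
print(np.mean(est), g.sum() / K)    # the two values nearly agree
```

With non-uniform marginals the same identity holds, E[g_t] = (1/K) Σ_i p_i · (g_i/p_i) = (1/K) Σ_i g_i, which is what drives the variance-reduction argument.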
4. Empirical Evaluation
Extensive experiments validate AdamCB’s performance across multiple datasets and architectures:
| Dataset | Model Architecture | Relative Performance |
|---|---|---|
| MNIST | MLP (512,256 units) | Fastest training and test loss decay |
| Fashion-MNIST | CNN (2 blocks + FC) | Fastest convergence |
| CIFAR-10 | VGG-style (16 layers) | Best loss after 50 epochs |
- Baselines: Adam (uniform sampling), AMSGrad, AdamX (regret-corrected Adam), AdamBS (bandit sampling).
- Hyperparameters: standard Adam settings for the step size α_t, momentum parameters β_1 and β_2, and stabilizer ε, together with the exploration rate η and batch size K (explicit values are given in Kim et al., 7 Dec 2025).
AdamCB consistently demonstrates accelerated convergence over all tested methods, especially notable in scenarios with pronounced sample heterogeneity. Non-combinatorial bandit sampling (AdamBS) exhibits instability and performance variance, which AdamCB mitigates.
Ablation analyses further show sensitivity to batch size K and the exploration parameter η. AdamCB’s advantage grows with larger K, whereas single-arm bandit methods do not benefit from increased batch size. The optimal η is observed around 0.4, with mild overall sensitivity.
5. Practical Considerations and Implications
AdamCB is designed for large-scale and heterogeneously informative datasets.
- Computational Overhead: DepRound batch selection and weight updates are O(n) per iteration, representing negligible overhead compared to deep neural model forward/backward passes; the total wall-clock increase is bounded by 10%.
- Scalability: The architecture supports datasets with millions of samples through efficient weight and allocation management. All operations are trivially parallelizable across CPU/GPU resources.
- Application Scenarios: AdamCB is especially suitable when
- Gradient magnitude varies markedly across data points (e.g., imbalanced or noisy data)
- Large-batch stochastic approximation is desired but variance control is critical
- Theoretical guarantees combining bandit-based regret and Adam-style convergence are required
- Mechanistic Advantage: Focusing computation on examples with substantial gradients directs optimization resources to challenging or informative instances, thus reducing variance and accelerating descent. The algorithm’s theoretical improvement in convergence exponents with respect to n, K, and T translates into practical gains as demonstrated in the experimental suite.
This suggests AdamCB is particularly advantageous in deep learning contexts characterized by class imbalance, long-tailed data, or regimes in which efficient, variance-controlled optimization is essential.
6. Relationship to Previous Bandit-Based Adam Variants
AdamCB extends and improves upon the non-combinatorial bandit sampling approach of AdamBS (“Adam with Bandit Sampling for Deep Learning”) (Liu et al., 2020), addressing its limited theoretical guarantees and instability with rigorous combinatorial mechanisms. Unlike AdamBS, which samples batch members independently with replacement and uses single-arm feedback, AdamCB employs DepRound for dependent batch selection, aggregates feedback over subsets, and leverages combinatorial semi-bandit theory for regret-bound enhancements. This is reflected both in a refined convergence rate in the key regimes analyzed and in superior empirical performance.
AdamCB thereby unifies the adaptive moment estimation of Adam with a theoretically robust combinatorial batch selection protocol, yielding performance enhancements substantiated both analytically and experimentally (Kim et al., 7 Dec 2025).