
AdamCB: Combinatorial Bandit Sampling in Adam

Updated 14 December 2025
  • The paper introduces AdamCB, integrating Adam’s adaptive learning rates with combinatorial bandit sampling to focus on high-gradient samples for faster convergence.
  • It employs DepRound for selecting K distinct samples per iteration, thereby reducing gradient estimation variance and achieving robust theoretical regret bounds.
  • Empirical evaluations on MNIST, Fashion-MNIST, and CIFAR-10 validate AdamCB’s superior stability and accelerated convergence compared to uniform-sampling optimizers.

Adam with Combinatorial Bandit Sampling (AdamCB) is an advanced stochastic optimization algorithm integrating the adaptive learning-rate strategy of Adam with a combinatorial bandit-based batch selection mechanism. By adaptively focusing on data points with large gradient norms, AdamCB achieves improved convergence rates and empirical performance in deep learning training compared to standard uniform-sampling optimizers or previous bandit-based Adam variants. Combinatorial bandit mechanisms process feedback from multiple samples simultaneously, adjusting the sampling distribution to favor informative examples and reducing the variance of gradient estimates (Kim et al., 7 Dec 2025).

1. Algorithmic Structure and Workflow

AdamCB augments the classical Adam update by introducing an adaptive non-uniform mini-batch selection scheme governed by combinatorial bandit feedback. At each iteration $t$, the mechanism proceeds as follows:

  • Weight Maintenance: Each training sample $i \in [n]$ is assigned a weight $w_{i,t-1}$, initialized to one.
  • Probability Assignment: Weights are transformed into probabilities $\mathbf{p}_t = (p_{1,t}, \dots, p_{n,t})$ so that their sum equals the batch size $K$; weights exceeding a cap are treated via a null set and thresholds.
  • Combinatorial Batch Sampling: Using DepRound, $K$ distinct indices are selected without replacement, honoring the target marginal probabilities $p_{i,t}$.
  • Gradient Estimation: For each sampled $i \in J_t$, the per-example gradient $g_{i,t}$ is computed. The unbiased batch gradient is constructed as $g_t = \frac{1}{K} \sum_{i \in J_t} \frac{g_{i,t}}{p_{i,t}}$.
  • Adam Moment Updates: Standard Adam steps with adaptive first/second moment estimation, bias correction, and parameter updates are performed.
  • Weight Update: Observed gradient magnitudes produce a “loss” for each sample: $\ell_{i,t} = \|g_{i,t}\|_2^2 / p_{i,t}^2$ if $i$ was sampled, $0$ otherwise. Weights are exponentially reweighted via $w_{i,t} = w_{i,t-1} \exp(-(\eta K/n)\,\ell_{i,t})$, driving future sampling preference toward samples with high gradients.

The entire process is summarized in the following pseudocode, utilizing established notation for Adam parameters (see section details for explicit forms):

Input: n, K, η, {α_t}, β_{1,t}, β_2, ε > 0
Initialize θ_0 ∈ ℝ^d, m_0 ← 0, v_0 ← 0, w_{i,0} ← 1 ∀ i
for t = 1 … T:
    # Batch Selection
    Adjust weights, enforce thresholds, and form probabilities p_{i,t}
    J_t ← DepRound({p_{i,t}}, K)
    # Gradient Estimate
    For each i ∈ J_t, compute g_{i,t} = ∇_θ ℓ(θ_{t-1}; x_i, y_i)
    g_t = (1/K) Σ_{i ∈ J_t} [g_{i,t} / p_{i,t}]
    # Adam Moment Updates
    m_t ← β_{1,t} m_{t-1} + (1 − β_{1,t}) g_t
    v_t ← β_2 v_{t-1} + (1 − β_2) g_t ⊙ g_t
    m̂_t ← m_t / (1 − β_{1,t}^t),  v̂_t ← max(v̂_{t-1}, v_t / (1 − β_2^t))
    # Parameter Update
    θ_t ← θ_{t-1} − α_t m̂_t / (√v̂_t + ε)
    # Weight Update
    For each i ∈ [n]:
        ℓ_{i,t} = ‖g_{i,t}‖_2^2 / p_{i,t}^2 if i ∈ J_t else 0
        w_{i,t} ← w_{i,t-1} · exp(−(ηK/n) ℓ_{i,t})
Return θ_T
The combinatorial selection and weight updating are central to the adaptive focus on informative samples.
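The pseudocode translates almost line-for-line into NumPy. The sketch below is illustrative rather than the authors' implementation: the cap/null-set step is reduced to simple clipping, the probability formula follows an Exp3.M-style allocation, and `np.random.choice` without replacement stands in for DepRound.

```python
import numpy as np

def adamcb_sketch(grad_fn, theta, n, K=8, T=100, alpha=1e-3,
                  beta1=0.9, beta2=0.999, eps=1e-8, eta=0.4, rng=None):
    """Minimal AdamCB-style loop for a single parameter vector.

    grad_fn(theta, i) must return the gradient of sample i's loss.
    Simplifications vs. the paper: clipping replaces the cap/null-set
    step, and np.random.choice stands in for DepRound.
    """
    rng = np.random.default_rng() if rng is None else rng
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    v_hat = np.zeros_like(theta)
    w = np.ones(n)  # per-sample bandit weights

    for t in range(1, T + 1):
        # Batch selection: weights -> probabilities with sum ~ K,
        # mixing in eta/n uniform exploration (Exp3.M-style).
        p = K * ((1 - eta) * w / w.sum() + eta / n)
        p = np.minimum(p, 1.0)  # crude stand-in for the cap step
        J = rng.choice(n, size=K, replace=False, p=p / p.sum())

        # Importance-weighted gradient estimate g_t = (1/K) Σ g_i / p_i.
        grads = {i: grad_fn(theta, i) for i in J}
        g = sum(grads[i] / p[i] for i in J) / K

        # Adam moment updates with bias correction and max-v tracking.
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = np.maximum(v_hat, v / (1 - beta2 ** t))
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)

        # EXP3-style weight update from the observed semi-bandit losses.
        loss = np.zeros(n)
        for i in J:
            loss[i] = np.dot(grads[i], grads[i]) / p[i] ** 2
        w = w * np.exp(-(eta * K / n) * loss)
        w = w / w.max()  # rescale for numerical stability
    return theta
```

On a toy least-squares problem (per-sample gradient $\theta - x_i$), this loop drifts toward the sample mean while the bandit weights reallocate sampling probability according to the observed losses.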

2. Combinatorial Bandit Framework

AdamCB generalizes multi-armed bandit (MAB) sampling paradigms to a combinatorial regime. Rather than pulling a single arm, the algorithm chooses a subset $J_t \subset [n]$ of $K$ arms (samples) per round. Bandit feedback is collected for each sampled element, and the batch-selection mechanism is designed to minimize cumulative regret relative to the best static subset of $K$ examples.

Key technical ingredients:

  • Fractional Allocation: Probabilities are computed from weights as $\widetilde{p}_{i,t} = (1-\eta) \frac{w_{i,t-1}}{\sum_j w_{j,t-1}} + \eta \frac{K}{n}$, capped by a threshold and adjusted so that $\sum_i p_{i,t} = K$.
  • Dependent Rounding (DepRound): Guarantees that selection satisfies the marginal constraints $\Pr(i \in J_t) = p_{i,t}$, with all selected indices distinct and sampled in $O(n)$ time.
  • Semi-bandit Feedback: The loss observed for each sampled arm is $\ell_{i,t} = \|g_{i,t}\|^2 / p_{i,t}^2$.
  • EXP3-Style Update: Weights are updated by $w_{i,t} = w_{i,t-1}\exp(-(\eta K / n)\,\ell_{i,t})$, aligning future probabilities with informative samples.

The framework enables AdamCB to efficiently process feedback from all batch elements, not just single samples, and thereby capitalize on the variance reduction and adaptive sampling advantages of the combinatorial setting.
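DepRound itself is compact. The sketch below implements the standard dependent-rounding procedure referenced above (variable names are my own): given fractional marginals $p$ with $\sum_i p_i = K$, it repeatedly couples two fractional entries so that one of them becomes 0 or 1 while each entry's expectation is preserved. Rebuilding the fractional index list each pass keeps the sketch simple; a single linear scan over the entries recovers the $O(n)$ bound.

```python
import numpy as np

def depround(p, rng=None):
    """Dependent rounding: round fractional p (with integer sum K) to a
    subset of exactly K indices such that Pr(i selected) = p[i]."""
    rng = np.random.default_rng() if rng is None else rng
    p = np.asarray(p, dtype=float).copy()

    def fractional():
        return [i for i in range(len(p)) if 1e-12 < p[i] < 1 - 1e-12]

    frac = fractional()
    while len(frac) >= 2:
        i, j = frac[0], frac[1]
        # Largest coupled moves keeping both entries inside [0, 1].
        a = min(1.0 - p[i], p[j])
        b = min(p[i], 1.0 - p[j])
        # Randomize the two moves so each entry's expectation is unchanged.
        if rng.random() < b / (a + b):
            p[i] += a
            p[j] -= a
        else:
            p[i] -= b
            p[j] += b
        frac = fractional()
    return np.flatnonzero(p > 0.5)  # indices rounded to 1
```

Each pass fixes at least one entry at 0 or 1, and because every move leaves $\sum_i p_i = K$ unchanged, the output always contains exactly $K$ distinct indices with the requested marginals.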

3. Regret Bounds and Convergence Analysis

The theoretical foundation for AdamCB centers on analysis of cumulative regret with respect to the best parameter choice for the whole dataset. AdamCB achieves improved convergence rates over both uniform-sampling Adam and the single-arm bandit variant AdamBS (Liu et al., 2020).

  • Assumptions: (A1) bounded gradient norms $\|g_{i,t}\| \le L$; (A2) bounded parameter drift $\|\theta_s - \theta_t\| \le D$.
  • Main Regret Bound:

$$R_T = O\left(dL\alpha\sqrt{T}\right) + O\left(dL\alpha\,(nT\ln(n/K))^{1/4}\right)$$

where $d$ is the parameter dimension, $L$ the gradient bound, $\alpha$ the learning rate, $T$ the number of rounds, $n$ the dataset size, and $K$ the batch size.

  • Interpretation: The $O(dL\alpha\sqrt{T})$ term aligns with uniform Adam's regret. The $(nT\ln(n/K))^{1/4}$ term embodies the benefit of combinatorial bandit feedback, yielding an exponent improvement in $n$ over conventional or non-combinatorial bandit schemes and further enhancement with increasing $K$.

Supporting lemmas (see (Kim et al., 7 Dec 2025)) delineate online regret decomposition and combinatorial semi-bandit bounds, ensuring that AdamCB's unbiased gradient estimator and adaptive sampling substantially bolster convergence, especially as dataset size and batch size scale.

A plausible implication is that for heterogeneous data, AdamCB can outperform fixed-importance or single-arm bandit sampling by leveraging sample selection diversity and variance reduction.
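A quick way to build intuition for the bound is to compare its two terms numerically, dropping the shared $dL\alpha$ factor (the values below are illustrative, not taken from the paper):

```python
import math

def regret_terms(n, K, T):
    """Relative size of the two regret components, constants dropped."""
    adam_term = math.sqrt(T)                         # O(sqrt(T)), shared with Adam
    bandit_term = (n * T * math.log(n / K)) ** 0.25  # combinatorial-bandit term
    return adam_term, bandit_term

# MNIST-scale example: n = 50,000 samples, K = 128, T = 10,000 rounds.
a, b = regret_terms(50_000, 128, 10_000)
```

Because the bandit term grows only as $T^{1/4}$, the $\sqrt{T}$ term dominates for long runs, so asymptotic per-round regret matches Adam's, while the bandit term's fourth-root dependence on $n$ reflects the exponent improvement discussed above.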

4. Empirical Evaluation

Extensive experiments validate AdamCB’s performance across multiple datasets and architectures:

| Dataset | Model Architecture | Relative Performance |
| --- | --- | --- |
| MNIST | MLP (512, 256 units) | Fastest training and test loss decay |
| Fashion-MNIST | CNN (2 blocks + FC) | Fastest convergence |
| CIFAR-10 | VGG-style (16 layers) | Best loss after 50 epochs |
  • Baselines: Adam (uniform sampling), AMSGrad, AdamX (regret-corrected Adam), AdamBS (bandit sampling).
  • Hyperparameters: $\alpha = 10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, $\eta = 0.4$, batch size $K = 128$.

AdamCB consistently demonstrates accelerated convergence over all tested methods, especially notable in scenarios with pronounced sample heterogeneity. Non-combinatorial bandit sampling (AdamBS) exhibits instability and performance variance, which AdamCB mitigates.

Ablation analyses further show sensitivity to batch size and the exploration parameter $\eta$. AdamCB’s advantage grows with larger $K$, whereas single-arm bandit methods do not benefit from increased batch size. The optimal $\eta$ is observed around $0.4$ with mild overall sensitivity.

5. Practical Considerations and Implications

AdamCB is designed for large-scale and heterogeneously informative datasets.

  • Computational Overhead: DepRound batch selection and weight updates cost $O(n)$ per iteration, representing negligible overhead compared to deep neural model forward/backward passes; total wall-clock increase is bounded by 10%.
  • Scalability: The architecture supports datasets with millions of samples through efficient weight and allocation management. All operations are trivially parallelizable across CPU/GPU resources.
  • Application Scenarios: AdamCB is especially suitable when
    • Gradient magnitude varies markedly across data points (e.g., imbalanced or noisy data)
    • Large-batch stochastic approximation is desired but variance control is critical
    • Theoretical guarantees combining bandit-based regret and Adam-style convergence are required
  • Mechanistic Advantage: Focusing computation on examples with substantial gradients directs optimization resources to challenging or informative instances, thus reducing variance and accelerating descent. The algorithm’s theoretical improvement in convergence exponents with respect to nn, KK, and TT translates into practical gains as demonstrated in the experimental suite.
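The variance-reduction claim is straightforward to check numerically. The toy comparison below (my own illustration, using single-draw sampling with replacement rather than the paper's DepRound batches) contrasts a uniform-sampling importance estimator with one whose probabilities are proportional to per-example gradient norms on heterogeneous data:

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy per-example "gradients": a few large, many small (heterogeneous data).
g = np.concatenate([rng.normal(0.0, 5.0, size=(5, 3)),
                    rng.normal(0.0, 0.1, size=(95, 3))])
n = len(g)

def estimator_variance(probs, trials=20_000):
    """Empirical variance of the importance-weighted estimate of sum_i g_i."""
    idx = rng.choice(n, size=trials, p=probs)
    est = g[idx] / probs[idx, None]  # each draw is unbiased for g.sum(axis=0)
    return est.var(axis=0).sum()

uniform = np.full(n, 1.0 / n)
norms = np.linalg.norm(g, axis=1)
proportional = norms / norms.sum()  # sample high-gradient examples more often
```

Sampling proportionally to gradient norms reduces the estimator's variance substantially on this data; the EXP3-style weights chase the same effect adaptively, without knowing the norms in advance.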

This suggests AdamCB is particularly advantageous in deep learning contexts characterized by class imbalance, long-tailed data, or regimes in which efficient, variance-controlled optimization is essential.

6. Relationship to Previous Bandit-Based Adam Variants

AdamCB extends and improves upon the non-combinatorial bandit sampling approach of AdamBS (“Adam with Bandit Sampling for Deep Learning,” Liu et al.) (Liu et al., 2020), addressing its limited theoretical guarantees and instability with rigorous combinatorial mechanisms. Unlike AdamBS—which samples batch members independently with replacement and uses single-arm feedback—AdamCB employs DepRound for dependent batch selection, aggregates feedback over subsets, and leverages combinatorial semi-bandit theory for regret bound enhancements. This is reflected both in the refined convergence rate (from $n^{-1/2}$ or $O(\sqrt{\log n/T})$ under AdamBS to $n^{-3/4}$ or $O((nT\ln(n/K))^{1/4})$ under AdamCB in key regimes) and in superior empirical performance.

AdamCB thereby unifies the adaptive moment estimation of Adam with a theoretically robust combinatorial batch selection protocol, yielding performance enhancements substantiated both analytically and experimentally (Kim et al., 7 Dec 2025).
