Adaptive Span Generator Overview
- Adaptive Span Generator is a mechanism that adaptively selects variable-length regions to focus computational or measurement resources, applicable in diverse domains.
- In quantum variational algorithms, it reduces sampling by early elimination of suboptimal candidates, achieving measurement reductions of up to 93% while maintaining accuracy.
- In zero-shot video retrieval and image matching, ASG efficiently limits candidate spans and localizes attention through data-driven uncertainty estimates, enhancing speed and precision.
An Adaptive Span Generator (ASG) denotes a family of mechanisms that adaptively define or select variable-length regions, spans, or attention windows within a larger search or computation space. Such mechanisms have been independently developed and leveraged in several research domains, most notably quantum variational algorithms, zero-shot video moment retrieval, and adaptive attention for image matching. Despite differences in instantiation, they share the objective of focusing computational or measurement resources onto the most promising subregions—thus offering major reductions in measurement or computational budget without sacrificing the quality of the solution.
1. Adaptive Span Generator in Quantum Variational Algorithms
In adaptive variational quantum algorithms, for example ADAPT-VQE, the ASG conceptually emerges during generator selection, where the iterative ansatz construction requires identifying the operator (generator) from a candidate pool with the maximal energy gradient. For a candidate generator A_k, the energy gradient is ∂E/∂θ_k = ⟨ψ|[H, A_k]|ψ⟩, estimated by repeated quantum measurements over decomposed fragments (Huang et al., 18 Sep 2025).
The identification of the optimal generator is recast as a Best Arm Identification (BAI) problem in the multi-armed bandit (MAB) formalism, where each candidate operator is an arm with an unknown mean "reward" μ_k, the magnitude of its energy gradient. Noisy estimates are observed due to quantum sampling shot noise. The goal is to identify k* = argmax_k μ_k with high confidence while minimizing the total sample count.
The Successive Elimination (SE) algorithm is employed: in each round, a batch of measurements is allocated to each surviving candidate, empirical means and their confidence intervals are updated using sub-Gaussian concentration, and candidates whose upper confidence bound falls below the maximal lower bound are eliminated. This approach concentrates measurements on promising candidates, early-eliminates suboptimal operators, and provides sample-complexity bounds scaling as Σ_{k≠k*} Δ_k⁻² log(1/δ), where Δ_k = μ_{k*} − μ_k is the gap to the true maximum and δ the allowed failure probability (Huang et al., 18 Sep 2025).
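The round structure above can be sketched in a few lines. This is a generic Successive Elimination loop under a sub-Gaussian (Hoeffding-style) confidence radius, not the paper's exact implementation; the `pull` callback, batch size, and confidence schedule are illustrative assumptions.

```python
import numpy as np

def successive_elimination(pull, n_arms, delta=0.05, batch=32, max_rounds=200):
    """Best-arm identification via Successive Elimination.

    `pull(k, n)` returns n noisy reward samples for arm k (e.g. shot-noise
    estimates of a generator's gradient magnitude).  Arms whose upper
    confidence bound falls below the best lower bound are dropped, so
    measurements concentrate on the surviving candidates.
    """
    alive = list(range(n_arms))
    sums = np.zeros(n_arms)
    counts = np.zeros(n_arms, dtype=int)
    for _ in range(max_rounds):
        # Allocate a measurement batch to each surviving arm.
        for k in alive:
            sums[k] += pull(k, batch).sum()
            counts[k] += batch
        means = sums[alive] / counts[alive]
        # Hoeffding/sub-Gaussian confidence radius (rewards assumed O(1)-scaled).
        rad = np.sqrt(np.log(4 * n_arms * counts[alive] ** 2 / delta)
                      / (2 * counts[alive]))
        best_lcb = (means - rad).max()
        # Eliminate arms whose UCB falls below the best LCB.
        alive = [k for k, m, r in zip(alive, means, rad) if m + r >= best_lcb]
        if len(alive) == 1:
            break
    return alive[0], int(counts.sum())

# Simulated arms: arm 2 has the largest mean "gradient".
rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.9, 0.4])
pull = lambda k, n: true_means[k] + 0.1 * rng.standard_normal(n)
best, shots = successive_elimination(pull, 4)
```

Because the confidence radius is conservative, the loop never discards the true best arm except with probability at most δ, while clearly suboptimal arms are dropped after only a few batches.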
Numerical experiments on molecular systems (H₄, LiH, BeH₂ using UCCSD, qubit, and qubit-excitation pools) demonstrate measurement reductions of 69–93% (depending on the pool) to reach chemical accuracy without compromising final ground-state energy, with the benefit persisting as system size increases.
2. Adaptive Span Generator for Zero-Shot Long Video Moment Retrieval
In long-video moment retrieval (LVMR), especially under zero-shot conditions, the search for relevant temporal segments using a natural language query faces a candidate explosion when employing naive fixed-window or sliding-stride approaches. For a video of T frames, window size W, and stride s, each window size contributes ⌊(T − W)/s⌋ + 1 candidates, so enumerating several window sizes yields a count that grows linearly in T/s per scale, resulting in thousands of spans and overwhelming compute/memory costs (Jeon et al., 11 Dec 2025).
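The scaling is easy to make concrete. The window sizes and stride below are illustrative choices, not the paper's settings; the point is only that a multi-scale sliding window over an 18,000-frame video already produces thousands of candidates.

```python
def sliding_window_candidates(T, window_sizes, stride):
    """Count the spans produced by a naive multi-scale sliding window:
    each window size W contributes floor((T - W) / stride) + 1 candidates."""
    return sum((T - W) // stride + 1 for W in window_sizes if W <= T)

# Hypothetical settings for a one-hour, 18,000-frame video.
n = sliding_window_candidates(18_000, window_sizes=[150, 300, 600, 1200], stride=30)
# Already in the thousands, before any scoring work is done.
```

Every one of these spans would then need to be scored against the query, which is the cost the ASG's peak-based generation avoids.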
The ASG, as featured in the Point-to-Span (P2S) framework, addresses this explosion by detecting "peaks" in the similarity signal s(t) between the query embedding and the per-frame video features. Around each peak, an adaptive expansion forms a variable-length candidate span: the expansion width is governed by a signal-adaptive ratio based on the standard deviation σ of s(t), and span boundaries are set by thresholding the refined (smoothed) similarity signal at a level adaptively scaled to the peak height.
Algorithmically, ASG runs in O(T) time, involving cumulative-sum smoothing, peak finding (e.g., with scipy.signal.find_peaks), and threshold-based expansion per peak. Empirically, for one-hour, 18,000-frame videos, the approach reduces candidates by an order of magnitude (a >10× speed-up), with span generation and the full pipeline (excluding LLM calls) each completing on the order of seconds (Jeon et al., 11 Dec 2025).
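The smoothing, peak-finding, and expansion steps can be sketched as follows. This is a minimal illustration of the idea, not the P2S implementation: the smoothing width, the use of the signal's standard deviation as peak prominence, and the `tau_ratio` threshold scaling are all assumptions made for the sketch.

```python
import numpy as np
from scipy.signal import find_peaks
from scipy.ndimage import uniform_filter1d

def adaptive_spans(sim, smooth=5, tau_ratio=0.7):
    """Generate variable-length candidate spans from a 1-D query-video
    similarity signal `sim` (one score per frame).

    Smooth the signal, find salient peaks, then expand each peak outward
    while the smoothed score stays above a threshold scaled to the peak
    height.  All runs in O(T) plus the peak search.
    """
    s = uniform_filter1d(np.asarray(sim, dtype=float), size=smooth)
    # Signal-adaptive prominence: only peaks that stand out from the noise.
    peaks, _ = find_peaks(s, prominence=s.std())
    mu = s.mean()
    spans = []
    for p in peaks:
        tau = mu + tau_ratio * (s[p] - mu)  # threshold scaled to peak height
        lo = p
        while lo > 0 and s[lo - 1] >= tau:
            lo -= 1
        hi = p
        while hi < len(s) - 1 and s[hi + 1] >= tau:
            hi += 1
        spans.append((lo, hi))
    return spans

# Synthetic similarity signal with two relevant moments (frames ~100 and ~350).
t = np.arange(500)
sim = np.exp(-((t - 100) / 20) ** 2) + 0.6 * np.exp(-((t - 350) / 40) ** 2)
spans = adaptive_spans(sim)
```

Note how the two spans get different widths: the broader, lower second bump yields a wider span than the sharp first one, with no per-dataset window size to tune.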
Ablation studies demonstrate that ASG is competitive with window methods tuned per dataset while remaining robust across datasets and use cases, obviating per-dataset hyperparameter tuning. On MomentSeeker, ASG achieves an average of 36.51% across R1/R5, outperforming prior zero-shot and many supervised baselines.
3. Adaptive Span Generator in Detector-Free Image Matching
In ASpanFormer for image matching (Chen et al., 2022), the ASG operates within hierarchical attention modules that align image features without detectors. At each cross-attention block, a flow map is regressed from the source feature map: for each pixel i, a small MLP predicts a flow center Φ_i = (x_i, y_i) and log-scale "uncertainties" (log σ_x, log σ_y), where Φ_i defines the center for attention and the uncertainties the spatial span.
The attention window's half-axes are set as w_x = α·σ_x and w_y = α·σ_y, where σ_x, σ_y are the exponentiated log-scale predictions and α is a fixed "span coefficient". For each pixel or cell, a uniform grid samples keys/values within the adaptively sized rectangle, enabling attention over localized, uncertainty-driven contexts. Matching is computed via softmax attention within this dynamically predicted region, followed by hierarchical aggregation (global and local) over pyramid levels.
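The window construction can be sketched for a single query pixel. The span coefficient `alpha=5.0` and the 5×5 grid are illustrative values, not the paper's exact settings; the essential property is that the number of sampled locations is constant while their physical extent tracks the predicted uncertainty.

```python
import numpy as np

def adaptive_window_grid(center, log_sigma, alpha=5.0, grid=5):
    """Build the key/value sampling grid for one query pixel.

    `center`    = predicted flow target (x, y) in the other image.
    `log_sigma` = predicted log-scale uncertainties (log sx, log sy).
    Half-axes are alpha * exp(log_sigma); a fixed `grid` x `grid` uniform
    lattice keeps per-pixel attention cost constant regardless of the
    window's physical size.
    """
    cx, cy = center
    hx, hy = alpha * np.exp(np.asarray(log_sigma))  # window half-axes
    xs = np.linspace(cx - hx, cx + hx, grid)        # uniform sample positions
    ys = np.linspace(cy - hy, cy + hy, grid)
    gx, gy = np.meshgrid(xs, ys)
    return np.stack([gx, gy], axis=-1)              # (grid, grid, 2) locations

# A confident pixel gets a tight window; an uncertain one, a wide window.
g_sure = adaptive_window_grid((10.0, 20.0), (0.0, 0.0))
g_unsure = adaptive_window_grid((10.0, 20.0), (1.0, 1.0))
```

Attention then runs over the (interpolated) features at these grid locations, so cost per pixel stays O(G²C) whether the window covers ten pixels or a thousand.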
Compared to fixed-span approaches, whose local attention costs O(r²C) per query for an r-pixel radius and C channels, the ASG keeps the local attention cost at O(G²C) via constant-sized G×G adaptive grids. In practice, ASpanFormer's attention adds 16 ms overhead (total 113 ms) versus LoFTR's 98 ms but yields a measurable AUC@5° gain on ScanNet (Chen et al., 2022).
Extensions are plausible for video correspondence (motion-uncertainty), point cloud registration (alignment-uncertainty), and NLP sequence attention (per-token span-need).
4. Comparative Table of ASG Algorithms Across Domains
| Domain/Task | Key Mechanism | Resource Reduction Mechanism |
|---|---|---|
| Quantum Variational Algorithms | Successive Elimination on measurement arms (generators) | Focused measurements, early elim. |
| Long Video Moment Retrieval | Adaptive peak-finding and expansion in similarity signal | Reduced candidate spans, O(T) speed |
| Image Matching (ASpanFormer) | Local window size from regressed pixel uncertainty | Localized adaptive attention regions |
This comparative summary illustrates the unifying theme: spanning and focusing mechanisms, dynamically determined by data-driven uncertainty or salience, resulting in resource-efficient inference.
5. Implementation and Efficiency Considerations
In quantum settings, noise-robust confidence intervals (Hoeffding/sub-Gaussian) are used, and the method remains practical on quantum hardware because shot resources are concentrated on promising arms while unpromising ones are eliminated after preliminary measurements (Huang et al., 18 Sep 2025). In long videos, ASG transforms a superlinear search over spans into a near-linear bottleneck, with no hyperparameter tuning required for variable dataset statistics (Jeon et al., 11 Dec 2025). For image matching, the sampling grid for local attention is adaptively expanded or contracted per pixel, supporting piece-wise smoothness and robust matching under local uncertainty without an explosion of computational cost (Chen et al., 2022).
Integration with pipelines is direct: in variational quantum eigensolver (VQE), selection and parameter update alternate after each generator is chosen. In video retrieval, candidate spans from ASG are subject to multi-evidence refinement and query decomposition. In image matching, adaptive local attention melds with hierarchical transformer architectures.
6. Extensions and Generalizations
The essential motif of the Adaptive Span Generator—regressing a relevant coordinate (center) and an uncertainty (span), then attending or searching only within the so-defined region—is applicable across domains:
- Spatiotemporal attention in video models via motion-uncertainty prediction.
- Adaptive neighborhood sampling for point cloud graph attention via alignment-uncertainty.
- Sequence models predicting per-token "span-need" gating for efficient self-attention.
- Adaptive spatial context in image segmentation using per-pixel segmentation uncertainty.
This generalization suggests that ASG frameworks can be ported wherever fine-grained focus of computational resources is advantageous and uncertainty estimates can be predicted or estimated.
7. Significance and Empirical Outcomes
Empirical evidence consistently demonstrates the utility of ASG methodologies:
- In quantum simulation, generator-selection measurement counts were reduced by 69–93% without loss of accuracy as system size increased (Huang et al., 18 Sep 2025).
- For zero-shot moment retrieval, candidate reduction delivered more than a 10× speedup, eliminating the need for sliding-window heuristics and outperforming previous supervised and zero-shot methods on long-video metrics (Jeon et al., 11 Dec 2025).
- In image matching, adaptive attention due to ASG contributed to state-of-the-art correspondence accuracy with modest overhead (Chen et al., 2022).
A plausible implication is that ASG-style adaptations are broadly impactful for any domain where focused selection or efficient local attention is a limiting computational or measurement cost. The technique’s flexibility in handling uncertainty and salience estimation points to its continued applicability in future adaptive inference frameworks across computational science.