Debiased Evidential Learning for Moment Retrieval
- The paper introduces DEMR, a framework that integrates evidential learning, Reflective Flipped Fusion, and query reconstruction to address cross-modal imbalance and miscalibrated uncertainties.
- The methodology employs a progressive fusion strategy and a geometric regularizer to align video-text features and ensure lower prediction error with increased evidential support.
- Empirical results on benchmarks like QVHighlights and Charades-STA show that DEMR outperforms conventional models, achieving higher retrieval accuracy and enhanced robustness under debiased evaluations.
Debiased Evidential Learning for Moment Retrieval (DEMR) is a framework designed for robust temporal localization of moments in videos based on natural language queries. It resolves foundational challenges in moment retrieval—most notably, cross-modal imbalance and miscalibrated uncertainty estimation—by integrating evidential learning, a progressive fusion strategy, and an auxiliary query-grounding task. DEMR improves both retrieval accuracy and calibration under standard and debiased benchmarks, providing enhanced temporal-semantic robustness (Huang et al., 30 Nov 2025).
1. Formal Problem Setup and Task Definition
The objective of moment retrieval is to predict the temporal segment within a video $V=\{v_i\}_{i=1}^{L_v}$ of $L_v$ clips that matches a natural-language query $Q=\{q_j\}_{j=1}^{L_q}$ of $L_q$ tokens. The ground truth is a segment specified by clip indices $m=(t_s, t_e)$ with $t_s \le t_e$. The system generates candidate segments $\{\tilde m_i\}$, ranks them, and selects the top prediction by retrieval score.
The standard supervised loss comprises two terms over foreground clips ($f_i=1$): $\mathcal{L}_{mr} = \sum_{i=1}^{L_v} \mathds{1}_{f_i=1} \left[ \lambda_{L1}\, \mathcal{L}_{\mathrm{SmoothL1}}\bigl(\tilde d_i, d_i\bigr) + \lambda_{\mathrm{iou}}\, \mathcal{L}_{\mathrm{gIoU}}(\tilde m_i, m_i) \right],$ where $\tilde d_i$ and $d_i$ are the predicted and ground-truth boundary offsets at clip $i$, and $\tilde m_i$ and $m_i$ the corresponding segments.
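The two loss terms can be sketched in a few lines. This is a minimal, framework-free illustration, assuming segments are represented as `(start, end)` pairs; the function names and the per-boundary application of Smooth-L1 are assumptions, not the paper's exact implementation.

```python
def smooth_l1(pred, target, beta=1.0):
    """Smooth-L1 (Huber) loss between scalar boundary offsets."""
    diff = abs(pred - target)
    return 0.5 * diff ** 2 / beta if diff < beta else diff - 0.5 * beta

def giou_loss_1d(pred, target):
    """1 - gIoU for temporal segments given as (start, end) pairs."""
    ps, pe = pred
    ts, te = target
    inter = max(0.0, min(pe, te) - max(ps, ts))
    union = (pe - ps) + (te - ts) - inter
    iou = inter / union if union > 0 else 0.0
    enclose = max(pe, te) - min(ps, ts)  # smallest interval covering both
    giou = iou - (enclose - union) / enclose if enclose > 0 else iou
    return 1.0 - giou

def moment_loss(pred_segs, gt_segs, fg_mask, lam_l1=1.0, lam_iou=1.0):
    """L_mr: weighted Smooth-L1 + gIoU, summed over foreground clips only."""
    total = 0.0
    for p, g, fg in zip(pred_segs, gt_segs, fg_mask):
        if not fg:
            continue
        l1 = smooth_l1(p[0], g[0]) + smooth_l1(p[1], g[1])
        total += lam_l1 * l1 + lam_iou * giou_loss_1d(p, g)
    return total
```

Note that gIoU, unlike plain IoU, still provides a gradient for disjoint segments via the enclosing-interval penalty.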
2. Deep Evidential Regression Baseline and Its Limitations
The vanilla baseline for uncertainty-aware localization uses Deep Evidential Regression (DER), modeling each boundary target $y$ under a Normal-Inverse-Gamma prior: $y \sim \mathcal{N}(\mu, \sigma^2)$, $\mu \sim \mathcal{N}(\gamma, \sigma^2/\nu)$, $\sigma^2 \sim \Gamma^{-1}(\alpha, \beta)$, with evidential parameters $(\gamma, \nu, \alpha, \beta)$ output by a linear head. The marginal likelihood is Student-$t$, yielding a negative log-likelihood training loss: $\mathcal{L}_{\mathrm{NLL}} = \tfrac{1}{2}\log\tfrac{\pi}{\nu} - \alpha\log\Omega + \bigl(\alpha+\tfrac{1}{2}\bigr)\log\bigl(\nu(y-\gamma)^2 + \Omega\bigr) + \log\tfrac{\Gamma(\alpha)}{\Gamma(\alpha+1/2)}, \quad \Omega = 2\beta(1+\nu),$ and a heuristic uncertainty regularizer: $\mathcal{L}_{R} = |y-\gamma|\cdot(2\nu+\alpha).$ Aleatoric and epistemic uncertainties are, respectively, $\mathbb{E}[\sigma^2] = \beta/(\alpha-1)$ and $\mathrm{Var}[\mu] = \beta/(\nu(\alpha-1))$. The total baseline loss (summed over $N$ samples) is $\mathcal{L}_{\mathrm{der}} = \sum_{n=1}^{N}\bigl(\mathcal{L}_{\mathrm{NLL}}^{(n)} + \lambda\,\mathcal{L}_{R}^{(n)}\bigr).$
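The DER quantities above follow the standard formulation of Amini et al., which the baseline adopts; a minimal sketch (function names are illustrative):

```python
import math

def nig_nll(y, gamma, nu, alpha, beta):
    """Student-t negative log-likelihood of target y under a
    Normal-Inverse-Gamma evidential head with parameters (gamma, nu, alpha, beta)."""
    omega = 2.0 * beta * (1.0 + nu)
    return (0.5 * math.log(math.pi / nu)
            - alpha * math.log(omega)
            + (alpha + 0.5) * math.log(nu * (y - gamma) ** 2 + omega)
            + math.lgamma(alpha) - math.lgamma(alpha + 0.5))

def heuristic_reg(y, gamma, nu, alpha):
    """Vanilla DER regularizer: error scaled by total evidence. Its gradient
    w.r.t. gamma depends only on the error sign, ignoring accumulated evidence,
    which is the structural flaw DEMR targets."""
    return abs(y - gamma) * (2.0 * nu + alpha)

def aleatoric(alpha, beta):
    return beta / (alpha - 1.0)          # E[sigma^2], requires alpha > 1

def epistemic(nu, alpha, beta):
    return beta / (nu * (alpha - 1.0))   # Var[mu]
```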
However, two major issues emerge: (A) cross-modal (video–text) imbalance, where uncertainty is unduly sensitive to noise in the visual modality; and (B) a heuristic uncertainty regularizer whose gradient depends solely on prediction error and ignores the current evidence, causing pathological uncertainty assignments.
3. Core Methodological Advances in DEMR
3.1 Reflective Flipped Fusion (RFF) Block
The RFF block is a multi-layer cross-modal alignment module. Across layers, the video and text branches alternate roles in a cross-attention (CA) operation followed by self-attention (SA): at each layer, one modality supplies the queries while the other supplies keys and values, and the assignment is swapped at the next layer. By "flipping" the query–key–value structure between modalities layer by layer, RFF progressively aligns video and text, mitigating over-reliance on a single modality.
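The alternation can be sketched with single-head attention. This is a simplified illustration, assuming no learned projections, residual connections, multi-head splitting, or feed-forward sublayers, all of which a real transformer-style block would include:

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q_feats, kv_feats):
    """Single-head scaled dot-product attention: queries from one stream,
    keys/values from another (cross-attention) or the same stream (self-attention)."""
    d = q_feats.shape[-1]
    scores = q_feats @ kv_feats.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ kv_feats

def rff(video, text, n_layers=4):
    """Reflective Flipped Fusion sketch: at each layer one modality attends
    to the other (CA), then self-attends (SA); the querying role flips."""
    for layer in range(n_layers):
        if layer % 2 == 0:   # video queries text
            h = attention(video, text)
            video = attention(h, h)
        else:                # roles flipped: text queries video
            h = attention(text, video)
            text = attention(h, h)
    return video, text
```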
3.2 Query Reconstruction (QR) Auxiliary Task
To enhance the linguistic grounding of the fusion module, DEMR incorporates a QR pretraining phase. Randomly masked noun tokens in the query $Q$ are reconstructed after RFF processing via a lightweight QR head, trained with a reconstruction loss of the form $\mathcal{L}_{qr} = -\sum_{q_j \in \mathcal{M}} \log p\bigl(q_j \mid Q_{\mathcal{U}}, V\bigr),$ where $\mathcal{U}$ is the set of unmasked tokens and $\mathcal{M}$ the masked nouns. QR is trained for 30 epochs; the head is then frozen and the loss dropped.
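The reconstruction objective is a cross-entropy over the masked positions only. A minimal sketch, assuming the QR head emits a vocabulary logit vector per token position (the function name and input layout are illustrative):

```python
import math

def masked_reconstruction_loss(logits, targets, masked_idx):
    """-sum over masked positions of log softmax(logits[i])[targets[i]].
    `logits[i]` is the vocabulary score vector at token position i;
    unmasked positions contribute nothing."""
    total = 0.0
    for i in masked_idx:
        row = logits[i]
        m = max(row)
        log_z = m + math.log(sum(math.exp(s - m) for s in row))  # stable logsumexp
        total -= row[targets[i]] - log_z                          # -log p(target)
    return total
```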
3.3 Geometric Regularizer for Uncertainty Calibration
To address the structural limitation of the heuristic regularizer, DEMR normalizes the prediction error and the evidence to the unit interval and introduces a geometric regularizer whose gradient depends on both quantities, enforcing that increased evidence corresponds to low prediction error and vice versa. The evidential loss then replaces the heuristic term $\mathcal{L}_R$ with this geometric regularizer.
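To make the intended behavior concrete, here is an illustrative stand-in with the stated properties; the paper's exact formula is not reproduced here, so this particular penalty is an assumption, not DEMR's regularizer:

```python
def geometric_reg(err, evidence, err_max, ev_max):
    """Illustrative stand-in for a geometric uncertainty regularizer
    (NOT the paper's exact form). Error and evidence are normalized to
    [0, 1]; the penalty vanishes when high evidence coincides with low
    error (or low evidence with high error), and its gradient involves
    both the error and the current evidence, unlike the heuristic
    |y - gamma| * (2*nu + alpha) term."""
    e_hat = min(err / err_max, 1.0)
    phi_hat = min(evidence / ev_max, 1.0)
    # Penalize (high evidence, high error) and (low evidence, low error).
    return e_hat * phi_hat + (1.0 - e_hat) * (1.0 - phi_hat)
```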
4. Composite Loss and Training Strategy
The complete DEMR loss combines the moment-retrieval loss $\mathcal{L}_{mr}$ with the evidential terms (the NLL plus the geometric regularizer). Training proceeds in two stages: QR-only pretraining (30 epochs), followed by full DEMR training with the QR head frozen. Features are pre-extracted from CLIP (ViT-B/32) and SlowFast (ResNet-50) backbones. RFF blocks use 4 layers, with batch size 32 and the AdamW optimizer.
5. Empirical Evaluation and Results
DEMR is evaluated on both standard and debiased datasets:
| Dataset/Setting | [email protected] | [email protected] | mAP / [email protected] | IID–OOD Gap ([email protected] / [email protected]) |
|---|---|---|---|---|
| QVHighlights (val) | 65.0 | 49.4 | 43.0 | — |
| Charades-STA (val) | 60.2 | 38.0 | — | — |
| TACoS (val) | 37.3 | 19.4 | — | — |
| Charades-CD (i.i.d.) | 62.20 | 43.29 | — | — |
| Charades-CD (o.o.d.) | 52.46 | 30.97 | — | 9.74 / 12.32 |
On standard datasets, DEMR matches or exceeds strong baselines, while on debiased benchmarks it shows a lower IID–OOD gap than prior methods, indicating improved robustness to distributional shift. In ablation studies, each architectural component incrementally reduces modality imbalance and improves retrieval scores. Hyperparameter sweeps identify the best-performing regularizer weight and fusion depth, with QR pretraining optimal at 30–50 epochs. Calibration analysis shows DEMR yields monotonic uncertainty–error relationships and higher epistemic uncertainty in under-represented video regions.
6. Algorithmic Workflow and Code Availability
The DEMR pipeline is summarized as follows:
- Extract video and text features with CLIP+SlowFast.
- Stage 1 (QR Pretrain): Mask one noun per query, apply RFF, train the QR head via the reconstruction loss $\mathcal{L}_{qr}$ for 30 epochs.
- Stage 2 (DEMR Training): Freeze the QR head, run RFF, and split into MR and evidential heads. The evidential head predicts the parameters $(\gamma, \nu, \alpha, \beta)$ and is trained with the evidential loss; the MR head predicts boundary offsets and is trained with $\mathcal{L}_{mr}$.
- At inference, apply non-maximum suppression at IoU threshold 0.7, rank by retrieval score (optionally re-ranking by low aleatoric uncertainty), and output the top-1 segment.
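The inference-time suppression step is standard greedy temporal NMS; a minimal sketch (function names are illustrative):

```python
def iou_1d(a, b):
    """IoU of two temporal segments given as (start, end) pairs."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def nms_1d(segments, scores, thresh=0.7):
    """Greedy temporal NMS: keep the best-scoring segment, drop any remaining
    candidate overlapping a kept one above `thresh`, repeat. Returns indices
    of kept segments in descending score order."""
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou_1d(segments[i], segments[j]) <= thresh for j in keep):
            keep.append(i)
    return keep
```

The first kept index is the top-1 segment the pipeline outputs.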
Implementation code is available at https://github.com/KaijingOfficial/DEMR.
7. Significance and Impact
DEMR addresses two principal shortcomings of evidential approaches to moment retrieval—modality imbalance and structurally unsound regularization—by means of (1) a flip-flop cross-modal fusion mechanism, (2) targeted linguistic pretraining, and (3) an adaptive geometric regularizer. Empirical evaluation confirms significant improvements in effectiveness, calibration, and temporal-semantic robustness, particularly on debiased benchmarks where conventional models exhibit pronounced performance gaps (Huang et al., 30 Nov 2025). A plausible implication is that such debiasing strategies, especially geometric regularization for uncertainty, may generalize to other cross-modal and uncertainty-sensitive retrieval problems.