Debiased Evidential Learning for Moment Retrieval

Updated 7 December 2025
  • The paper introduces DEMR, a framework that integrates evidential learning, Reflective Flipped Fusion, and query reconstruction to address cross-modal imbalance and miscalibrated uncertainties.
  • The methodology employs a progressive fusion strategy and a geometric regularizer to align video-text features and ensure lower prediction error with increased evidential support.
  • Empirical results on benchmarks like QVHighlights and Charades-STA show that DEMR outperforms conventional models, achieving higher retrieval accuracy and enhanced robustness under debiased evaluations.

Debiased Evidential Learning for Moment Retrieval (DEMR) is a framework designed for robust temporal localization of moments in videos based on natural language queries. It resolves foundational challenges in moment retrieval—most notably, cross-modal imbalance and miscalibrated uncertainty estimation—by integrating evidential learning, a progressive fusion strategy, and an auxiliary query-grounding task. DEMR improves both retrieval accuracy and calibration under standard and debiased benchmarks, providing enhanced temporal-semantic robustness (Huang et al., 30 Nov 2025).

1. Formal Problem Setup and Task Definition

The objective of moment retrieval is to accurately predict temporal segments within a video $V = \{v_i\}_{i=1}^{L_v}$, where $v_i \in \mathbb{R}^D$, conditioned on a textual query $Q = \{q_j\}_{j=1}^{L_q}$ with $q_j \in \mathbb{R}^D$. The ground truth is a segment $m = [m^s, m^e]$ specified as clip indices with $1 \le m^s < m^e \le L_v$. The system generates candidate segments $\{\tilde m_i = [\tilde m_i^s, \tilde m_i^e]\}_{i=1}^{L_v}$, ranks them, and selects the top prediction by retrieval score.
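To fix notation, the following sketch shows the task interface with illustrative shapes; all dimensions, tensors, and the candidate construction are placeholders, not values from the paper:

```python
import torch

# Illustrative shapes only; L_v, L_q, D are placeholders, not the paper's values.
L_v, L_q, D = 75, 20, 512
V = torch.randn(L_v, D)   # video clip features v_i in R^D
Q = torch.randn(L_q, D)   # query token features q_j in R^D

# One candidate segment [start, end] per clip plus a retrieval score (random here).
cand = torch.stack([torch.arange(L_v, dtype=torch.float),
                    torch.arange(L_v, dtype=torch.float) + 2.0], dim=-1)
scores = torch.randn(L_v)

top1 = cand[scores.argmax()]  # the system outputs the top-ranked segment
```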

The standard supervised loss comprises two terms over foreground clips ($f_i = 1$):
$$\mathcal{L}_{mr} = \sum_{i=1}^{L_v} \mathbb{1}_{f_i=1} \left[ \lambda_{L1}\, \mathcal{L}_{\mathrm{SmoothL1}}\bigl(\tilde d_i, d_i\bigr) + \lambda_{\mathrm{iou}}\, \mathcal{L}_{\mathrm{gIoU}}(\tilde m_i, m_i) \right],$$
where $\tilde d_i = \tilde m_i^e - \tilde m_i^s$ and $d_i = m_i^e - m_i^s$.
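A PyTorch sketch of this loss, assuming a standard 1-D gIoU formulation and illustrative weights $\lambda_{L1} = \lambda_{\rm iou} = 1$ (the paper's settings are not given here):

```python
import torch
import torch.nn.functional as F

def giou_1d(pred, gt, eps=1e-8):
    """1-D generalized IoU for (N, 2) [start, end] segments (standard formulation)."""
    inter = (torch.min(pred[:, 1], gt[:, 1]) - torch.max(pred[:, 0], gt[:, 0])).clamp(min=0)
    union = (pred[:, 1] - pred[:, 0]) + (gt[:, 1] - gt[:, 0]) - inter
    hull = torch.max(pred[:, 1], gt[:, 1]) - torch.min(pred[:, 0], gt[:, 0])
    return inter / (union + eps) - (hull - union) / (hull + eps)

def mr_loss(pred, gt, fg_mask, lam_l1=1.0, lam_iou=1.0):
    """L_mr: Smooth-L1 on span lengths plus gIoU, summed over foreground clips."""
    d_pred = pred[:, 1] - pred[:, 0]              # predicted span length d~_i
    d_gt = gt[:, 1] - gt[:, 0]                    # ground-truth span length d_i
    l1 = F.smooth_l1_loss(d_pred, d_gt, reduction="none")
    iou = 1.0 - giou_1d(pred, gt)                 # gIoU loss = 1 - gIoU
    return ((lam_l1 * l1 + lam_iou * iou) * fg_mask).sum()
```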

2. Deep Evidential Regression Baseline and Its Limitations

The vanilla baseline for uncertainty-aware localization uses Deep Evidential Regression (DER), modeling each boundary $b_i \in \{m^s, m^e\}$ under a Normal-inverse-Gamma prior:
$$p(\mu, \sigma^2 \mid \gamma, \upsilon, \alpha, \beta) = \mathcal{N}\left(\mu \mid \gamma, \tfrac{\sigma^2}{\upsilon}\right)\, \Gamma^{-1}(\sigma^2 \mid \alpha, \beta),$$
with evidential parameters $(\gamma, \upsilon, \alpha, \beta)$ output by a linear head. The marginal likelihood is Student-$t$, yielding the negative log-likelihood training loss
$$\mathcal{L}_i^{\rm NLL} = -\log \mathrm{St}\left(b_i;\, \gamma,\, \frac{\beta(1+\upsilon)}{\upsilon\alpha},\, 2\alpha\right)$$
and a heuristic uncertainty regularizer
$$\mathcal{L}^R_i = \Delta_i\,\Phi_i, \qquad \Delta_i = |b_i - \gamma|, \qquad \Phi_i = 2\upsilon + \alpha.$$
The aleatoric and epistemic uncertainties are, respectively, $\mathbb{E}[\sigma^2] = \beta/(\alpha-1)$ and $\mathrm{Var}[\mu] = \beta/(\upsilon(\alpha-1))$. The total baseline loss (for $N$ samples) is
$$\mathcal{L}_{\rm base} = \mathcal{L}_{mr} + \lambda_{\rm der}\, \frac{1}{N} \sum_{i=1}^N \left( \lambda_{\rm NLL}\, \mathcal{L}_i^{\rm NLL} + \lambda_{\rm Reg}\, \mathcal{L}_i^R \right).$$
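A sketch of this baseline in PyTorch, assuming softplus activations to enforce the parameter constraints (the text specifies only a linear head) and the standard Student-$t$ NLL expansion with $\Omega = 2\beta(1+\upsilon)$:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EvidentialHead(nn.Module):
    """Linear head emitting NIG parameters (γ, υ, α, β); activation choices are assumptions."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(dim, 4)

    def forward(self, h):
        gamma, v, a, b = self.proj(h).unbind(-1)
        upsilon = F.softplus(v)        # enforce υ > 0
        alpha = F.softplus(a) + 1.0    # enforce α > 1
        beta = F.softplus(b)           # enforce β > 0
        return gamma, upsilon, alpha, beta

def der_nll(b_i, gamma, upsilon, alpha, beta):
    """Negative log-likelihood of the Student-t marginal of the NIG prior."""
    omega = 2.0 * beta * (1.0 + upsilon)
    return (0.5 * torch.log(torch.pi / upsilon)
            - alpha * torch.log(omega)
            + (alpha + 0.5) * torch.log(upsilon * (b_i - gamma) ** 2 + omega)
            + torch.lgamma(alpha) - torch.lgamma(alpha + 0.5))

def uncertainties(upsilon, alpha, beta):
    aleatoric = beta / (alpha - 1.0)              # E[σ²]
    epistemic = beta / (upsilon * (alpha - 1.0))  # Var[μ]
    return aleatoric, epistemic
```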

However, two major issues emerge: (A) cross-modal (video-text) imbalance, in which the uncertainty is unduly sensitive to noise in the visual modality; and (B) an uncertainty regularizer whose gradient depends solely on the prediction error and ignores the current evidence, causing pathological uncertainty assignments.

3. Core Methodological Advances in DEMR

3.1 Reflective Flipped Fusion (RFF) Block

The RFF block is a multi-layer cross-modal alignment module. Across $n$ layers, the video and text branches alternate roles in a cross-attention (CA) operation followed by self-attention (SA). For layer $i$:
$$\begin{aligned} CA_{v\to q}^{(i)} &= \mathrm{Softmax}\left(\frac{V^{(i)} Q^{(i)\top}}{\sqrt{d_k}}\right) Q^{(i)}, & SA^{(i)}_q(X) &= \mathrm{Softmax}\left(\frac{X X^\top}{\sqrt{d_k}}\right) X, & Q^{(i+1)} &= SA^{(i)}_q\left(CA_{v\to q}^{(i)}\right), \\ CA_{q\to v}^{(i)} &= \mathrm{Softmax}\left(\frac{Q^{(i)} V^{(i)\top}}{\sqrt{d_k}}\right) V^{(i)}, & SA^{(i)}_v(Y) &= \mathrm{Softmax}\left(\frac{Y Y^\top}{\sqrt{d_k}}\right) Y, & V^{(i+1)} &= SA^{(i)}_v\left(CA_{q\to v}^{(i)}\right). \end{aligned}$$
By "flipping" the query-key-value structure between modalities at each layer, RFF progressively aligns video and text, mitigating over-reliance on a single modality.
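A single-head sketch of one RFF layer following these equations; the actual block may add projections, residual connections, or multi-head attention, none of which is specified here, and the helper names (`attend`, `rff_layer`) are illustrative:

```python
import torch

def attend(A, B, d_k=512):
    """Softmax(A Bᵀ / √d_k) B — the attention primitive in the RFF equations."""
    return torch.softmax(A @ B.transpose(-2, -1) / d_k ** 0.5, dim=-1) @ B

def rff_layer(V, Q, d_k=512):
    """One RFF layer per the equations above; note the streams swap roles
    (the "flip"), so each fused stream feeds the opposite branch."""
    ca_vq = attend(V, Q, d_k)           # CA_{v→q}: video attends over text tokens
    ca_qv = attend(Q, V, d_k)           # CA_{q→v}: text attends over video clips
    Q_next = attend(ca_vq, ca_vq, d_k)  # SA_q on the fused stream → Q^{(i+1)}
    V_next = attend(ca_qv, ca_qv, d_k)  # SA_v on the fused stream → V^{(i+1)}
    return V_next, Q_next
```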

3.2 Query Reconstruction (QR) Auxiliary Task

To enhance the linguistic grounding of the fusion module, DEMR incorporates a QR pretraining phase. Randomly masked noun tokens in $Q$ are reconstructed after RFF processing via a lightweight QR head:
$$\mathcal{L}_{qr} = \mathbb{E}\left[ -\sum_{k=1}^{l} \log P(w_k \mid U, V) \right],$$
where $U$ is the set of unmasked tokens and $w_k$ are the masked nouns. QR is trained for 30 epochs; the head is then frozen and the loss dropped.
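A minimal sketch of the QR head as a linear classifier over the token vocabulary; the vocabulary size and masking bookkeeping are assumptions, not details from the text:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QRHead(nn.Module):
    """Lightweight query-reconstruction head (sketch): predicts each masked noun
    from its fused RFF token; vocab=30522 is an assumed BERT-style value."""
    def __init__(self, dim=512, vocab=30522):
        super().__init__()
        self.cls = nn.Linear(dim, vocab)

    def forward(self, fused_q, masked_pos, target_ids):
        # fused_q: (L_q, D) query tokens after RFF; masked_pos: indices of masked nouns
        logits = self.cls(fused_q[masked_pos])      # (l, vocab)
        return F.cross_entropy(logits, target_ids)  # ≈ -Σ_k log P(w_k | U, V)
```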

3.3 Geometric Regularizer for Uncertainty Calibration

To address the structural limitation of the heuristic regularizer, DEMR normalizes the prediction error $\Delta_i$ and evidence $\Phi_i$ to $\bar\Delta_i, \bar\Phi_i \in [0,1]$ and introduces a geometric regularizer
$$\mathcal{L}^L_i = \left\| \bar\Phi_i + \bar\Delta_i - 1 \right\|_2^2,$$
enforcing that increased evidence corresponds to low prediction error and vice versa. The evidential loss becomes
$$\mathcal{L}_i^e = \lambda_{\rm NLL}\, \mathcal{L}_i^{\rm NLL} + \lambda_{\rm geom}\, \mathcal{L}_i^L.$$
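A sketch of the regularizer, assuming batch-wise min-max normalization for $\bar\Delta_i$ and $\bar\Phi_i$ (the exact normalization scheme is not specified in this summary):

```python
import torch

def geometric_reg(delta, phi, eps=1e-8):
    """L^L_i = (Φ̄_i + Δ̄_i - 1)² with Δ, Φ min-max normalized to [0, 1]
    over the batch (the normalization choice is an assumption)."""
    d_bar = (delta - delta.min()) / (delta.max() - delta.min() + eps)
    p_bar = (phi - phi.min()) / (phi.max() - phi.min() + eps)
    return (p_bar + d_bar - 1.0) ** 2  # per-sample penalty

# delta = |b_i - gamma| and phi = 2*upsilon + alpha, per boundary prediction.
```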

4. Composite Loss and Training Strategy

The complete DEMR loss function is
$$\mathcal{L} = \mathcal{L}_{mr} + \lambda_{\rm der}\, \frac{2}{N} \sum_{i=1}^N \left( \lambda_{\rm NLL}\, \mathcal{L}_i^{\rm NLL} + \lambda_{\rm geom}\, \mathcal{L}_i^L \right) + \mathcal{L}_{qr}.$$
Training proceeds in two stages: QR-only pretraining (30 epochs, learning rate $10^{-5}$), followed by full DEMR training with the QR head frozen. Features are pre-extracted from CLIP (ViT-B/32) and SlowFast (ResNet-50) backbones. RFF blocks use 4 layers ($d_k = D = 512$), with batch size 32 and the AdamW optimizer.
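Combining the helpers sketched in the previous sections gives a composite objective along these lines; $\lambda_{\rm NLL} = 1$ is an assumption, $\lambda_{\rm der}$ and $\lambda_{\rm geom}$ take the swept values reported in Section 5, and the per-boundary bookkeeping is simplified:

```python
def demr_loss(pred, gt, fg_mask, b, gamma, upsilon, alpha, beta, l_qr,
              lam_der=1e-3, lam_nll=1.0, lam_geom=1e-2):
    """Composite DEMR objective, reusing mr_loss, der_nll, and geometric_reg
    from the sketches above; lam_nll=1.0 is an assumption."""
    delta = (b - gamma).abs()        # prediction error Δ_i
    phi = 2.0 * upsilon + alpha      # evidence Φ_i
    l_evid = lam_nll * der_nll(b, gamma, upsilon, alpha, beta) \
             + lam_geom * geometric_reg(delta, phi)
    return (mr_loss(pred, gt, fg_mask)
            + lam_der * (2.0 / b.numel()) * l_evid.sum()
            + l_qr)
```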

5. Empirical Evaluation and Results

DEMR is evaluated on both standard and debiased datasets:

Dataset/Setting        R1@0.5   R1@0.7   mAP    IID–OOD Gap Δ
QVHighlights (val)     65.0     49.4     43.0   –
Charades-STA (val)     60.2     38.0     –      –
TACoS (val)            37.3     19.4     –      –
Charades-CD (i.i.d.)   62.20    43.29    –      –
Charades-CD (o.o.d.)   52.46    30.97    –      9.74 / 12.32

On standard datasets, DEMR's performance matches or exceeds strong baselines, while debiased benchmarks show a lower IID–OOD gap than prior methods, indicating improved robustness to distributional shift. In ablation studies, each architectural component incrementally reduces the modality imbalance $\Delta_{\rm Var}$ and improves retrieval scores. Hyperparameter sweeps indicate best performance at $\lambda_{\rm geom} = 10^{-2}$ and $\lambda_{\rm der} = 10^{-3}$, with QR pretraining optimal at 30–50 epochs. Calibration analysis shows that DEMR yields a monotonic uncertainty-error relationship and higher epistemic uncertainty in under-represented video regions.

6. Algorithmic Workflow and Code Availability

The DEMR pipeline is summarized as follows:

  1. Extract $V$ and $Q$ features with CLIP+SlowFast.
  2. Stage 1 (QR Pretraining): Mask one noun per query, apply RFF, and train the QR head via $\mathcal{L}_{qr}$ for 30 epochs.
  3. Stage 2 (DEMR Training): Freeze the QR head, run RFF, and split into MR and evidential heads. The evidential head predicts the evidential parameters under $\mathcal{L}^{\rm NLL} + \mathcal{L}^L$; the MR head predicts offsets under $\mathcal{L}_{mr}$.
  4. At inference, apply non-maximum suppression at threshold 0.7, rank by retrieval score (optionally preferring low aleatoric uncertainty), and output the top-1 segment; a minimal NMS sketch follows below.
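A minimal greedy 1-D NMS sketch for step 4, assuming `[start, end]` candidates and a temporal-IoU threshold of 0.7:

```python
import torch

def nms_1d(segments, scores, iou_thresh=0.7):
    """Greedy 1-D NMS over (N, 2) [start, end] candidates; keep[0] is the top-1 segment."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0]
        keep.append(i.item())
        if order.numel() == 1:
            break
        rest = order[1:]
        inter = (torch.min(segments[i, 1], segments[rest, 1])
                 - torch.max(segments[i, 0], segments[rest, 0])).clamp(min=0)
        union = (segments[i, 1] - segments[i, 0]) \
                + (segments[rest, 1] - segments[rest, 0]) - inter
        order = rest[inter / (union + 1e-8) <= iou_thresh]  # drop overlapping candidates
    return keep
```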

Implementation code is available at https://github.com/KaijingOfficial/DEMR.

7. Significance and Impact

DEMR addresses two principal shortcomings of evidential approaches to moment retrieval—modality imbalance and structurally unsound regularization—by means of (1) a flip-flop cross-modal fusion mechanism, (2) targeted linguistic pretraining, and (3) an adaptive geometric regularizer. Empirical evaluation confirms significant improvements in effectiveness, calibration, and temporal-semantic robustness, particularly on debiased benchmarks where conventional models exhibit pronounced performance gaps (Huang et al., 30 Nov 2025). A plausible implication is that such debiasing strategies, especially geometric regularization for uncertainty, may generalize to other cross-modal and uncertainty-sensitive retrieval problems.

References (1)

Huang et al. (30 Nov 2025). Debiased Evidential Learning for Moment Retrieval.