DAViHD: Dual-Pathway Audio Encoders for Highlights

Updated 5 February 2026

The paper introduces a dual-pathway audio encoder that separates semantic content from dynamic cues to improve video highlight detection accuracy.
It applies self-attention to each audio stream before fusion (early self-attention) and fuses the streams by element-wise gating, achieving state-of-the-art results on benchmarks such as Mr.HiSum and TVSum.
Ablation analyses show that the dynamics pathway contributes substantially by capturing transient acoustic events, improving overall performance.

Dual-Pathway Audio Encoders for Video Highlight Detection (DAViHD) constitute a framework for automatic identification of salient moments in videos via joint modeling of auditory and visual cues, with a particular focus on extracting underutilized spectro-temporal properties of audio. The core innovation is a dual-pathway audio encoder that disentangles high-level semantic content ("what") from fine-grained audio dynamics ("how") and fuses them through attention-aware mechanisms before integration into an audio–visual highlight detection model. DAViHD achieves new state-of-the-art results on the large-scale Mr.HiSum benchmark, demonstrating the importance of sophisticated, dual-faceted audio representations for reliable highlight localization (Joo et al., 3 Feb 2026).

1. Audio-Visual Highlight Detection Pipeline

The DAViHD pipeline operates on videos segmented into 1-second units at 1 fps. Each segment yields three parallel inputs: a resized video frame, a raw 1-second audio waveform, and a corresponding log-Mel spectrogram (16 kHz, 2048-point FFT, 256-sample hop, 128 Mel bands). These are processed by modality-specific encoders:
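The spectrogram settings above fix the shape of each per-segment input. The following is a small sketch of that arithmetic; the function name `stft_frames` and the centered-padding convention (as used by common audio libraries) are assumptions, not details confirmed by the source.

```python
# Shape arithmetic for the log-Mel front end described above.
SAMPLE_RATE = 16_000   # Hz (from the text)
N_FFT = 2048           # FFT size in samples (from the text)
HOP = 256              # hop length in samples (from the text)
N_MELS = 128           # Mel bands (from the text)


def stft_frames(n_samples: int, hop: int, n_fft: int, center: bool = True) -> int:
    """Number of STFT frames for a signal of n_samples (hypothetical helper)."""
    if center:
        # Assumption: the signal is padded by n_fft // 2 on both sides.
        return 1 + n_samples // hop
    if n_samples < n_fft:
        return 0
    return 1 + (n_samples - n_fft) // hop


segment_samples = SAMPLE_RATE * 1            # one 1-second unit
linear_bins = N_FFT // 2 + 1                 # one-sided spectrum before Mel projection
frames = stft_frames(segment_samples, HOP, N_FFT)
window_ms = 1000 * N_FFT / SAMPLE_RATE       # analysis window length
hop_ms = 1000 * HOP / SAMPLE_RATE            # time step between frames

print("log-Mel shape (F, T):", (N_MELS, frames))
print("linear bins:", linear_bins, "| window ms:", window_ms, "| hop ms:", hop_ms)
```

Under the centered-padding assumption, each 1-second segment gives a 128 x 63 log-Mel matrix, with a 128 ms analysis window advanced in 16 ms steps.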

Visual Encoder ( $E_v$ ): Pre-trained CNN backbone (ResNet-34+3D-CNN on TVSum; Inception-v3 on Mr.HiSum) augmented with a Transformer-based multi-head self-attention module, producing $\mathbf Z_v'\in\mathbb R^{T_f\times D_v}$ .
Dual-Pathway Audio Encoder: Outputs two distinctly computed streams—semantic ( $\mathbf Z_a^s$ ) and dynamics ( $\mathbf Z_a^d$ )—which are refined via self-attention and fused element-wise to form $\mathbf Z_a'$ .
Audio–Audio Fusion: Self-attention is applied independently to both streams (Early-SA), followed by element-wise multiplication (gating) to yield a fused audio representation.
Audio–Visual Fusion and Score Prediction: Bidirectional cross-modal attention enables mutual conditioning between audio and video features with residual connections for unimodal retention. The concatenated features are input to a 3-layer MLP that regresses highlight scores $\hat y_t\in [0,1]$ for each segment, with supervision from an MSE loss,

$\mathcal L_{\rm MSE} = \frac{1}{T}\sum_t (y_t - \hat y_t)^2\,.$
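As a minimal illustration of the loss above, the sketch below evaluates the MSE over a toy sequence of per-segment scores. The function name `mse_loss` and the example values are hypothetical.

```python
def mse_loss(y_true, y_pred):
    """Mean squared error over T segment-level highlight scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


# Toy example: four 1-second segments with scores in [0, 1].
targets = [0.0, 0.5, 1.0, 0.25]
preds = [0.1, 0.5, 0.8, 0.25]
loss = mse_loss(targets, preds)
print(f"MSE = {loss:.4f}")  # (0.01 + 0 + 0.04 + 0) / 4
```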

2. Dual-Pathway Audio Encoder: Composition and Mechanisms

2.1. Semantic Pathway ( $E_a^s$ )

The semantic pathway leverages a pre-trained PANNs backbone (ResNet-style CNNs trained on AudioSet), operating on raw waveforms divided into 1-second non-overlapping chunks. It outputs high-level embeddings $\mathbf Z_a^s\in\mathbb R^{T_f\times D_s}$ representative of content categories (e.g., speech, music, general sound events). No pathway-specific loss is applied; gradients propagate via overall regression supervision.

2.2. Dynamics Pathway ( $E_a^d$ )

This pathway processes log-Mel spectrograms $\mathbf S\in\mathbb R^{F\times T}$ , capturing transient acoustic phenomena and spectro-temporal energy fluctuations:

Multi-branch Attention Mechanisms:
- Temporal attention $\boldsymbol\alpha = \mathrm{softmax}(\mathrm{Conv2D}(\mathbf S))$
- Velocity attention $\boldsymbol\beta = \mathrm{softmax}(\mathrm{Conv2D}(\Delta\mathbf S))$ with $\Delta\mathbf S_t = |\mathbf S_t - \mathbf S_{t-1}|$
- Saliency gate $\mathbf x_s = \mathrm{sigmoid}(\mathrm{Conv2D}(\mathbf S))$
Feature Aggregation:
- Time-aware pooling: $\mathbf f_{\rm TA} = \sum_t (\boldsymbol\alpha \odot \mathbf x_s)$
- Velocity-aware pooling: $\mathbf f_{\rm VA} = \sum_t (\boldsymbol\beta \odot \mathbf x_s)$
- Global context: $\mathbf f_{\rm GC}$ via average pooling
- Combined: $\mathbf f_{\rm combined} = \mathbf f_{\rm TA} + \mathbf f_{\rm VA} + \mathbf f_{\rm GC}$
Frequency-Dynamic Convolution:
- Frequency-adaptive coefficients $\{\gamma_k(f)\}_{k=1}^K$ are predicted for $K$ basis kernels $\{\mathbf W_k\}$ using a 1D Conv-block.
- The dynamic filter is composed as:
$\mathbf Z_a^d = \sum_{k=1}^K \gamma_k \odot (\mathbf W_k * \mathbf S)\,,$

with $*$ denoting convolution and $\odot$ element-wise multiplication.
This composite mechanism offers adaptive sensitivity to salient frequency bands indicative of highlights, such as abrupt instrumental transients or crowd noise.
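The attention and pooling steps listed above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the learned Conv2D layers are replaced by scalar weights (`w_t`, `w_v`, `w_s`), the softmax is taken over the time axis, and the global context is computed as the time average of the saliency map.

```python
import numpy as np


def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def aggregate(S, w_t=1.0, w_v=1.0, w_s=1.0):
    """S: (F, T) log-Mel spectrogram. Scalars w_* stand in for learned Conv2D layers."""
    # Velocity: |S_t - S_{t-1}|, set to zero for the first frame.
    dS = np.abs(np.diff(S, axis=1, prepend=S[:, :1]))
    alpha = softmax(w_t * S, axis=1)          # temporal attention over time
    beta = softmax(w_v * dS, axis=1)          # velocity attention over time
    x_s = sigmoid(w_s * S)                    # saliency gate in (0, 1)
    f_ta = (alpha * x_s).sum(axis=1)          # time-aware pooling
    f_va = (beta * x_s).sum(axis=1)           # velocity-aware pooling
    f_gc = x_s.mean(axis=1)                   # global context (assumed: mean of x_s)
    return f_ta + f_va + f_gc                 # f_combined, one value per frequency band


rng = np.random.default_rng(0)
S = rng.standard_normal((128, 63))            # toy spectrogram, not real audio
f_combined = aggregate(S)
print(f_combined.shape)
```

Because each attention map sums to one over time and the saliency gate lies in (0, 1), every pooled term in this sketch is a weighted average of saliency values per frequency band.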

3. Mathematical Formulations

The key equations of the DAViHD architecture are:

Frequency-Dynamic Convolution:

$\mathbf Z_a^d = \sum_{k=1}^K \gamma_k \odot (\mathbf W_k * \mathbf S)\,,$

where $\gamma_k$ are frequency-dynamic coefficients, $\mathbf W_k$ basis kernels, and $\mathbf S$ the input spectrogram.
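To make the role of the frequency-dependent coefficients concrete, here is a minimal NumPy sketch of the sum above. Several details are assumptions: the basis kernels are random 3x3 filters, the 1D Conv-block that predicts the coefficients is replaced by random logits, the coefficients are normalized with a softmax over the K kernels, and "convolution" is implemented as zero-padded cross-correlation, as is usual in deep learning code.

```python
import numpy as np


def conv2d_same(S, W):
    """Zero-padded 2D cross-correlation of S (F, T) with an odd-sized kernel W."""
    kf, kt = W.shape
    P = np.pad(S, ((kf // 2, kf // 2), (kt // 2, kt // 2)))
    out = np.zeros(S.shape, dtype=float)
    for i in range(kf):
        for j in range(kt):
            out += W[i, j] * P[i:i + S.shape[0], j:j + S.shape[1]]
    return out


def freq_dynamic_conv(S, kernels, gamma):
    """kernels: (K, kf, kt) basis kernels; gamma: (K, F) per-frequency weights."""
    Z = np.zeros(S.shape, dtype=float)
    for W_k, g_k in zip(kernels, gamma):
        # Broadcast gamma_k(f) across the time axis before mixing.
        Z += g_k[:, None] * conv2d_same(S, W_k)
    return Z


rng = np.random.default_rng(0)
F, T, K = 128, 63, 4
S = rng.standard_normal((F, T))                    # toy spectrogram
kernels = 0.1 * rng.standard_normal((K, 3, 3))     # stand-in basis kernels
logits = rng.standard_normal((K, F))               # stand-in for the Conv-block output
gamma = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
Z = freq_dynamic_conv(S, kernels, gamma)
print(Z.shape)
```

The sketch returns a time-frequency map of the same shape as the input. How the actual model reduces this map to the per-segment embedding is not specified here and is left out.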

Audio Pathway Fusion (after self-attention):

$\mathbf Z_a^{\prime s} = \mathrm{Attn}_{\rm self}^s(\mathbf Z_a^s), \quad \mathbf Z_a^{\prime d} = \mathrm{Attn}_{\rm self}^d(\mathbf Z_a^d),$

$\mathbf Z_a' = \mathbf Z_a^{\prime s} \odot \mathbf Z_a^{\prime d}\,.$
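The early self-attention and gating steps can be sketched as below. This is a single-head illustration with random projection matrices, not the trained multi-head module; it also assumes both streams have already been projected to a common width `D`, which the element-wise product requires.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(Z, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over the time axis."""
    Q, K, V = Z @ W_q, Z @ W_k, Z @ W_v
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
    return weights @ V


rng = np.random.default_rng(0)
T_f, D = 6, 8                                   # toy sizes (hypothetical)
Z_s = rng.standard_normal((T_f, D))             # semantic stream
Z_d = rng.standard_normal((T_f, D))             # dynamics stream


def rand_proj():
    return rng.standard_normal((D, D)) / np.sqrt(D)


# Early-SA: each stream has its own attention parameters and is refined independently.
Z_s_ref = self_attention(Z_s, rand_proj(), rand_proj(), rand_proj())
Z_d_ref = self_attention(Z_d, rand_proj(), rand_proj(), rand_proj())

# Gating: element-wise product of the refined streams.
Z_a = Z_s_ref * Z_d_ref
print(Z_a.shape)
```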

Audio–Visual Cross Attention:

$\mathbf Z_{a\to v}' = \mathrm{softmax}\!\left(\frac{Q_v K_a^\top}{\sqrt{d_k}}\right)V_a\,, \quad \mathbf Z_{v\to a}' = \mathrm{softmax}\!\left(\frac{Q_a K_v^\top}{\sqrt{d_k}}\right)V_v$

where $Q_v,K_v,V_v$ (and audio equivalents) are linear projections of visual and auditory streams.
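A minimal single-head sketch of the bidirectional cross-attention follows. The projection matrices are random, both modalities are assumed to share a width `D`, and the residual-plus-concatenation wiring at the end is one plausible reading of the pipeline description rather than a confirmed detail.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(Z_query, Z_context, W_q, W_k, W_v):
    """Queries from one modality attend to keys and values of the other."""
    Q, K, V = Z_query @ W_q, Z_context @ W_k, Z_context @ W_v
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)   # (T_f, T_f)
    return weights @ V, weights


rng = np.random.default_rng(0)
T_f, D = 6, 8                                   # toy sizes (hypothetical)
Z_v = rng.standard_normal((T_f, D))             # visual features
Z_a = rng.standard_normal((T_f, D))             # fused audio features


def rand_proj():
    return rng.standard_normal((D, D)) / np.sqrt(D)


# Z_{a->v}: visual queries, audio keys/values. Z_{v->a}: the reverse.
Z_a2v, attn_v = cross_attention(Z_v, Z_a, rand_proj(), rand_proj(), rand_proj())
Z_v2a, attn_a = cross_attention(Z_a, Z_v, rand_proj(), rand_proj(), rand_proj())

# Assumed wiring: residual connections keep unimodal information, then concatenate.
fused = np.concatenate([Z_v + Z_a2v, Z_a + Z_v2a], axis=-1)     # (T_f, 2 * D)
print(fused.shape)
```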

Together, these equations specify how semantic and dynamic audio cues are computed separately and then fused before cross-modal integration.

4. Training Regimen, Datasets, and Hyperparameters

DAViHD is evaluated on two benchmarks:

Mr.HiSum: 30,656 YouTube videos (mean duration 202 s) with user-driven “most replayed” highlight scores.
TVSum: 50 videos, evaluated with 5-fold cross-validation.
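For the TVSum protocol, a 5-fold split over 50 videos gives 40 training and 10 test videos per fold. The sketch below shows one way to build such splits; the shuffling, seed, and helper name `k_fold_splits` are assumptions, and published work may rely on fixed, predefined splits instead.

```python
import random


def k_fold_splits(n_items, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)           # seeded shuffle for reproducibility
    folds = [indices[i::k] for i in range(k)]      # k disjoint folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_splits(50, k=5))
for fold_id, (train, test) in enumerate(splits):
    print(f"fold {fold_id}: {len(train)} train / {len(test)} test")
```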

Input preprocessing includes:

No explicit audio augmentation
Video frames center-cropped and resized
Audio spectrograms computed directly from the raw waveforms

Optimization specifics:

Adam optimizer; weight decay $1\times 10^{-4}$
Mr.HiSum: 200 epochs, learning rate $1\times10^{-5}$ , batch size 16
TVSum: 400 epochs, learning rate $5\times 10^{-6}$ , batch size 8
Gradient clipping: max-norm 0.5
Frame-level highlight score regression via MSE loss
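The hyperparameters above can be collected into a small configuration, shown here with a plain-Python sketch of gradient clipping. The dictionary layout and the helper `clip_by_global_norm` are hypothetical, and the sketch assumes that "max-norm 0.5" refers to the global L2 norm over all gradients.

```python
import math

# Values taken from the text; the dictionary layout itself is an assumption.
CONFIG = {
    "mr_hisum": {"epochs": 200, "lr": 1e-5, "batch_size": 16},
    "tvsum": {"epochs": 400, "lr": 5e-6, "batch_size": 8},
}
WEIGHT_DECAY = 1e-4
MAX_GRAD_NORM = 0.5


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale a list of gradient vectors so their joint L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = min(1.0, max_norm / (total + eps))     # never scale gradients up
    return [[g * scale for g in vec] for vec in grads], total


grads = [[3.0, 4.0], [0.0]]                        # toy gradients with norm 5
clipped, norm_before = clip_by_global_norm(grads, MAX_GRAD_NORM)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(f"norm before: {norm_before:.3f}, after: {norm_after:.3f}")
```

Gradients whose norm is already below the threshold pass through unchanged, so clipping only affects unusually large updates.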

5. Performance Evaluation and Ablation Analysis

5.1. Quantitative Results

DAViHD establishes new state-of-the-art scores:

Dataset	Method	F1	mAP₅₀	mAP₁₅	ρ (Spearman)	τ (Kendall)
Mr.HiSum	DAViHD	59.73±0.41	67.27±0.52	36.55±0.51	0.299±0.012	0.213±0.009
Mr.HiSum	Prior (UMT)	58.18±0.29	65.81	33.79	0.239	0.174
TVSum	DAViHD	57.67±1.27	63.52±2.58	28.94±3.11	0.200±0.032	0.138±0.022
TVSum	Prior (CSTA)	≈57.32	≈62.36	-	-	-
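The two rank-correlation columns compare the ordering of predicted scores with the ordering of ground-truth scores. The sketch below computes both for a toy sequence; it assumes no tied values (so Kendall's coefficient is the tau-a variant), and the example scores are invented for illustration.

```python
def ranks(values):
    """Rank positions (0 = smallest), assuming no ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def spearman_rho(x, y):
    """Spearman correlation via the rank-difference formula (valid without ties)."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return 2 * s / (n * (n - 1))


ground_truth = [0.1, 0.4, 0.35, 0.8, 0.6]   # toy per-segment scores
predicted = [0.2, 0.3, 0.5, 0.9, 0.4]
rho = spearman_rho(ground_truth, predicted)
tau = kendall_tau(ground_truth, predicted)
print(f"rho = {rho:.2f}, tau = {tau:.2f}")
```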

5.2. Modality Contributions (Mr.HiSum F1 Scores)

Video only: 52.98
Audio semantic pathway ( $A_s$ ) only: 53.25
Audio dynamics pathway ( $A_d$ ) only: 57.53
Video + semantics: 54.79
Video + dynamics: 58.25
Audio only (both pathways): 59.09
Full model (video + both audio): 60.17

This suggests that the dynamics pathway alone (57.53) is more informative than either video (52.98) or semantic audio (53.25) in isolation. Combining both audio streams (59.09) comes within 1.08 F1 points of the full audio-visual model (60.17), underscoring the role of spectro-temporal cues.
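The differences between ablation settings can be read off directly from the scores listed above. The short sketch below reproduces that arithmetic; the dictionary keys are shorthand labels chosen here.

```python
# Mr.HiSum F1 scores from the modality ablation (values from the list above).
f1 = {
    "video": 52.98,
    "audio_semantic": 53.25,
    "audio_dynamics": 57.53,
    "video+semantic": 54.79,
    "video+dynamics": 58.25,
    "audio_both": 59.09,
    "full": 60.17,
}


def gain(setting, baseline):
    """F1 difference between two ablation settings, rounded to two decimals."""
    return round(f1[setting] - f1[baseline], 2)


print("dynamics vs. video alone:  ", gain("audio_dynamics", "video"))
print("adding semantics to video: ", gain("video+semantic", "video"))
print("adding dynamics to video:  ", gain("video+dynamics", "video"))
print("adding video to both audio:", gain("full", "audio_both"))
```

On these numbers, adding the dynamics pathway to video yields a larger gain than adding the semantic pathway.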

5.3. Audio Fusion Ablation (Mr.HiSum F1 Scores)

Late self-attention + concat: 58.71
Late self-attention + multiply: 58.40
Early self-attention + concat: 59.42
Early self-attention + multiply (DAViHD default): 60.17

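Placing self-attention before stream fusion and fusing by gating (element-wise multiplication) gives the highest F1 of the four variants (60.17), 0.75 points above early self-attention with concatenation and 1.77 points above late self-attention with multiplication.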

6. Qualitative Observations and Interpretability

DAViHD’s dynamic audio branch identifies and attends to sharp acoustic transients (e.g., drum hits, applause), generating temporally aligned, high-resolution highlight scores that mirror annotated ground truth. Baseline models, in contrast, produce relatively uniform scores, often missing segment-level granularity. The multi-branch dynamics pathway allows retrospective analysis of which time–frequency regions influenced highlight decisions, supporting interpretability and inspection (Joo et al., 3 Feb 2026).

7. Significance and Context

DAViHD’s explicit division of audio modeling into semantic identification and spectro-temporal dynamics addresses a limitation of prior approaches, which often neglected rich audio characteristics in favor of either high-level content or visual cues. The reported results and ablations support disentangling what a sound is from how it evolves, with early self-attention and gating-based fusion performing best among the variants tested. A plausible implication is that further advances in highlight detection may depend on finer-grained audio modeling and context-aware fusion strategies.

References (1)

Sounding Highlights: Dual-Pathway Audio Encoders for Audio-Visual Video Highlight Detection (2026)
