DAViHD: Dual-Pathway Audio Encoders for Highlights
- The paper introduces a dual-pathway audio encoder that separates semantic content from dynamic cues to improve video highlight detection accuracy.
- It applies self-attention to each audio stream before fusion (early self-attention) and fuses the streams by gating, achieving state-of-the-art results on the Mr.HiSum and TVSum benchmarks.
- Comprehensive ablation analyses demonstrate the dynamic pathway’s significant contribution in capturing transient acoustic events and enhancing overall performance.
Dual-Pathway Audio Encoders for Video Highlight Detection (DAViHD) is a framework for automatically identifying salient moments in videos by jointly modeling auditory and visual cues, with a particular focus on the underutilized spectro-temporal properties of audio. The core innovation is a dual-pathway audio encoder that disentangles high-level semantic content ("what") from fine-grained audio dynamics ("how") and fuses them through attention-aware mechanisms before integration into an audio–visual highlight detection model. DAViHD achieves new state-of-the-art results on the large-scale Mr.HiSum benchmark, demonstrating the value of dual-faceted audio representations for reliable highlight localization (Joo et al., 3 Feb 2026).
1. Audio-Visual Highlight Detection Pipeline
The DAViHD pipeline operates on videos segmented into 1-second units at 1 fps. Each segment yields three parallel inputs: a resized video frame, a raw 1-second audio waveform, and a corresponding log-Mel spectrogram (16 kHz, 2048-point FFT, 256-sample hop, 128 Mel bands). These are processed by modality-specific encoders:
- Visual Encoder: Pre-trained CNN backbone (ResNet-34+3D-CNN on TVSum; Inception-v3 on Mr.HiSum) augmented with a Transformer-based multi-head self-attention module, producing the visual feature sequence.
- Dual-Pathway Audio Encoder: Outputs two separately computed streams, semantic and dynamics, which are refined via self-attention and fused element-wise to form a single audio representation.
- Audio–Audio Fusion: Self-attention is applied independently to both streams (Early-SA), followed by element-wise multiplication (gating) to yield a fused audio representation.
- Audio–Visual Fusion and Score Prediction: Bidirectional cross-modal attention enables mutual conditioning between audio and video features, with residual connections for unimodal retention. The concatenated features are input to a 3-layer MLP that regresses a highlight score for each segment, supervised by an MSE loss.
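The bidirectional cross-modal attention with residual connections described above can be sketched roughly as follows. This is a minimal pure-Python illustration, not the authors' implementation: it assumes scaled dot-product attention, uses identity projections in place of learned query/key/value layers, and the function names `cross_attention` and `fuse_audio_visual` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Attend from `queries` (T x D) to `context` (T x D), then add a residual.

    Assumption: identity projections stand in for learned Q/K/V layers.
    """
    dim = len(queries[0])
    output = []
    for q in queries:
        # Scaled dot-product score of this query against every context segment.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        # Attention-weighted sum of the context vectors.
        attended = [sum(w * c[d] for w, c in zip(weights, context)) for d in range(dim)]
        # Residual connection retains the unimodal feature.
        output.append([qi + ai for qi, ai in zip(q, attended)])
    return output


def fuse_audio_visual(visual, audio):
    """Bidirectional cross-attention followed by per-segment concatenation."""
    visual_ctx = cross_attention(visual, audio)  # video queries attend to audio
    audio_ctx = cross_attention(audio, visual)   # audio queries attend to video
    return [v + a for v, a in zip(visual_ctx, audio_ctx)]


visual = [[1.0, 0.0], [0.0, 1.0]]   # two segments, toy 2-D visual features
audio = [[1.0, 1.0], [1.0, 1.0]]    # two segments, toy 2-D audio features
fused = fuse_audio_visual(visual, audio)
print(len(fused), len(fused[0]))    # segments x concatenated feature size
```

In a full model the concatenated per-segment vectors would then be passed to the 3-layer MLP that regresses highlight scores.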
2. Dual-Pathway Audio Encoder: Composition and Mechanisms
2.1. Semantic Pathway
The semantic pathway leverages a pre-trained PANNs backbone (ResNet-style CNNs trained on AudioSet), operating on raw waveforms divided into 1-second non-overlapping chunks. It outputs high-level embeddings representative of content categories (e.g., speech, music, general sound events). No pathway-specific loss is applied; gradients propagate via overall regression supervision.
2.2. Dynamics Pathway
This pathway processes log-Mel spectrograms, capturing transient acoustic phenomena and spectro-temporal energy fluctuations:
- Multi-branch Attention Mechanisms:
  - Temporal attention
  - Velocity attention
  - Saliency gate
- Feature Aggregation:
  - Time-aware pooling
  - Velocity-aware pooling
  - Global context via average pooling
  - Combined: the pooled descriptors are merged into a single feature
- Frequency-Dynamic Convolution:
  - Frequency-adaptive coefficients $\pi_k$ are predicted for $K$ basis kernels $W_k$ using a 1D Conv-block.
  - The dynamic filter is composed as:
    $$Y = \sum_{k=1}^{K} \pi_k \odot (W_k * X)$$
    with $*$ denoting convolution, $\odot$ element-wise multiplication, and $X$ the input spectrogram.
  - This composite mechanism offers adaptive sensitivity to salient frequency bands indicative of highlights, such as abrupt instrumental transients or crowd noise.
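The feature aggregation step of the dynamics pathway can be illustrated with a small sketch. The exact attention formulas are not reproduced in this text, so everything here is an assumption made for illustration: temporal attention scores each frame by its mean energy, velocity attention scores each frame by its mean absolute change from the previous frame, the saliency gate is omitted, and the three pooled descriptors are combined by concatenation. In the actual model these scores would be learned; `dynamics_pooling` is a hypothetical name.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_pool(frames, weights):
    """Weighted sum over time of per-frame feature vectors."""
    dim = len(frames[0])
    return [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]


def dynamics_pooling(frames):
    """frames: T time steps, each a list of F band energies."""
    steps = len(frames)
    # Temporal attention (assumed form): score each frame by its mean energy.
    energy = [sum(f) / len(f) for f in frames]
    temporal_w = softmax(energy)
    # Velocity attention (assumed form): mean absolute change from the
    # previous frame; the first frame is given zero velocity.
    velocity = [0.0] + [
        sum(abs(a - b) for a, b in zip(frames[t], frames[t - 1])) / len(frames[t])
        for t in range(1, steps)
    ]
    velocity_w = softmax(velocity)
    time_pooled = weighted_pool(frames, temporal_w)
    velocity_pooled = weighted_pool(frames, velocity_w)
    global_context = weighted_pool(frames, [1.0 / steps] * steps)  # average pooling
    # Combining by concatenation is an assumption.
    return time_pooled + velocity_pooled + global_context


# A silent clip with one loud frame: a toy stand-in for a transient.
frames = [[0.0, 0.0], [0.0, 0.0], [4.0, 4.0], [0.0, 0.0]]
pooled = dynamics_pooling(frames)
print(pooled)
```

On this toy input both attention-weighted descriptors are larger than the plain average, which is the intended effect: frames containing or bordering a transient dominate the pooled feature.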
3. Mathematical Formulations
Significant equations structuring the DAViHD architecture include:
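- Frequency-Dynamic Convolution:
  $$Y = \sum_{k=1}^{K} \pi_k \odot (W_k * X)$$
  where $\pi_k$ are frequency-dynamic coefficients, $W_k$ basis kernels, and $X$ the input spectrogram; $*$ denotes convolution and $\odot$ element-wise multiplication.
- Audio Pathway Fusion (after self-attention):
  $$A = \mathrm{SA}(A_{\text{sem}}) \odot \mathrm{SA}(A_{\text{dyn}})$$
  where $A_{\text{sem}}$ and $A_{\text{dyn}}$ are the semantic and dynamics streams and $\mathrm{SA}$ is self-attention applied independently to each stream.
- Audio–Visual Cross Attention, in the standard scaled dot-product form:
  $$\mathrm{Attn}(Q_v, K_a, V_a) = \mathrm{softmax}\!\left(\frac{Q_v K_a^{\top}}{\sqrt{d}}\right) V_a$$
  where $Q_v, K_v, V_v$ (and audio equivalents $Q_a, K_a, V_a$) are linear projections of visual and auditory streams and $d$ is the projection dimension; the audio-to-visual direction swaps the roles of the two modalities.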
Together, these mechanisms implement the separation and subsequent fusion of semantic and dynamic audio cues before cross-modal integration.
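Frequency-dynamic convolution, as described above, weights the responses of several basis kernels with coefficients that vary along the frequency axis. The sketch below is a minimal pure-Python illustration under stated assumptions: the coefficients are passed in as a fixed table rather than predicted by a 1D Conv-block, zero padding keeps the output the same size as the input, and `conv2d_same` and `frequency_dynamic_conv` are hypothetical names.

```python
def conv2d_same(x, kernel):
    """2D cross-correlation with zero padding; kernel sides must be odd.

    x is a spectrogram stored as F rows (frequency) by T columns (time).
    """
    n_freq, n_time = len(x), len(x[0])
    k_f, k_t = len(kernel), len(kernel[0])
    pad_f, pad_t = k_f // 2, k_t // 2
    out = [[0.0] * n_time for _ in range(n_freq)]
    for f in range(n_freq):
        for t in range(n_time):
            acc = 0.0
            for i in range(k_f):
                for j in range(k_t):
                    ff, tt = f + i - pad_f, t + j - pad_t
                    if 0 <= ff < n_freq and 0 <= tt < n_time:
                        acc += kernel[i][j] * x[ff][tt]
            out[f][t] = acc
    return out


def frequency_dynamic_conv(x, basis_kernels, coeffs):
    """Combine basis-kernel responses with per-frequency coefficients.

    coeffs[f][k] is the weight of basis kernel k at frequency bin f
    (given here; predicted from the input in a real model).
    """
    responses = [conv2d_same(x, kernel) for kernel in basis_kernels]
    n_freq, n_time = len(x), len(x[0])
    return [
        [
            sum(coeffs[f][k] * responses[k][f][t] for k in range(len(basis_kernels)))
            for t in range(n_time)
        ]
        for f in range(n_freq)
    ]


identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
doubling = [[0.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 0.0]]
spectrogram = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
# Frequency bin 0 uses only the identity kernel, bin 1 only the doubling kernel.
coeffs = [[1.0, 0.0], [0.0, 1.0]]
result = frequency_dynamic_conv(spectrogram, [identity, doubling], coeffs)
print(result)  # [[1.0, 2.0, 3.0], [8.0, 10.0, 12.0]]
```

The toy coefficients make the effect easy to see: the same input is filtered differently in each frequency bin, which a single shared kernel cannot do.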
4. Training Regimen, Datasets, and Hyperparameters
DAViHD is evaluated on two benchmarks:
- Mr.HiSum: 30,656 YouTube videos (mean duration 202 s) with user-driven “most replayed” highlight scores.
- TVSum: 50 videos, evaluated with 5-fold cross-validation.
Input preprocessing includes:
- No explicit audio augmentation
- Video frames center-cropped and resized
- Audio spectrograms computed directly on waveforms
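As a quick check on input sizes, the sketch below counts the 1-second segments in a clip and the STFT frames per segment for the settings given earlier (16 kHz, 2048-point FFT, 256-sample hop, 128 Mel bands). Whether frames are centered (padded) is not stated in the text, so both conventions are shown; dropping a trailing partial segment is also an assumption, and the function names are hypothetical.

```python
SAMPLE_RATE = 16_000  # Hz
N_FFT = 2048          # FFT size in samples
HOP = 256             # hop length in samples
N_MELS = 128          # Mel bands


def num_stft_frames(num_samples, n_fft=N_FFT, hop=HOP, center=True):
    """Number of STFT frames for a signal of `num_samples` samples."""
    if center:
        # Centered framing pads n_fft // 2 samples on each side.
        return 1 + num_samples // hop
    if num_samples < n_fft:
        return 0
    return 1 + (num_samples - n_fft) // hop


def split_into_segments(num_samples, sample_rate=SAMPLE_RATE):
    """(start, end) sample indices of non-overlapping 1-second chunks.

    Assumption: a trailing partial chunk is dropped.
    """
    return [
        (start, start + sample_rate)
        for start in range(0, num_samples - sample_rate + 1, sample_rate)
    ]


# A clip of 202 s, the mean Mr.HiSum duration quoted above.
segments = split_into_segments(202 * SAMPLE_RATE)
shape_centered = (N_MELS, num_stft_frames(SAMPLE_RATE, center=True))
shape_valid = (N_MELS, num_stft_frames(SAMPLE_RATE, center=False))
print(len(segments), shape_centered, shape_valid)  # 202 (128, 63) (128, 55)
```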
Optimization specifics:
- Adam optimizer; weight decay
- Mr.HiSum: 200 epochs, learning rate , batch size 16
- TVSum: 400 epochs, learning rate , batch size 8
- Gradient clipping: max-norm 0.5
- Frame-level highlight score regression via MSE loss
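The MSE objective and max-norm gradient clipping listed above are both simple to state in code. The following is a framework-free sketch with hypothetical function names; a real training loop would use the optimizer and clipping utilities of its deep learning library.

```python
import math


def mse_loss(predicted, target):
    """Mean squared error between predicted and ground-truth highlight scores."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


def clip_by_global_norm(grads, max_norm=0.5):
    """Rescale a flat gradient list so its L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


loss = mse_loss([0.5, 1.0], [0.0, 1.0])
clipped = clip_by_global_norm([3.0, 4.0])  # norm 5.0 is rescaled to norm 0.5
print(loss, clipped)
```

Clipping preserves the gradient direction and only shrinks its length, so small gradients pass through unchanged.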
5. Performance Evaluation and Ablation Analysis
5.1. Quantitative Results
DAViHD establishes new state-of-the-art scores:
| Dataset | Method | F1 | mAP₅₀ | mAP₁₅ | ρ (Spearman) | τ (Kendall) |
|---|---|---|---|---|---|---|
| Mr.HiSum | DAViHD | 59.73±0.41 | 67.27±0.52 | 36.55±0.51 | 0.299±0.012 | 0.213±0.009 |
| Mr.HiSum | Prior (UMT) | 58.18±0.29 | 65.81 | 33.79 | 0.239 | 0.174 |
| TVSum | DAViHD | 57.67±1.27 | 63.52±2.58 | 28.94±3.11 | 0.200±0.032 | 0.138±0.022 |
| TVSum | Prior (CSTA) | ≈57.32 | ≈62.36 | - | - | - |
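The ρ and τ columns are rank correlations between predicted and ground-truth highlight scores. As a reminder of what they measure, here is a minimal sketch of both statistics in their tie-free forms; how the paper handles ties or aggregates over videos is not stated here, so this is an illustration rather than the evaluation protocol, and the function names are hypothetical.

```python
def ranks(values):
    """Rank positions (1 = smallest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(pred, truth):
    """Spearman's rho via the tie-free rank-difference formula."""
    n = len(pred)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(pred), ranks(truth)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


def kendall_tau(pred, truth):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(pred)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (pred[i] - pred[j]) * (truth[i] - truth[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


predicted = [0.1, 0.4, 0.3, 0.9]  # toy per-segment predictions
reference = [1.0, 2.0, 3.0, 4.0]  # toy ground-truth scores
rho = spearman_rho(predicted, reference)
tau = kendall_tau(predicted, reference)
print(rho, tau)
```

Both statistics depend only on the ordering of segments, not on the score values themselves, which is why they complement F1 and mAP.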
5.2. Modality Contributions (Mr.HiSum F1 Scores)
- Video only: 52.98
- Audio semantic pathway () only: 53.25
- Audio dynamics pathway () only: 57.53
- Video + semantics: 54.79
- Video + dynamics: 58.25
- Audio only (both pathways): 59.09
- Full model (video + both audio): 60.17
The dynamics pathway alone (57.53) is more informative than either video (52.98) or semantic audio (53.25) in isolation. Combining both audio streams (59.09) comes within roughly one F1 point of the full audio-visual model (60.17), underscoring the importance of spectro-temporal cues.
5.3. Audio Fusion Ablation (Mr.HiSum F1 Scores)
- Late self-attention + concat: 58.71
- Late self-attention + multiply: 58.40
- Early self-attention + concat: 59.42
- Early self-attention + multiply (DAViHD default): 60.17
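Placing self-attention before stream fusion and gating with element-wise multiplication gives the best F1 of the four variants. Multiplication helps only in the early configuration; with late self-attention, concatenation is slightly better (58.71 vs. 58.40).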
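The four ablated variants differ only in the order of self-attention and fusion and in the fusion operator. The sketch below makes that difference concrete, reading "late" as self-attention applied after the streams are combined. It is an illustration under assumptions: self-attention is single-head with identity projections and no residual, and `fuse_audio` and `combine` are hypothetical names.

```python
import math


def self_attention(seq):
    """Single-head self-attention over a T x D sequence (identity projections)."""
    dim = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in seq]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v[d] for e, v in zip(exps, seq)) for d in range(dim)])
    return out


def combine(sem, dyn, mode):
    """Fuse two T x D streams per segment by gating or concatenation."""
    if mode == "multiply":
        return [[a * b for a, b in zip(s, d)] for s, d in zip(sem, dyn)]
    if mode == "concat":
        return [s + d for s, d in zip(sem, dyn)]
    raise ValueError(f"unknown fusion mode: {mode}")


def fuse_audio(sem, dyn, early=True, mode="multiply"):
    """Early: attend within each stream, then fuse. Late: fuse, then attend."""
    if early:
        return combine(self_attention(sem), self_attention(dyn), mode)
    return self_attention(combine(sem, dyn, mode))


semantic = [[1.0, 0.0], [0.0, 1.0]]  # toy semantic stream, 2 segments
dynamics = [[0.5, 0.5], [2.0, 0.0]]  # toy dynamics stream, 2 segments
for early in (False, True):
    for mode in ("concat", "multiply"):
        fused = fuse_audio(semantic, dynamics, early=early, mode=mode)
        print("early" if early else "late", mode, len(fused), len(fused[0]))
```

Gating keeps the feature size of a single stream, while concatenation doubles it; in the early setting each stream is contextualized over time before the two interact.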
6. Qualitative Observations and Interpretability
DAViHD’s dynamic audio branch identifies and attends to sharp acoustic transients (e.g., drum hits, applause), generating temporally aligned, high-resolution highlight scores that mirror annotated ground truth. Baseline models, in contrast, produce relatively uniform scores, often missing segment-level granularity. The multi-branch dynamics pathway allows retrospective analysis of which time–frequency regions influenced highlight decisions, supporting interpretability and inspection (Joo et al., 3 Feb 2026).
7. Significance and Context
DAViHD’s explicit division of audio modeling into semantic identification and spectro-temporal dynamics addresses a limitation of prior approaches, which often neglected rich audio characteristics in favor of either high-level content or visual cues. The reported results and ablations support separating what a sound is from how it evolves, with attention- and gating-based fusion proving the most effective of the tested variants for exploiting complementary information. A plausible implication is that further advances in highlight detection may hinge on even finer-grained audio modeling and context-aware fusion strategies.