
AVE Dataset for Audio-Visual Event Localization

Updated 29 December 2025
  • Audio-Visual Event (AVE) dataset is a large-scale, temporally segmented collection that enables detailed analysis of synchronized audio and visual events in natural scenes.
  • The dataset is structured into 10-second YouTube clips, each divided into 1-second intervals with manual and algorithmic labeling across 28 event categories for both supervised and weakly supervised learning.
  • Baseline models, including PSP and attention modules, demonstrate significant improvements in segment-level classification and cross-modal retrieval, underscoring the dataset's impact on multimodal research.

The Audio-Visual Event (AVE) dataset is a large-scale, temporally segmented collection designed for supervised and weakly supervised studies of audiovisual event localization, categorization, and cross-modal retrieval. Sourced from unconstrained YouTube videos and building on the AudioSet ecosystem, the AVE dataset supports rigorous research into the alignment, fusion, and mutual dependence of audio and visual signals in natural scenes (Tian et al., 2018, Zhou et al., 2021).

1. Dataset Construction and Annotation Protocol

The AVE dataset comprises 4,143 publicly available 10-second video clips sampled from YouTube. Each video is algorithmically and manually verified to contain at least one event, drawn from 28 diverse real-world categories, that is both audible and visible in the clip. Event categories encompass speech (e.g., “man speaking,” “woman speaking”), animal vocalizations (“dog barking,” “cat meowing”), musical instruments (“playing guitar,” “violin”), vehicles and alarms (“car horn,” “train,” “ambulance siren”), machinery, nature sounds, and various domestic occurrences (“frying food,” “vacuum cleaner,” “doorbell”).

Temporal segmentation divides every clip into $T=10$ contiguous, non-overlapping 1-second intervals, yielding a total of 41,430 labeled segments. For each segment $t$, annotators assign a label $y_t \in \{1, \ldots, C, \text{bg}\}$, where $C=28$ and “bg” indicates background or non-event. The strict segment-level annotation protocol specifies that for an audio-visual event to be marked present in segment $t$, (a) the target sound must be audible in the audio track and (b) its source must be visually identifiable in the corresponding frames. Event spans and boundaries are manually verified to the nearest second.
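For concreteness, the sketch below shows how such an event span maps onto segment labels, assuming annotations are available as a category index plus integer start/end seconds per clip; the function name and constants are illustrative, not part of the dataset release.

```python
import numpy as np

NUM_CLASSES = 28
BG = NUM_CLASSES          # index 28 denotes "background" / non-event
T = 10                    # 1-second segments per 10-second clip

def segment_labels(category: int, start_sec: int, end_sec: int) -> np.ndarray:
    """Return y_t for t = 0..9: the event class inside [start_sec, end_sec), background elsewhere."""
    y = np.full(T, BG, dtype=np.int64)
    y[start_sec:end_sec] = category
    return y

# Example: an event with hypothetical class index 3, audible and visible from 2 s to 7 s.
print(segment_labels(3, 2, 7))   # [28 28  3  3  3  3  3 28 28 28]
```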

2. Dataset Structure, Statistics, and Splits

Each video provides paired, synchronized audio and visual streams. For all 4,143 videos (41,430 segments), the segment-level ground truth is a one-hot categorical label over the 28 event classes (plus background). Video lengths are uniform (10 seconds), and exactly ten 1-second segments are allocated per video: $S_t = (S_t^v, S_t^a),\ t = 1, \ldots, 10$. The dataset is randomly partitioned by video: approximately 60% (2,490 videos) for training, 20% (827 videos) for validation, and 20% (826 videos) for testing (Tian et al., 2018).

A survey of event occurrence reveals that 66.4% of videos contain an event spanning the full 10 seconds, while all videos contain at least one event of 2 seconds or longer. Class distribution is relatively balanced by curation.
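A minimal sketch of this structure, assuming integer segment labels as constructed above; the one-hot helper and the split code are illustrative (only the 2,490 / 827 / 826 counts come from the official partition).

```python
import numpy as np

def to_one_hot(y: np.ndarray, num_classes: int = 29) -> np.ndarray:
    """(T,) integer segment labels -> (T, 29) one-hot matrix over 28 events + background."""
    return np.eye(num_classes, dtype=np.float32)[y]

# Random 60/20/20 partition by video, matching the 2,490 / 827 / 826 counts.
rng = np.random.default_rng(0)
video_ids = rng.permutation(4143)
train_ids, val_ids, test_ids = np.split(video_ids, [2490, 3317])
print(len(train_ids), len(val_ids), len(test_ids))   # 2490 827 826
```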

3. Localization and Retrieval Task Definitions

The AVE dataset is expressly designed to support three key tasks:

  1. Supervised audio-visual event (AVE) localization: Train on segment-level pairs $(S_t^v, S_t^a)$ and associated labels $y_t$. The objective is segment-wise classification (predicting $y_t$ for each $t$), evaluated as overall segment-level accuracy:

$$\text{Accuracy} = \frac{\#\,\text{correctly labeled segments}}{\#\,\text{segments evaluated}}$$

Cross-entropy is the typical loss.

  2. Weakly supervised AVE localization: Only a video-level label $Y \in \{1, \ldots, C\}$ is provided for training, with no segment boundaries. Models use multiple-instance learning (MIL) or temporal pooling to aggregate segment predictions.
  3. Cross-modality localization (audio–visual segment retrieval): Given a query interval in one modality (audio or visual), the task is to localize the synchronized segment in the other modality using a learned distance function $D_\theta$. (A sketch of all three objectives appears after this list.)
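Below is a hedged sketch of how these three objectives are commonly instantiated in PyTorch; the audio-visual network producing per-segment logits and embeddings is assumed rather than shown, and the mean pooling and Euclidean distance are generic stand-ins rather than the specific published formulations.

```python
import torch
import torch.nn.functional as F

B, T, C = 8, 10, 29            # batch size, segments per video, 28 events + background

# 1) Supervised localization: cross-entropy on every 1-second segment.
segment_logits = torch.randn(B, T, C)          # per-segment outputs of some audio-visual model
segment_labels = torch.randint(0, C, (B, T))   # ground-truth y_t per segment
sup_loss = F.cross_entropy(segment_logits.reshape(B * T, C),
                           segment_labels.reshape(B * T))

# 2) Weakly supervised localization: aggregate segment scores into a video-level
#    prediction (MIL-style mean pooling here; max pooling or a learned temporal
#    weighting branch are common alternatives) and supervise with the video label Y.
video_logits = segment_logits.mean(dim=1)      # (B, C)
video_labels = torch.randint(0, C, (B,))       # ground-truth Y per video
weak_loss = F.cross_entropy(video_logits, video_labels)

# 3) Cross-modality localization: score candidate segments in the other modality
#    with a distance D_theta; plain Euclidean distance on embeddings stands in here.
audio_query = torch.randn(B, 128)              # embedding of a query audio segment
visual_cands = torch.randn(B, T, 128)          # embeddings of the 10 visual segments
dist = torch.cdist(audio_query.unsqueeze(1), visual_cands).squeeze(1)   # (B, T)
predicted_segment = dist.argmin(dim=1)         # index of the closest visual segment
print(sup_loss.item(), weak_loss.item(), predicted_segment.shape)
```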

4. Baseline Models and Fusion Architectures

Feature representation leverages pre-trained networks: VGG-19 for visual features (512×7×7 pool5 maps per segment) and a VGG-style audio network that maps log-mel spectrograms to 128-dimensional embeddings per segment. Both modalities pass through temporal modeling layers, such as LSTMs or BiLSTMs.
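A minimal sketch of these feature shapes and a BiLSTM temporal encoder follows; hidden sizes and the average pooling over the spatial grid are illustrative choices, not taken verbatim from any particular baseline.

```python
import torch
import torch.nn as nn

B, T = 4, 10
visual = torch.randn(B, T, 512, 7, 7)   # VGG-19 pool5 map for each 1-second segment
audio = torch.randn(B, T, 128)          # 128-d audio embedding per segment

# Collapse the 7x7 spatial grid to a 512-d vector per segment via average pooling
# (the audio-guided attention baseline replaces this with a learned weighted sum).
visual_vec = visual.mean(dim=(-1, -2))  # (B, T, 512)

# Separate BiLSTMs capture temporal context within each modality; sizes are illustrative.
v_lstm = nn.LSTM(512, 128, batch_first=True, bidirectional=True)
a_lstm = nn.LSTM(128, 128, batch_first=True, bidirectional=True)
h_v, _ = v_lstm(visual_vec)             # (B, T, 256) hidden states h_t^v
h_a, _ = a_lstm(audio)                  # (B, T, 256) hidden states h_t^a
print(h_v.shape, h_a.shape)
```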

The field has advanced through increasingly sophisticated fusion and alignment modules:

  • Audio-guided visual attention: audio features modulate spatial softmax distributions over the visual feature maps $V_t$, producing $v_t^{\text{att}}$ as a weighted sum.
  • Dual multimodal residual network (DMRN): hidden states $h_t^v, h_t^a$ from the LSTM encoders are fused through small MLPs with residual connections; the fused state $h_t^*$ is used for final classification.
  • Positive Sample Propagation (PSP; Zhou et al., 2021): all-pair similarity maps $\beta_{ij}^{va}$ are constructed between every audio and visual segment pair. After strict thresholding (with empirically stable $\tau \in [0, 0.115]$), only positive links propagate features bidirectionally across modalities, with learned weights $W_1^{(\cdot)}, W_2^{(\cdot)}$. This mechanism isolates mutually supportive segment pairs and suppresses asynchronous or background noise (a simplified sketch follows this list). The approach is further enhanced by:
    • An audio-visual pair similarity loss $\mathcal{L}_{avps}$ (mean squared error between the normalized affinity and the ground-truth event mask) in the fully supervised regime;
    • A temporal weighting branch in weakly supervised settings that highlights high-confidence segments during MIL pooling.
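The sketch below illustrates the thresholded, renormalized cross-modal propagation idea behind PSP; the projection sizes, scaling, and normalization are simplifications for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def psp_propagate(h_v, h_a, W_v, W_a, tau=0.1):
    """h_v, h_a: (B, T, D) per-segment features; W_v, W_a: learned linear projections."""
    # All-pair audio-to-visual affinities between the T audio and T visual segments.
    sim = torch.matmul(W_a(h_a), W_v(h_v).transpose(1, 2))            # (B, T, T)
    # Normalize, then prune links whose weight falls below the threshold tau.
    beta = F.relu(F.softmax(sim / h_v.shape[-1] ** 0.5, dim=-1) - tau)
    beta = beta / beta.sum(dim=-1, keepdim=True).clamp_min(1e-6)       # renormalize survivors
    # Propagate features only along the remaining (positive) links, in both directions.
    h_a_new = h_a + torch.matmul(beta, h_v)                  # visual evidence -> audio stream
    h_v_new = h_v + torch.matmul(beta.transpose(1, 2), h_a)  # audio evidence -> visual stream
    return h_v_new, h_a_new

# Usage with BiLSTM states of size D = 256 (cf. the temporal encoder sketched earlier).
W_v, W_a = nn.Linear(256, 64), nn.Linear(256, 64)
h_v, h_a = torch.randn(2, 10, 256), torch.randn(2, 10, 256)
v_out, a_out = psp_propagate(h_v, h_a, W_v, W_a)
print(v_out.shape, a_out.shape)   # torch.Size([2, 10, 256]) twice
```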

5. Evaluation Metrics and Benchmark Results

The canonical metric is segment-level classification accuracy. Supervised and weakly supervised settings both employ overall percentage correct on the test set. For retrieval tasks, exact-match accuracy (percentage of queries with perfectly localized counterpart segments) is used.
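As a concrete reading of the two metrics (a small sketch with synthetic predictions; shapes and helper names are illustrative):

```python
import numpy as np

def segment_accuracy(pred, gt):
    """pred, gt: (num_videos, 10) integer labels, one per 1-second segment."""
    return (np.asarray(pred) == np.asarray(gt)).mean()

def exact_match_accuracy(pred_idx, gt_idx):
    """Retrieval counts as correct only when the localized segment index matches exactly."""
    return (np.asarray(pred_idx) == np.asarray(gt_idx)).mean()

# Example: 3 videos x 10 segments with a single mislabeled segment.
rng = np.random.default_rng(0)
gt = rng.integers(0, 29, (3, 10))
pred = gt.copy()
pred[0, 0] = (gt[0, 0] + 1) % 29
print(segment_accuracy(pred, gt))                   # 29/30 ~= 0.967
print(exact_match_accuracy([2, 5, 7], [2, 4, 7]))   # 2/3 ~= 0.667
```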

Reported baselines for AVE (Tian et al., 2018, Zhou et al., 2021) are as follows:

| Method | Sup. Acc. (%) | Weak-Sup. Acc. (%) |
|---|---|---|
| Audio only | 59.5 | 53.4 |
| Visual only | 55.3 | 52.9 |
| AV + Att. | 72.7 | 66.7 |
| DMRN (fused) | 73.1 | – |
| Positive Sample Propagation (PSP) | 77.8 | 73.5 |
| AVEL (prior SOTA) | 68.6 | 66.7 |

PSP achieves a +4.1% supervised improvement over “w/o PSP” and +9.2% over AVEL; the weakly supervised gain is +3.3% over “w/o PSP” and +6.8% over AVEL. PSP matches or surpasses contemporaneous methods (AVT, CMRA, CMAN) on both evaluation protocols. Ablation shows marked drops in accuracy if PSP is removed or altered, and the weighting branch and $\mathcal{L}_{avps}$ each provide significant, additive improvements (Zhou et al., 2021).

For cross-modality localization, a contrastive-network scheme reaches 44.8% accuracy (audio-to-visual), compared to 34.8% with DCCA (Tian et al., 2018).

6. Methodological Insights and Analysis

Empirical analysis confirms that models leveraging joint audio-visual representations consistently outperform unimodal baselines, by roughly 13–18 points of supervised segment accuracy in the table above. Attention modules and thresholded similarity propagation (PSP) both serve to localize sounding objects and temporally align modalities. PSP's hard thresholding and sparse affinity matrices ($\gamma^{va}, \gamma^{av}$) restrict feature propagation to strongly correlated segment pairs and yield more discriminative clustering, as visualized by t-SNE. Removing these modules, or replacing them with simpler self-attention, degrades accuracy by several absolute percentage points (Zhou et al., 2021).

Qualitatively, model-generated segment attention aligns with event sources (e.g., “frying food” is associated with both the audio sizzle and frames containing pan/food rather than confounding kitchen activity), and feature representations form tighter clusters per event class.

7. Limitations and Impact

The AVE dataset, by design, requires models to resolve two core challenges: (1) temporal localization of events at only 1 s granularity across both modalities, and (2) robust handling of modest temporal asynchrony, noisy or non-informative background, and multi-class event boundaries. A noted limitation is that PSP's hard pruning may exclude weak-but-real cross-modal cues; the need to tune the threshold parameter $\tau$ also introduces a modest degree of hyperparameter sensitivity. A further observation is that segment-level accuracy under weak supervision trails full supervision by roughly 4.3 percentage points, signifying an open problem in fine-grained event discovery (Zhou et al., 2021).

The AVE dataset remains the benchmark of choice for large-scale, segment-level audio-visual event localization, and its design—balanced event classes, fine temporal granularity, and explicit segment annotation—enables direct comparison of cross-modal learning mechanisms, attention modules, and weak supervision strategies.

References

  • Tian, Y., et al. “Audio-Visual Event Localization in Unconstrained Videos.” ECCV 2018.
  • Zhou, J., et al. “Positive Sample Propagation along the Audio-Visual Event Line.” CVPR 2021.
