Audio-Visual Scene Classification

Updated 18 April 2026

Audio-Visual Scene Classification (AVSC) is a supervised learning task that assigns semantic labels to video clips by integrating signals from audio waveforms and visual frames.
Modern AVSC methods leverage deep learning architectures and advanced fusion strategies to achieve over 94% accuracy on standardized urban scene benchmarks.
Key challenges include achieving precise cross-modal alignment, implementing robust data augmentation, and handling asynchronous or missing modalities in real-world applications.

Audio-Visual Scene Classification (AVSC) is the supervised task of assigning a discrete semantic label (e.g., “airport”, “metro station”, “riot scene”) to an audio-visual input segment, typically a short video clip, by integrating information from both its acoustic (waveform/spectrogram) and visual (frame/image) streams. Modern AVSC research has been driven by large, curated datasets of real-world synchronized audio-video recordings and by the rapid advancement of multimodal deep learning architectures that exploit signals which are often complementary, redundant, or semantically entangled across modalities. Performance improvements hinge on effective multimodal fusion, fine-grained cross-modal alignment, and robust data augmentation. State-of-the-art systems routinely achieve over 94% classification accuracy on standardized scene taxonomies using large pretrained networks and sophisticated fusion methods.

1. Benchmark Datasets and Task Definition

The canonical AVSC benchmark datasets are derived from curated and standardized corpora such as TAU Urban Audio-Visual Scenes 2021 and related urban scene datasets recorded in multiple cities using tightly synchronized professional equipment (Wang et al., 2020). Typical data consist of 10 s video clips with audio sampled at 24–48 kHz and video at 25–30 frames per second, partitioned into training and test splits that keep recording locations, devices, and cities disjoint. Each clip is annotated with a single scene label from a closed, mutually exclusive set (e.g., airport, tram, park, public square, metro station; typically 5–10 classes) (Pham et al., 2021, Apostolidis et al., 2024).

Scene class distribution, recording protocols, and meta-annotations (city, environment, device) are tightly controlled to facilitate reproducible evaluation. Standard evaluation metrics are overall classification accuracy and multiclass cross-entropy (log-loss) at the segment level (Wang et al., 2021, Chen et al., 2022). DCASE Challenge protocols and public leaderboards have driven both data and methodological standardization.
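The two segment-level metrics named above can be illustrated with a minimal NumPy sketch. The posterior matrix, the three-class label set, and the function names are hypothetical and chosen only for illustration; they are not taken from any DCASE evaluation script.

```python
import numpy as np

def accuracy(probs: np.ndarray, labels: np.ndarray) -> float:
    """Fraction of segments whose arg-max class equals the reference label."""
    return float(np.mean(np.argmax(probs, axis=1) == labels))

def log_loss(probs: np.ndarray, labels: np.ndarray, eps: float = 1e-12) -> float:
    """Mean negative log-probability assigned to the reference class."""
    p_true = probs[np.arange(len(labels)), labels]
    # Clip to avoid log(0) when a system outputs a hard zero.
    return float(-np.mean(np.log(np.clip(p_true, eps, 1.0))))

# Toy posteriors for four segments over three hypothetical scene classes.
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],   # misclassified: the reference class is 1
])
labels = np.array([0, 1, 2, 1])

print(f"accuracy = {accuracy(probs, labels):.3f}")
print(f"log-loss = {log_loss(probs, labels):.3f}")
```

Accuracy only looks at the arg-max decision, whereas log-loss also penalizes poorly calibrated confidence, which is one reason the two metrics are usually reported together.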

2. Audio and Visual Feature Engineering

Audio feature extraction begins with waveform resampling and channel fusion (average/difference for binaural), followed by time-frequency representations such as log-mel spectrograms, gammatonegrams, or Morlet wavelet scalograms (Chen et al., 2022). Feature maps are fed into deep architectures: pretrained or scratch-trained CNNs (ResNet, FCNN, VGG, PANNs, Wavegram-Logmel) with channel attention, squeeze-excitation, or 1D-Res-DCNN blocks. Visual features are extracted either from individual frames or frame sequences (sampled at 1–10 Hz) after resizing/cropping, fed into architectures such as VGG, DenseNet, ResNet, EfficientNet, CLIP, or transformer-based backbones (ConvNeXt, ViT) often pretrained on ImageNet/Places365 and optionally fine-tuned on AVSC data (Pham et al., 2021, Wang et al., 2022, Apostolidis et al., 2024).
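As a rough illustration of the audio front end described above, the following sketch computes average/difference channels from a binaural signal and a log-mel spectrogram for each, using only NumPy. The frame length, hop size, number of mel bands, and the synthetic test tone are assumptions made for the example; published systems typically rely on a library implementation and their own parameter choices.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters evenly spaced on the mel scale: (n_mels, n_fft//2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def log_mel(wave, sr, n_fft=1024, hop=512, n_mels=64):
    """Log-mel spectrogram of a mono signal: (n_mels, n_frames)."""
    n_frames = 1 + (len(wave) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = wave[idx] * np.hanning(n_fft)             # windowed frames
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # (n_frames, n_fft//2 + 1)
    mel = mel_filterbank(sr, n_fft, n_mels) @ power.T
    return np.log(mel + 1e-10)                         # small offset avoids log(0)

def binaural_views(stereo):
    """Average and difference channels from a (2, n_samples) array."""
    left, right = stereo
    return 0.5 * (left + right), left - right

# One second of synthetic stereo audio at 48 kHz (illustrative only).
sr = 48_000
t = np.arange(sr) / sr
stereo = np.stack([np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 440 * t + 0.3)])

avg, diff = binaural_views(stereo)
features = np.stack([log_mel(avg, sr), log_mel(diff, sr)])
print(features.shape)  # (channels, mel bands, frames)
```

The stacked array can be treated like a two-channel image and passed to a CNN backbone of the kind listed above.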

Visual features typically outperform audio features for static or visually distinctive scenes, while acoustic features provide critical disambiguation for semantically confusable or acoustically unique environments (e.g., street traffic vs. tram) (Naranjo-Alcazar et al., 2021, Pham et al., 2021). Direct spectrogram or frame-level training with end-to-end fine-tuning on AVSC target data consistently outperforms pure embedding-based transfer or frozen extractors (Wang et al., 2022, Pham et al., 2021).

3. Multimodal Fusion Strategies

Fusion architectures are broadly classified as early (feature-level), mid, late (score-level), or hybrid. Early fusion typically concatenates or jointly attends to audio and visual embeddings before classification, relying on dense or attention-based MLPs (Wang et al., 2022, Chen et al., 2022). Late (score-level) fusion combines unimodal classifier posteriors via averaging, product, or max reduction (Pham et al., 2021, Pham et al., 2021). Hybrid systems combine model-level fusion with ensembling over backbone and fusion variants (Wang et al., 2021).
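The difference between feature-level and score-level fusion can be sketched in a few lines of NumPy. The function names, the toy posteriors, and the embedding sizes below are hypothetical; the three reduction rules follow the averaging, product, and max options mentioned above.

```python
import numpy as np

def early_fusion(audio_emb: np.ndarray, video_emb: np.ndarray) -> np.ndarray:
    """Feature-level fusion: concatenate embeddings before a shared classifier."""
    return np.concatenate([audio_emb, video_emb], axis=1)

def late_fusion(p_audio: np.ndarray, p_video: np.ndarray, rule: str = "mean") -> np.ndarray:
    """Score-level fusion of two (n_clips, n_classes) posterior matrices."""
    stacked = np.stack([p_audio, p_video])
    if rule == "mean":
        fused = stacked.mean(axis=0)
    elif rule == "product":
        fused = stacked.prod(axis=0)
    elif rule == "max":
        fused = stacked.max(axis=0)
    else:
        raise ValueError(f"unknown fusion rule: {rule}")
    # Renormalize so every row is again a probability distribution.
    return fused / fused.sum(axis=1, keepdims=True)

# One clip, three hypothetical classes: audio alone prefers class 0,
# video alone prefers class 1.
p_audio = np.array([[0.5, 0.4, 0.1]])
p_video = np.array([[0.2, 0.7, 0.1]])

for rule in ("mean", "product", "max"):
    print(rule, late_fusion(p_audio, p_video, rule).round(3))

# Early fusion simply widens the feature vector (here 128 + 512 dimensions).
joint = early_fusion(np.zeros((1, 128)), np.zeros((1, 512)))
print(joint.shape)
```

The product rule penalizes classes that either modality considers unlikely, while the mean rule is more tolerant of one overconfident unimodal classifier.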

Recent advances exploit cross-modal contrastive learning and graph-based reasoning. Contrastive Event-Object Alignment (CEOA) explicitly aligns audio event and visual object embeddings via InfoNCE-style or margin-based losses, operating at fine granularity before semantic fusion (Hou et al., 2022). Attentional Graph Convolutional Networks (AGCN) build graphs over salient acoustic or visual regions, using attention to identify semantically relevant nodes, then apply spectral graph convolutions for structure-aware cross-modal reasoning (Zhou et al., 2022).
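To make the InfoNCE-style alignment objective concrete, here is a minimal symmetric contrastive loss over a batch of paired audio and visual embeddings. This is a generic sketch, not the exact CEOA formulation of Hou et al.; the temperature value, the batch construction, and all names are assumptions.

```python
import numpy as np

def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def log_softmax(z: np.ndarray, axis: int) -> np.ndarray:
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def info_nce(audio_emb: np.ndarray, visual_emb: np.ndarray, temperature: float = 0.07) -> float:
    """Symmetric InfoNCE: row i of each matrix is a positive pair, all others negatives."""
    a, v = l2_normalize(audio_emb), l2_normalize(visual_emb)
    logits = a @ v.T / temperature             # cosine similarities, (batch, batch)
    diag = np.arange(len(a))
    loss_a2v = -log_softmax(logits, axis=1)[diag, diag].mean()  # audio -> visual
    loss_v2a = -log_softmax(logits, axis=0)[diag, diag].mean()  # visual -> audio
    return float(0.5 * (loss_a2v + loss_v2a))

# Toy batch: "aligned" visual embeddings are noisy copies of the audio ones,
# "mismatched" embeddings pair each audio clip with the wrong visual clip.
rng = np.random.default_rng(0)
audio = rng.standard_normal((8, 16))
aligned_visual = audio + 0.05 * rng.standard_normal((8, 16))
mismatched_visual = np.roll(aligned_visual, 1, axis=0)

loss_aligned = info_nce(audio, aligned_visual)
loss_mismatched = info_nce(audio, mismatched_visual)
print(f"aligned: {loss_aligned:.4f}  mismatched: {loss_mismatched:.4f}")
```

Minimizing this loss pulls matching audio and visual embeddings together and pushes non-matching pairs in the batch apart, which is the alignment effect the contrastive stage is meant to provide before fusion.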

Multi-head cross-attention modules further enable semantic-based fusion (SF), dynamically weighting and integrating event/object representations across modalities to form robust joint embeddings (Hou et al., 2022). Reported results indicate that early or hybrid fusion with explicit cross-modal modeling outperforms late fusion in most settings, especially for complex, crowded, or ambiguous scenes.
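The cross-attention mechanism can be sketched as multi-head scaled dot-product attention in which audio event tokens act as queries and visual object tokens supply keys and values. The token counts, model width, random projection matrices, and the omission of an output projection are simplifications assumed for this example; it is not the SF module as published.

```python
import numpy as np

def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v, n_heads):
    """Attend from `queries` (n_q, d_model) to `context` (n_ctx, d_model)."""
    n_q, d_model = queries.shape
    assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
    d_head = d_model // n_heads

    def split(x):  # (n, d_model) -> (n_heads, n, d_head)
        return x.reshape(x.shape[0], n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(queries @ w_q), split(context @ w_k), split(context @ w_v)
    # One attention map per head: (n_heads, n_q, n_ctx), rows sum to 1.
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))
    heads = weights @ v                                   # (n_heads, n_q, d_head)
    fused = heads.transpose(1, 0, 2).reshape(n_q, d_model)  # concatenate heads
    return fused, weights

rng = np.random.default_rng(0)
d_model, n_heads = 32, 4
audio_events = rng.standard_normal((5, d_model))    # 5 hypothetical event tokens
visual_objects = rng.standard_normal((7, d_model))  # 7 hypothetical object tokens
w_q, w_k, w_v = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                 for _ in range(3))

fused, weights = cross_attention(audio_events, visual_objects, w_q, w_k, w_v, n_heads)
print(fused.shape, weights.shape)
```

Each row of `weights` shows how strongly one audio event attends to each visual object, which is the kind of map used for the qualitative attention analyses discussed later in the article.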

4. Data Augmentation, Regularization, and Optimization

Data augmentation is essential for generalization and robust multimodal representation. Audio augmentations encompass pitch shifting, speed perturbation, additive Gaussian noise, mixup, and SpecAugment (time-frequency masking) (Wang et al., 2022, Naranjo-Alcazar et al., 2021). Visual augmentations include random crop, horizontal flip, brightness/contrast jitter, and RandAugment (with class-specific policy selection) (Wang et al., 2022). Mixup is extended to joint (audio+visual) mixup, mixing both modalities and soft-targets (Wang et al., 2022, Pham et al., 2021).
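Two of the augmentations above, joint audio-visual mixup and SpecAugment-style masking, can be sketched as follows. The Beta parameter, mask widths, tensor shapes, and the use of a single mixing coefficient per batch are assumptions for illustration; individual papers differ in these details.

```python
import numpy as np

def joint_mixup(audio, video, onehot, alpha, rng):
    """Mix both modalities and the targets with one shared coefficient."""
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(audio))   # partner clip for every clip in the batch
    mixed_audio = lam * audio + (1.0 - lam) * audio[perm]
    mixed_video = lam * video + (1.0 - lam) * video[perm]
    soft_targets = lam * onehot + (1.0 - lam) * onehot[perm]
    return mixed_audio, mixed_video, soft_targets

def spec_augment(spec, max_freq_width, max_time_width, rng):
    """Mask one random frequency band and one random time span of a (freq, time) array.

    Masked cells are set to 0, which assumes the spectrogram was normalized
    to roughly zero mean beforehand.
    """
    out = spec.copy()
    n_freq, n_time = out.shape
    fw = rng.integers(0, max_freq_width + 1)
    f0 = rng.integers(0, n_freq - fw + 1)
    out[f0:f0 + fw, :] = 0.0
    tw = rng.integers(0, max_time_width + 1)
    t0 = rng.integers(0, n_time - tw + 1)
    out[:, t0:t0 + tw] = 0.0
    return out

rng = np.random.default_rng(0)
audio = rng.standard_normal((4, 64, 92))      # batch of log-mel spectrograms
video = rng.standard_normal((4, 3, 32, 32))   # batch of RGB frames
onehot = np.eye(3)[np.array([0, 1, 2, 1])]    # three hypothetical classes

mixed_audio, mixed_video, soft_targets = joint_mixup(audio, video, onehot, 0.4, rng)
masked = spec_augment(audio[0], max_freq_width=8, max_time_width=10, rng=rng)
print(soft_targets.round(2))
print(masked.shape)
```

Using the same coefficient and the same partner clip for both modalities keeps the mixed audio and the mixed video consistent with the shared soft target.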

Loss functions are typically multiclass cross-entropy for hard labels or KL-divergence for mixup-produced soft labels, often with ℓ₂ weight regularization (Pham et al., 2021, Pham et al., 2021). Training protocols employ SGD or AdamW, with learning-rate decay schedules, batch normalization, and heavy dropout between dense or pooling layers.
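For reference, the objectives mentioned above can be written out explicitly. The notation is assumed here rather than taken from the cited papers: $p_\theta(c \mid x_n)$ is the predicted posterior for class $c$ on clip $x_n$, $y_n$ is the hard label, $\tilde{y}_{n,c}$ is the mixup soft target, and $\lambda$ is the weight-decay coefficient.

```latex
% Hard-label multiclass cross-entropy over N clips
\mathcal{L}_{\mathrm{CE}} = -\frac{1}{N} \sum_{n=1}^{N} \log p_\theta(y_n \mid x_n)

% KL divergence between mixup soft targets and predictions over C classes
\mathcal{L}_{\mathrm{KL}} = \frac{1}{N} \sum_{n=1}^{N} \sum_{c=1}^{C}
    \tilde{y}_{n,c} \, \log \frac{\tilde{y}_{n,c}}{p_\theta(c \mid x_n)}

% Either loss combined with an l2 penalty on the weights
\mathcal{L} = \mathcal{L}_{\mathrm{CE}\,/\,\mathrm{KL}} + \lambda \lVert \theta \rVert_2^2
```

Since the entropy of the soft targets does not depend on $\theta$, minimizing $\mathcal{L}_{\mathrm{KL}}$ yields the same gradients as minimizing the soft-label cross-entropy $-\frac{1}{N}\sum_n \sum_c \tilde{y}_{n,c} \log p_\theta(c \mid x_n)$.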

Optimization can be staged: (a) separate unimodal subnetworks trained before fusion, (b) joint fine-tuning of all parameters, (c) freezing pretrained backbones (often for vision) while optimizing the remaining network for cross-modal discrimination (Chen et al., 2022).

5. State of the Art: Architectures and Quantitative Performance

Recent systems leverage large pretrained backbones, explicit multimodal fusion, and strong data augmentation. For example, CEOA+SF (AST+ConvNeXt) achieves 94.1% accuracy on TAU-AVSC without extra data or ensembling (Hou et al., 2022); AGCN attains 91.6% audio-visual scene accuracy with a substantially lower parameter count than competing semantic-branch models (Zhou et al., 2022). Stacked ensemble late fusion (product over VGG and Inception/DenseNet backbones) yields up to 95.7% accuracy on a five-class crowded-scene taxonomy (Pham et al., 2021). Joint optimization of the acoustic encoder and scene classifier (with a frozen visual encoder) delivers up to 94.6% accuracy on DCASE 2021 (Chen et al., 2022). Ablation studies consistently demonstrate that the audio and visual modalities are complementary: fusion reduces error by 2–10% over the best unimodal system, and data augmentation adds a further 0.5–1% absolute improvement.

A representative performance summary from major DCASE AVSC benchmarks:

Method	Fusion Strategy	Accuracy (%)	Params	Reference
AGCN (ResNet+graph)	Early (attn/graph)	91.6	~1/5 of SOTA	(Zhou et al., 2022)
CEOA+SF (AST+ConvNeXt)	Contrastive + cross-attn	94.1	--	(Hou et al., 2022)
Joint opt. (Scalo+EffNetV2)	Early (cat, joint-train)	94.6	--	(Chen et al., 2022)
Ensemble fusion (VGG/Incep/DN)	Late (prod)	95.7	--	(Pham et al., 2021)
Baseline OpenL3	Early (concat)	84.8	13.4M	(Wang et al., 2020)

6. Visualization, Interpretability, and Qualitative Analysis

Graph-based and attentional models allow direct visualization of feature saliency on spectrograms and images. Salient acoustic/visual graph nodes (SAG/SVG) focus tightly on class-discriminative regions, such as high-energy spectral patches or key visual objects, while contextual nodes (CAG/CVG) provide distributed background context (Zhou et al., 2022). Cross-modal attention maps (SF) show which audio events trigger corresponding visual object activations, and vice versa, supporting fine-grained semantic fusion (Hou et al., 2022). Error analysis reveals that unimodal systems tend to confuse visually or acoustically similar scenes, whereas fusion corrects most of these pairwise confusions (e.g., “tram” vs. “bus”, “public square” vs. “pedestrian street”) (Apostolidis et al., 2024).

7. Emerging Directions and Open Challenges

Despite rapid progress, limitations persist. Most systems assume strongly synchronized modalities with no missing streams; robustness to asynchronous or missing streams remains challenging (Chen et al., 2022). Nearly all state-of-the-art models are resource-intensive, and parameter-efficient or transformer-based models for edge deployment remain an open research question (Wang et al., 2021). Dataset diversity is still limited: augmenting with “in the wild”, multi-device, domain-adaptive, or hierarchical scene datasets is needed (Pham et al., 2021). Active research areas include explicit spatio-temporal modeling (3D CNNs, video transformers), semantic knowledge graphs, and adversarial regularization for content verification (e.g., audio-visual discrepancy detection for media forensics) (Apostolidis et al., 2024).

Cross-modal representation learning, deeper semantic alignment, and robust fusion under adversarial or OOD conditions are expected to define the next generation of AVSC systems.