AVAR-Net: A Lightweight Audio-Visual Anomaly Recognition Framework with a Benchmark Dataset

Published 15 Oct 2025 in cs.CV | (2510.13630v2)

Abstract: Anomaly recognition plays a vital role in surveillance, transportation, healthcare, and public safety. However, most existing approaches rely solely on visual data, making them unreliable under challenging conditions such as occlusion, low illumination, and adverse weather. Moreover, the absence of large-scale synchronized audio-visual datasets has hindered progress in multimodal anomaly recognition. To address these limitations, this study presents AVAR-Net, a lightweight and efficient audio-visual anomaly recognition framework designed for real-world environments. AVAR-Net consists of four main modules: an audio feature extractor, a video feature extractor, a fusion strategy, and a sequential pattern learning network that models cross-modal relationships for anomaly recognition. Specifically, the Wav2Vec2 model extracts robust temporal features from raw audio, while MobileViT captures both local and global visual representations from video frames. An early fusion mechanism combines these modalities, and a Multi-Stage Temporal Convolutional Network (MTCN) learns long-range temporal dependencies within the fused representation, enabling robust spatiotemporal reasoning. A novel Visual-Audio Anomaly Recognition (VAAR) dataset is also introduced, serving as a medium-scale benchmark containing 3,000 real-world videos with synchronized audio across ten diverse anomaly classes. Experimental evaluations demonstrate that AVAR-Net achieves 89.29% accuracy on VAAR and 88.56% Average Precision on the XD-Violence dataset, improving Average Precision by 2.8% over existing state-of-the-art methods. These results highlight the effectiveness, efficiency, and generalization capability of the proposed framework, as well as the utility of VAAR as a benchmark for advancing multimodal anomaly recognition research.

Authors (6)

Summary

The paper introduces AVAR-Net, which effectively fuses audio and visual cues to enhance anomaly detection in challenging conditions.
It employs Wav2Vec2 and MobileViT for robust audio and video feature extraction, coupled with an early fusion strategy for joint representation learning.
Results show 89.29% accuracy on the VAAR dataset and 88.56% AP on the XD-Violence dataset, underscoring its practical value in surveillance applications.

Introduction

The paper "AVAR-Net: A Lightweight Audio-Visual Anomaly Recognition Framework with a Benchmark Dataset" addresses the need for anomaly recognition that stays reliable under non-ideal conditions by introducing a new framework and dataset. Anomaly recognition is critical in domains like surveillance and public safety, especially under conditions such as occlusion or poor lighting. However, existing methods rely heavily on visual data, which can be insufficient in exactly those conditions. The paper presents AVAR-Net, which integrates audio and visual data to improve anomaly recognition, and introduces a new benchmark dataset, VAAR, for this purpose.

AVAR-Net Framework

AVAR-Net's architecture consists of four main modules: an audio feature extractor (Wav2Vec2), a video feature extractor (MobileViT), an early fusion strategy, and a sequential pattern learning network based on a Multi-Stage Temporal Convolutional Network (MTCN).

Audio Feature Extraction: Wav2Vec2 is used to derive robust temporal audio features from raw waveforms. Because it operates directly on the waveform and is pretrained with self-supervision, it is a reasonable choice for extracting meaningful patterns from complex environmental audio.
Video Feature Extraction: MobileViT is used to capture local and global visual representations from video frames. This model balances the efficiency of mobile networks with the modeling capacity of transformers, enabling effective local and global feature extraction.
Fusion Strategy: The early fusion mechanism allows for the integration of audio and visual data at the feature level, promoting the learning of joint representations and enabling better capture of complementary cues.
Temporal Modeling with MTCN: The MTCN enhances the ability to learn long-range temporal dependencies and complex spatiotemporal anomalies by using dilated convolutions and attention modules. This design choice allows for robust sequence modeling and efficient computational performance.
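The fusion and temporal-modeling steps described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the nearest-index alignment, the per-channel convolution, and all function names (align_to_length, early_fusion, dilated_conv1d, receptive_field) are assumptions made for illustration. It only shows the general idea of concatenating audio and video features per time step and then applying a dilated temporal convolution, whose receptive field grows with the dilation rates.

```python
from typing import List

Vector = List[float]
Sequence = List[Vector]


def align_to_length(seq: Sequence, target_len: int) -> Sequence:
    """Resample a feature sequence to target_len steps by nearest-index lookup.

    Assumption: audio and video features arrive at different frame rates and
    must share a time axis before fusion.
    """
    n = len(seq)
    return [seq[min(n - 1, (t * n) // target_len)] for t in range(target_len)]


def early_fusion(audio: Sequence, video: Sequence) -> Sequence:
    """Feature-level fusion: concatenate audio and video vectors per time step."""
    audio_aligned = align_to_length(audio, len(video))
    return [a + v for a, v in zip(audio_aligned, video)]


def dilated_conv1d(seq: Sequence, kernel: List[float], dilation: int) -> Sequence:
    """Per-channel dilated temporal convolution with zero padding.

    The output has the same number of time steps as the input. Each kernel tap
    looks `dilation` steps apart, so larger dilations see longer time spans.
    """
    center = len(kernel) // 2
    out: Sequence = []
    for t in range(len(seq)):
        row = [0.0] * len(seq[0])
        for j, w in enumerate(kernel):
            src = t + (j - center) * dilation
            if 0 <= src < len(seq):  # positions outside the sequence count as zero
                for c, x in enumerate(seq[src]):
                    row[c] += w * x
        out.append(row)
    return out


def receptive_field(kernel_size: int, dilations: List[int]) -> int:
    """Number of input time steps seen by one output after stacking layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)
```

For example, stacking four layers with kernel size 3 and dilations 1, 2, 4, and 8 gives receptive_field(3, [1, 2, 4, 8]) == 31, which is why dilated stacks can cover long temporal contexts with few layers. A real system would use learned multi-channel kernels in a deep learning framework rather than the fixed per-channel kernel shown here.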

VAAR Dataset

The VAAR dataset is the paper's second contribution to multimodal anomaly recognition. It contains 3,000 labeled real-world video clips across ten anomaly classes, each synchronized with its corresponding audio, providing a common platform for developing and evaluating recognition approaches. The dataset is intended to address limitations of existing benchmarks, such as limited modality diversity or overly simple scenarios, and aims for real-world applicability by including varied scenes with rich contextual cues.

Experimental Results

The paper reports quantitative evaluations on both the proposed VAAR dataset and the existing XD-Violence dataset. AVAR-Net achieves 89.29% accuracy on VAAR and 88.56% Average Precision (AP) on XD-Violence, which the authors report as a 2.8% AP improvement over existing state-of-the-art methods. The authors present these results as evidence that the framework generalizes from its own benchmark to an established one.
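For readers unfamiliar with the Average Precision metric quoted above, the following is a minimal sketch of one common way to compute it from anomaly scores and binary labels. The exact evaluation protocol used in the paper is not described in this summary, so the function name average_precision and the ranking-based definition used here are assumptions for illustration only.

```python
from typing import List


def average_precision(scores: List[float], labels: List[int]) -> float:
    """Average Precision as the mean precision at the rank of each positive.

    scores: higher means "more anomalous"; labels: 1 for anomaly, 0 for normal.
    Returns 0.0 when there are no positive labels.
    """
    # Rank items from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / hits if hits else 0.0
```

With scores [0.9, 0.8, 0.7, 0.6] and labels [1, 0, 1, 0], the positives sit at ranks 1 and 3, so the precisions are 1/1 and 2/3 and the AP is their mean, 5/6. A ranking that places every anomaly above every normal sample yields an AP of 1.0.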

Conclusion

The research advances multimodal anomaly recognition and provides evidence that integrating audio and visual data improves model robustness and performance. The VAAR dataset and the AVAR-Net framework together offer resources and methods for future work. Practical implications include more effective surveillance and wider deployment in safety-critical environments. Future research might expand the dataset further and develop adaptive, real-time algorithms to extend the applicability of multimodal recognition systems.
