VAAR Dataset Overview

Updated 27 December 2025

VAAR Dataset is a dual-purpose resource comprising retinal vessel segmentation and visual-audio anomaly recognition datasets for distinct research applications.
The retinal VAAR dataset includes 208 high-resolution fundus images with detailed artery/vein annotations, enabling robust vascular analysis and clinical model evaluation.
The audio-visual VAAR dataset offers 3,000 balanced video clips with synchronized audio for benchmarking multimodal anomaly detection in surveillance environments.

The VAAR Dataset refers to two distinct, domain-specific datasets, each named for its intended application but differing radically in content and scope. One is the Vessel ARtery–vein Annotation and Research dataset from retinal imaging, developed for the advancement of automated vascular analysis in ophthalmology (Quiros et al., 19 Dec 2025). The other is the Visual-Audio Anomaly Recognition benchmark, curated for robust multimodal anomaly detection research in real-world video surveillance applications (Ali et al., 15 Oct 2025). Terminological ambiguities arise due to the coincident acronyms; accordingly, this entry presents a precise elaboration of both datasets in their authoritative contexts.

1. Vessel ARtery–vein Annotation and Research (VAAR) Dataset

Dataset Overview

The VAAR (Vessel ARtery–vein Annotation and Research) Dataset is a high-resolution, richly annotated corpus of 208 retinal color fundus images (CFIs), standardized to 1024×1024 pixels and derived from the longitudinal Rotterdam Artery–Vein (RAV) dataset. Sixty percent of the images are optic-disc-centered and 40% are macula-centered, with balanced left/right eye representation. Images originate from eight distinct fundus imaging systems, encompassing a diversity of field-of-view angles (35°, 45°), optical properties, and device manufacturers. The dataset provides three modalities per image: the original RGB fundus image, a contrast-enhanced version, and an RGB-encoded artery/vein/unknown segmentation mask.

Annotation Methodology

A semi-automated annotation workflow is adopted. Initial binary vessel masks, generated by a pre-trained deep network, are subjected to multi-layer manual correction in a custom interface. Graders correct false positives and false negatives, then annotate arteries, veins, and "unknown" vessel segments as independent layers, exploiting freehand drawing tools. At vessel crossings, both the artery and vein layers are labeled, supporting accurate vascular topology. Each connected component in a given mask layer is uniquely colorized for rapid validation, and graders ensure that the resulting vascular graph $G=(V,E)$ for each class comprises a single tree structure per branch ( $|CC(G)|=1$ ). Degree distributions at bifurcation points, $p_d = |{v\in V:\deg(v)=d}|/|V|$ , are used for topological validation. Unclassifiable vessels, typically due to low contrast or imaging artifacts, are consistently segregated into the "unknown" layer.

Ground-Truth Encoding Scheme

Segmentation masks utilize a strict color-encoding convention:

Red channel $(R=255, G=B=0)$ : artery
Blue channel $(B=255, R=G=0)$ : vein
Green channel $(G=255, R=B=0)$ : unknown vessel
Black $(R=G=B=0)$ : background

Pixelwise class assignment is established via: $M(x,y) = \begin{cases} 0, & \text{if } R(x,y)=G(x,y)=B(x,y)=0\, (\text{background}) \ 1, & \text{if } R(x,y)=255\, (\text{artery}) \ 2, & \text{if } B(x,y)=255\, (\text{vein}) \ 3, & \text{if } G(x,y)=255\, (\text{unknown}) \ \end{cases}$

This pixel-class mapping enables direct compatibility with segmentation loss functions and evaluation pipelines.

Image Quality and Heterogeneity

Algorithmic image-quality scores, normalized to $[0,2]$ , span a broad range ( $Q\approx0.4–1.8$ ), capturing considerable variability. High-quality samples ( $Q>1.4$ ) exhibit pronounced vessel boundaries with minimal artifacts. Lower-scoring images ( $Q<0.6$ ), often resulting from analog or OCT fundus capture, present challenges such as vignetting and overexposure but are retained for their clinical representativity. The prevalence of these variable-quality images distinguishes VAAR from prior datasets typically censored for quality. Annotation reliability remains high (mean inter-rater Dice $\approx0.90$ ), but low-quality regions drive greater use of the "unknown" class.

Dataset Structure, Splits, and Licensing

A recommended stratified split comprises $70\%$ of images for training (approximately 145 samples), $15\%$ for validation (31 samples), and $15\%$ for testing (32 samples), with public test data (53 images, CC0-licensed) available for open benchmarking. The dataset is accompanied by extensive metadata (age, device, sex, laterality, centering, quality score) and partition scripts. Licensing permits unrestricted academic use of 53 images under CC0 1.0, with the remaining 155 images accessible via a non-commercial EULA.

Potential Applications and Baselines

VAAR supports:

Semantic artery/vein segmentation
Topological connectivity and tortuosity analysis
Vascular biomarker computation (e.g., diameter, branching angles)
Robustness evaluation across device and quality variations

Human baseline metrics on a subset yield Cohen’s $\kappa_{\text{artery}} = 0.882$ , $\kappa_{\text{vein}} = 0.890$ , and mean Dice scores of $0.906$ (artery), $0.899$ (vein), corresponding to $F1\approx0.90$ for both classes. These values serve as reference points for model evaluation (Quiros et al., 19 Dec 2025).

Property	Details
Total images	208 (53 public, 155 on-request)
Modalities	RGB, contrast-enhanced, AV masks (3-channel PNG)
Image size	1024×1024 px
Segmentation labels	Artery, vein, unknown, background
Licensing	CC0 1.0 (53), EULA (155)

2. Visual-Audio Anomaly Recognition (VAAR) Dataset

Dataset Composition

The Visual-Audio Anomaly Recognition (VAAR) dataset comprises exactly 3,000 video clips with synchronized raw audio, constructed as a medium-scale, real-world benchmark for audio-visual anomaly detection. Each video belongs to one of ten anomaly categories, selected for relevance to public safety and surveillance:

abuse
baby cry
crash
brawling
explosion
intruder
normal
pain
police siren
vandalism

The classes are strictly balanced: $N_i=300$ per class ( $p_i=0.10$ ). Clips are stored in class-specific folders, named following the <class>_XXX.mp4 convention ( $001\leq XXX \leq 300$ ).

Data Acquisition and Annotation Protocol

Approximately 4,000 candidate videos were crawled from diverse public sources (YouTube, Google Video, TikTok, Twitter, Facebook, news). A three-tier annotation pipeline consisted of: (1) team-level pre-filtering and clip selection, (2) temporal segmentation using Bandicut (5–120 s per clip, ensuring event coherence), and (3) cross-annotator validation for label consistency. While no explicit inter-annotator reliability statistics are supplied, the protocol emphasizes maximization of labeling precision and minimization of subjectivity. Fine-grained frame-level labeling is not provided; all labels apply to full clips.

Data Format and Statistics

Video: Predominantly 720×1080 pixels (16:9 aspect), H.264 MP4, frame rates between 24–30 fps.
Audio: 16 kHz, 16-bit mono PCM.
Duration: 5–120 seconds per clip, with mean $\mu_L \approx62.5$ s and standard deviation $\sigma_L \approx34.7$ s (as representative, not exact, statistics).

Benchmark Protocols and Baseline Metrics

Experiments reported for AVAR-Net and comparable baselines employ multiclass classification metrics:

$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$

$\mathrm{Precision} = \frac{TP}{TP + FP}$

$\mathrm{Recall} = \frac{TP}{TP + FN}$

$\mathrm{F1} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$

$\mathrm{AP} = \sum_{n=1}^N(R_n - R_{n-1})P_n$

Evaluation of AVAR-Net on VAAR yields:

Accuracy: $89.29\%$
F1-score: $89.22\%$
Gains from early audio-visual fusion compared to video-only MTCN ( $77.50\%$ ) and audio-only MTCN ( $72.10\%$ )

Use Cases and Limitations

Application domains include surveillance (violence, intrusions), emergency response (explosions, sirens), transportation (crashes), and healthcare (pain signals). Limitations include:

Source bias: ~80% CCTV footage, 20% movies/user content
Annotations at the clip level, without framewise labels
Privacy safeguarded—no deliberate collection of personally identifiable information; all data conforms to fair-use guidelines

Property	Details
Total clips	3,000
Classes	Ten, perfectly balanced ( $n=300$ each)
Video format	720×1080 px, H.264, 24–30 fps
Audio format	16 kHz, 16-bit mono PCM

3. Scientific Impact and Benchmark Positioning

Both VAAR datasets address notable gaps in their respective domains. Retinal VAAR offers the first collection of connectivity-validated, high-resolution A/V segmentation masks over a broad quality spectrum, enabling the development of robust, clinically relevant ML algorithms for vascular quantification and disease screening. The audio-visual VAAR dataset enables systematic benchmarking of multimodal anomaly detection algorithms, with class balance and content diversity tailored to the requirements of real-world event detection and public safety systems.

4. Access, Licensing, and Reproducibility

The retinal VAAR dataset is available via Dataverse (doi:10.34894/9OIMWY). Fifty-three images are released under CC0 1.0; the remainder requires a research EULA. The dataset includes a comprehensive metadata CSV and standardized Python scripts for train/validation/test splits, supporting consistent experimental protocols (Quiros et al., 19 Dec 2025).

The audio-visual VAAR dataset, as published alongside AVAR-Net (Ali et al., 15 Oct 2025), is structured in class-based folders and was curated exclusively from public, fair-use media; privacy and ethical safeguards are observed.

5. Limitations, Biases, and Recommended Usage

For the retinal VAAR, the inclusion of low-quality and highly heterogeneous samples supports model development under realistic deployment conditions but introduces annotation challenges, particularly regarding "unknown" vessel classes and device-dependent artifacts. Users are advised to exploit the recommended stratified split and employ connectivity measures for evaluation.

The audio-visual VAAR dataset is susceptible to source bias due to the predominance of stationary CCTV footage and the lack of frame-level annotation, which may limit the granularity of temporal anomaly detection. Researchers should consider these constraints in domain adaptation or fine-grained event localization studies.

6. Future Directions

Prospects for VAAR datasets extend to:

Multimodal learning (e.g., joint vessel and anomaly detection across domains)
Unsupervised domain adaptation across devices (retinal VAAR) or source types (audio-visual VAAR)
Topologically informed vessel graph analytics
Realistic anomaly detection in unconstrained public environments, with synthetic data augmentation to address domain imbalance and privacy preservation

Both datasets present high-quality, authoritative benchmarks with strong baseline metrics, designed to catalyze further methodological advances in their respective domains (Quiros et al., 19 Dec 2025, Ali et al., 15 Oct 2025).

PDF Markdown Chat (Pro)

References (2)

Rotterdam artery-vein segmentation (RAV) dataset (2025)

AVAR-Net: A Lightweight Audio-Visual Anomaly Recognition Framework with a Benchmark Dataset (2025)

Whiteboard

Generate a whiteboard explanation of this topic.

Follow Topic

Get notified by email when new papers are published related to VAAR Dataset.