VERSA Toolkit: Audio & Video Evaluation
- VERSA names two distinct toolkits: a Python-based framework for configurable evaluation of audio signals and a Prolog-based engine for video event detection in surveillance.
- The audio toolkit supports over 65 metrics, while the video toolkit uses declarative spatial-temporal logic, enabling robust, reproducible assessment in research and surveillance settings.
- Both expose extensible architectures that support rapid prototyping, integration into diverse pipelines, and scalable benchmarking in their respective domains.
VERSA is a recurring acronym for toolkit systems in both audio and video domains. While sharing the aim of streamlining complex evaluation or recognition tasks, these toolkits diverge strongly in technical approach, target domain, and architectural substrate. The two most influential VERSA systems are (1) the “Versatile Evaluation Toolkit for Speech, Audio, and Music”—a Python-based meta-metrics framework for unified audio and music evaluation—and (2) the “Video Event Recognition for Surveillance Applications”—a Prolog-based declarative logic engine for video event detection in surveillance. Both are widely referenced in benchmarks and challenge-winning model pipelines (Shi et al., 2024, Yamamoto et al., 5 Dec 2025, O'Hara, 2010).
1. Architectural Overview
Audio/Music Evaluation: VERSA (2024–2025)
The audio/music VERSA toolkit is implemented as a Python package (versa) targeting the rapid, reproducible computation of a wide variety of objective metrics on speech, audio, and music signals. Its central features include a unified YAML configuration interface, strict but extensible dependency management, and coverage of over 65 metrics with 729+ supported variations through backend and model selection. By abstracting metric configuration from programmatic execution, it enables users to benchmark systems in speech coding, TTS, speech enhancement, singing synthesis, and music generation using a single unified framework (Shi et al., 2024).
Video Event Recognition: VERSA (2010)
The surveillance VERSA toolkit is architected as a logic-programming engine that ingests XML or database annotations from low-level detection modules and reasons about high-level event semantics using declarative spatial and temporal logic. Its Prolog-based core separates spatial-primitive computation (e.g., entity bounding boxes, centroids, track IDs) from symbolic event templates and interval-set temporal predicates, yielding a service-oriented, extensible system that integrates easily into heterogeneous surveillance infrastructures (O'Hara, 2010).
2. Supported Metrics and Predicates
Audio/Music Variant: Metric Classes
VERSA groups evaluation metrics into four main categories (Shi et al., 2024):
- Independent metrics: No reference audio required. Ex: DNSMOS P.835, NISQA, UTMOS, Torch-Squim scores.
- Dependent metrics: Require matching reference audio. Ex: SDR, PESQ, STOI, SI-SNR, MCD, F0-RMSE.
- Non-matching reference metrics: Ex: NORESQA, NOMAD, alignment metrics, Whisper-WER, CLAP-Score.
- Distributional metrics: Dataset-level statistics. Ex: Fréchet Audio Distance (FAD), KID, coverage, KLD.
The audio VERSA exposes all metrics through Python APIs (versa.models.<metric>.predict) or CLI, encodes configuration in human-editable YAML, and supports parameter tuning (e.g., backend model selection, feature windowing) directly in the config.
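A hedged example of the per-utterance API follows; the module names are placeholders inferred from the versa.models.<metric>.predict pattern above, not verified import paths.

```python
# Hypothetical usage following the versa.models.<metric>.predict pattern described above.
from versa.models import dnsmos, pesq  # placeholder metric modules

# Independent metric: only the system output is needed.
mos = dnsmos.predict("outputs/utt0001.wav")

# Dependent metric: a matching reference recording is required.
quality = pesq.predict("outputs/utt0001.wav", reference="refs/utt0001.wav")

print({"dnsmos": mos, "pesq": quality})
```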
Video Variant: Annotations and Logic Predicates
The surveillance VERSA formalizes supervised event recognition as logic inference over “ground facts,” spatial predicates, and interval-set temporal operators (O'Hara, 2010):
- Ground facts per frame: exists, loc, bounds, orient, type.
- Spatial predicates: near, overlapping, inside, outside, above, leftOf, with definitions via bounding-box geometry (a geometric sketch follows below).
- Temporal predicates: Allen's interval algebra (e.g., before, during, meets, overlaps).
- Templates: frame templates (entity and relation constraints per frame) and event templates (multi-frame, temporal constraints).
The Prolog logic enables concise event definitions, with predicates efficiently cached at parse-time for lookups during query.
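The Prolog predicate bodies are not reproduced here; the following Python sketch only illustrates the kind of bounding-box geometry the spatial predicates above encode. Predicate names follow the list above, while the Box type and the pixel tolerance are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Box:
    """Axis-aligned bounding box in pixel coordinates: (x1, y1) top-left, (x2, y2) bottom-right."""
    x1: float
    y1: float
    x2: float
    y2: float

    @property
    def centroid(self):
        return ((self.x1 + self.x2) / 2, (self.y1 + self.y2) / 2)

def near(a: Box, b: Box, tol: float = 50.0) -> bool:
    """Centroids lie within a configurable pixel tolerance."""
    (ax, ay), (bx, by) = a.centroid, b.centroid
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5 <= tol

def overlapping(a: Box, b: Box) -> bool:
    """Bounding boxes intersect on both axes."""
    return a.x1 < b.x2 and b.x1 < a.x2 and a.y1 < b.y2 and b.y1 < a.y2

def inside(a: Box, b: Box) -> bool:
    """Box a is entirely contained within box b."""
    return a.x1 >= b.x1 and a.y1 >= b.y1 and a.x2 <= b.x2 and a.y2 <= b.y2
```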
3. Workflow and Configuration
Audio/Music: Batch and Programmatic Scoring
- Preparation: Audio file lists per system, ground truth/reference files as required.
- Metric configuration: YAML files specify metrics, model backends, thresholding, and aggregation.
- Execution: Metric computation via the CLI (scorer.py) or the Python API (Scorer.score); a minimal configuration-and-invocation sketch follows this list. Output is per-utterance, per-metric JSON or CSV, suitable for tabular aggregation.
- Extensibility: New metrics are implemented as subclasses of versa.metrics.base.Metric. Optional dependencies and models are handled through extras scripts, and a resource cache downloads and reuses external assets.
- Large-scale evaluation: SLURM templates and result aggregation utilities are provided for computational scaling and report generation (Shi et al., 2024).
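A minimal sketch of this batch workflow, assuming illustrative YAML keys and CLI flags; the actual config schema and scorer.py options may differ.

```python
from pathlib import Path

# Illustrative metric configuration; real VERSA YAML keys may differ.
config = """\
- name: dnsmos        # independent (reference-free) MOS predictor
- name: pesq          # dependent metric; requires matching reference audio
- name: stoi
"""
Path("speech_metrics.yaml").write_text(config)

# CLI route (flag names are assumptions):
#   python scorer.py --score_config speech_metrics.yaml \
#       --pred pred_wavs.scp --gt ref_wavs.scp --output_file results.jsonl

# Programmatic route, mirroring the Scorer.score API named above (import path assumed):
# from versa import Scorer
# results = Scorer("speech_metrics.yaml").score(pred="pred_wavs.scp", gt="ref_wavs.scp")
```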
Example (AudioMOS Challenge 2025: T12 System)
The T12 system used a subset of VERSA’s reference-free metrics (DNSMOS, NISQA, UTMOSv2, Torch-Squim, PAM, baseline Audiobox-Aesthetics), stacking their normalized scores into a 28-D feature vector per utterance. These were standardized (z-score normalization), with clipping per-metric as specified by each metric’s paper. Features were fed to an XGBoost regressor with hyperparameters optimized by Optuna and validated via 10-fold cross-validation, predicting axes of audio aesthetics. Ablation revealed that while the VERSA-only model slightly trailed KAN models in SRCC, ensemble stacking yielded improved SRCC and lower MSE on some axes, confirming complementary error structure (Yamamoto et al., 5 Dec 2025).
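The following sketch reconstructs the shape of that pipeline (stacked reference-free scores, per-metric clipping and z-scoring, XGBoost with 10-fold validation) on synthetic stand-in data; feature values, clipping ranges, and hyperparameters are placeholders, and the Optuna search is omitted.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from scipy.stats import spearmanr
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(500, 28))   # stand-in for 28 stacked VERSA metric scores per utterance
y = rng.normal(size=500)             # stand-in for one audio-aesthetics axis (MOS-like target)

# Per-metric clipping (ranges would follow each metric's paper), then z-score normalization.
X = np.clip(X_raw, -3.0, 3.0)
X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)

srcc_folds, mse_folds = [], []
for train_idx, val_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model = XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.05)
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[val_idx])
    srcc_folds.append(spearmanr(pred, y[val_idx]).correlation)
    mse_folds.append(mean_squared_error(y[val_idx], pred))

print(f"SRCC={np.mean(srcc_folds):.3f}  MSE={np.mean(mse_folds):.3f}")
```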
Video: Declarative Event Query and Monitoring
- Integration: Real-time consumption of XML (e.g., CVML) streams from detection engines, with ground fact assertion.
- Configuration: GUI and scripting interfaces for defining templates (drawing entity arrangements, temporal structure); batch loading via REST or CLI.
- Detection: Periodic query evaluation for all active event templates; efficient interval-set algebra supports persistent and transient event semantics, with support for smoothing and tolerance (e.g., 1D morphological closing to fill detection gaps, sketched after this list).
- Customization: Spatial and temporal predicates are extendable; fuzzy versions replace crisp thresholds for soft matching; Prolog extensibility integrates with fuzzy logic and constraint packages in SWI-Prolog.
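A minimal illustration of that gap-filling step, assuming a per-frame boolean detection mask and an illustrative window length:

```python
import numpy as np
from scipy.ndimage import binary_closing

# Per-frame mask: True where an event template matched that frame.
detections = np.array([0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0], dtype=bool)

# 1-D morphological closing with a 3-frame structuring element fills
# single-frame detector dropouts while preserving longer genuine gaps.
smoothed = binary_closing(detections, structure=np.ones(3, dtype=bool))

print(smoothed.astype(int))  # [0 1 1 1 1 1 1 0 0 0 1 1 0]
```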
4. Mathematical Formalism and Metric Definitions
Audio/Music Metric Examples
The toolkit specifies many widely used mathematical forms:
- Signal-to-Distortion Ratio (SDR): $\mathrm{SDR}(s,\hat{s}) = 10\log_{10}\dfrac{\lVert s\rVert^2}{\lVert s-\hat{s}\rVert^2}$, with reference $s$ and estimate $\hat{s}$.
- Scale-Invariant Signal-to-Noise Ratio (SI-SNR): with zero-mean $s$ and $\hat{s}$, $s_{\text{target}} = \dfrac{\langle \hat{s}, s\rangle}{\lVert s\rVert^2}\, s$ and $\mathrm{SI\text{-}SNR} = 10\log_{10}\dfrac{\lVert s_{\text{target}}\rVert^2}{\lVert \hat{s}-s_{\text{target}}\rVert^2}$.
- Perceptual Evaluation of Speech Quality (PESQ): Defined per ITU-T P.862 as a non-linear combination of time-aligned reference/distortion, auditory filterbanks, and mapped to MOS.
The toolkit adheres to metric-specific best practices—score clipping, normalization, backend selection—encoded directly in the APIs or YAML config (Shi et al., 2024, Yamamoto et al., 5 Dec 2025).
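The SDR and SI-SNR definitions above translate directly into a short NumPy sketch (not the toolkit's own implementation):

```python
import numpy as np

def sdr(ref: np.ndarray, est: np.ndarray) -> float:
    """SDR in dB between reference s and estimate s_hat."""
    return 10 * np.log10(np.sum(ref ** 2) / np.sum((ref - est) ** 2))

def si_snr(ref: np.ndarray, est: np.ndarray) -> float:
    """Scale-invariant SNR in dB: project the estimate onto the zero-mean reference."""
    ref = ref - ref.mean()
    est = est - est.mean()
    s_target = (np.dot(est, ref) / np.dot(ref, ref)) * ref
    return 10 * np.log10(np.sum(s_target ** 2) / np.sum((est - s_target) ** 2))

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
degraded = clean + 0.1 * rng.standard_normal(16000)
print(f"SDR = {sdr(clean, degraded):.1f} dB, SI-SNR = {si_snr(clean, degraded):.1f} dB")
```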
Video Event Template Formalism
- Spatial: Entity-centric, pixel-coordinate predicates with configurable tolerances.
- Temporal: Discrete intervals, with interval algebra supporting all Allen relations; interval sets and Prolog predicates enable efficient multi-entity, multi-span event formulations.
By allowing fuzzy logic and parameterized match scores, the VERSA prototype supports both binary and graded event recognition (O'Hara, 2010).
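As an illustration of the interval formalism, a few Allen relations over discrete frame intervals can be written as simple predicates; the exact discretization used by the VERSA prototype may differ.

```python
# Intervals are (start_frame, end_frame) pairs with start <= end, endpoints inclusive.
def before(a, b):
    return a[1] < b[0]

def meets(a, b):
    # In discrete frames, "a meets b" is taken here as b starting on the frame after a ends.
    return a[1] + 1 == b[0]

def overlaps(a, b):
    # a starts first, b starts while a is still active, and b outlasts a.
    return a[0] < b[0] <= a[1] < b[1]

def during(a, b):
    return b[0] < a[0] and a[1] < b[1]

# Example: a loitering interval occurring during a longer "zone occupied" interval.
print(during((120, 310), (100, 400)))  # True
```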
5. Applications and Benchmark Impact
Audio/Music
- Neural audio coding: SDR, PESQ, UTMOS, Speaker-Sim scripts allow side-by-side codec evaluation, aggregating over standard or custom-defined metrics.
- Speech synthesis and enhancement: MOS predictors (UTMOS, PLCMOS), intelligibility metrics (STOI, Whisper-WER), and both reference-based and reference-free metrics facilitate subjective and objective system comparisons.
- Singing synthesis and music generation: Combines speech and music domain metrics; supports both prompt-to-audio (e.g., PAM, CLAP) and distributional metrics (FAD, KID), relevant for generative and large-scale evaluation.
- Benchmarks and competitions: Used in baselines and top-performing submissions, e.g., the T12 system for AudioMOS Challenge 2025 integrated 28-D VERSA features with tree ensemble regressors, yielding state-of-the-art SRCC (Yamamoto et al., 5 Dec 2025).
Video Surveillance
- Event detectors: Left-behind items, loitering, and other behaviors via declarative multi-frame templates, supporting uncertainty and scene-specific customizations.
- Interoperability: XML-SOA model enables deployment across analytics engines and heterogeneous camera networks.
- Extensibility: Users define new events by sketching spatial/temporal constraints, with the Prolog back-end supporting rapid prototyping and fuzzy extensions (O'Hara, 2010).
6. Extensibility and Best Practices
Audio/Music
- Adding metrics: Subclass versa.metrics.base.Metric, register it in the core package, supply a YAML config, and add unit tests (a hypothetical skeleton follows this list).
- Resource management: Required models or embeddings are downloaded and cached, reducing repeated overhead.
- Parallelization: SLURM templates and scriptable CLI ensure scalability for large dataset evaluation.
- Configuration versioning: YAML structure promotes reproducibility and facilitates metric variant exploration (Shi et al., 2024).
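A hypothetical skeleton of such a metric subclass; the base-class path comes from the text above, but its attribute and method names are assumptions rather than the toolkit's documented interface.

```python
import numpy as np
from versa.metrics.base import Metric  # extension point named above


class ZeroCrossingRate(Metric):
    """Toy independent (reference-free) metric: mean zero-crossing rate of the signal."""

    name = "zcr"                 # key used in the YAML metric config (assumed convention)
    requires_reference = False   # independent metric: no ground-truth audio needed

    def score(self, pred_audio: np.ndarray, ref_audio=None, sample_rate: int = 16000) -> float:
        signs = np.signbit(pred_audio)
        return float(np.mean(signs[1:] != signs[:-1]))
```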
Video
- Logic-driven extensibility: Arbitrary new predicates and event definitions can be instantiated by writing new Prolog rules or augmenting existing spatial/temporal semantics.
- Fuzzy logic integration: Replacement of crisp predicates (e.g., near) with fuzzy degree outputs; supports gradual, probabilistically informed alerting.
- Integration: Compatible with standard annotation schemas (CVML, MPEG-7, VEML), increasing portability.
VERSA toolkits have established themselves as reference implementations in both symbolic video event recognition and large-scale, unified audio metric benchmarking. Their architectural principles—declarative configuration, extensibility, and abstraction across backend detail—enable reliable, reproducible evaluation and event analysis in research and production settings (O'Hara, 2010, Shi et al., 2024, Yamamoto et al., 5 Dec 2025).