AnomVerse: Visual & Cluster Anomaly Datasets
- AnomVerse is a collection of two specialized datasets serving as benchmarks for visual anomaly synthesis and time-series anomaly analysis.
- The visual dataset offers 12,987 multimodal triplets with pixel-wise masks and LLM-generated captions, enabling zero-shot anomaly generation.
- The compute cluster dataset records over 1 million time-series samples from 332 HPC nodes to support unsupervised anomaly detection and system monitoring research.
AnomVerse refers to two distinct datasets recognized in the research literature—one in the context of visual anomaly synthesis for anomaly detection, and one in the domain of operational monitoring for compute clusters. Both datasets are independently titled "AnomVerse" and serve as specialized benchmarks for their respective research communities, supporting machine learning development in zero-shot anomaly generation and time-series anomaly analysis.
1. Visual Anomaly Synthesis: AnomVerse in Anomaly Generation
AnomVerse, as introduced in "Anomagic: Crossmodal Prompt-driven Zero-shot Anomaly Generation" (Jiang et al., 13 Nov 2025), is a multimodal benchmark dataset designed for the supervised and zero-shot training of generative anomaly models. The primary goal is to provide large-scale, captioned, mask-aligned images for developing and evaluating crossmodal anomaly generation pipelines.
The dataset consists of 12,987 triplets, where each sample comprises an RGB image with a defect (anomaly), a pixel-wise binary mask indicating the anomaly region, and a machine-generated descriptive caption. Source material is aggregated from 13 publicly available datasets in the industrial, textile, consumer goods, medical, and electronics domains. Domain coverage, based on sample contribution, is 56.5% industrial, 23.6% textiles, 8.7% consumer goods, 5.9% medical, and 5.3% electronics, with a total of 131 unique defect types. Each sample contains exactly one defect mask, and the mask's complement implicitly defines normal (defect-free) regions.
For training generative models, images and masks are resized to 512 × 512 pixels, and masks undergo morphological dilation to delineate the inpainting regions. Captions are generated by a multimodal LLM (Doubao-Seed-1.6-thinking) that processes structured visual hints (bounding boxes around defect regions) and fills the following template:
"The image depicts [object], with a [defect type] observed [location]. The defect is characterized by [description] and exhibits [notable features]."
No human annotation or curation is performed; all captions are accepted as output by the LLM. Tokenization for captions is deferred to CLIP, which allows efficient integration into prompt-based diffusion and transformer models.
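The dilation step can be approximated with standard image tooling. The sketch below assumes OpenCV; the 15 × 15 structuring element is an illustrative choice, not a parameter reported in the publication:

```python
import cv2
import numpy as np

def dilate_mask(mask: np.ndarray, kernel_size: int = 15) -> np.ndarray:
    """Expand a binary defect mask into an inpainting region.

    kernel_size is a hypothetical choice; the source describes dilation
    but does not specify the structuring element.
    """
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    return cv2.dilate(mask.astype(np.uint8), kernel, iterations=1)
```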
2. Annotation Format and File Organization
Each AnomVerse record is a triplet:
- I_ref: RGB anomaly image (e.g., 512×512 PNG or JPEG).
- M_ref: Single-channel binary mask (same dimensions; 0 = background, 1 = defect).
- t_ref: Caption (UTF-8 plain text, 4-segment template).
Although explicit filename conventions are not mandated in the original publication, repository structure typically partitions data into directories by type: images, masks, and captions, indexed identically. Masks serve dual purposes: as ground truth for evaluation and as input for region extraction in captioning and generative workflows.
Captions are stored as plaintext, with tokenization and segmentation handled at model input (using the CLIP tokenizer's 77-token limit with hierarchical encoding as needed). No manual postprocessing or filtering of captions is reported.
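A tokenization sketch, assuming the HuggingFace CLIPTokenizer (the checkpoint name and caption path are illustrative):

```python
from transformers import CLIPTokenizer

# Standard CLIP tokenizer; the checkpoint name is an assumption.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical caption path following the directory layout above.
caption = open("./AnomVerse/captions/000001.txt", encoding="utf-8").read()

# Truncate to CLIP's 77-token context; longer captions would need the
# hierarchical (chunked) encoding mentioned above.
tokens = tokenizer(caption, max_length=77, truncation=True, return_tensors="pt")
print(tokens["input_ids"].shape)  # (1, <=77)
```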
3. Construction and Quality Control
Data is collected by normalizing all available defect mask formats (polygon or raster) into standardized binary pixel-wise masks at uniform spatial scales. Visual semantic alignment between mask and caption is enforced by using the mask to guide the LLM's region-of-interest input and template-based captioning.
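A minimal sketch of this mask normalization, assuming OpenCV and a polygon-vertex source format (the 13 originating datasets use varied annotation formats):

```python
import cv2
import numpy as np

def polygon_to_mask(polygon, height, width):
    """Rasterize one polygon annotation into a binary pixel-wise mask.

    polygon: iterable of (x, y) vertices in source-image coordinates.
    """
    mask = np.zeros((height, width), dtype=np.uint8)
    pts = np.asarray(polygon, dtype=np.int32).reshape(-1, 1, 2)
    cv2.fillPoly(mask, [pts], 1)
    return mask

def normalize_mask(mask, size=(512, 512)):
    """Resize a raster mask to the uniform spatial scale; nearest-neighbor
    interpolation keeps the mask binary."""
    return cv2.resize(mask, size, interpolation=cv2.INTER_NEAREST)
```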
Quality assurance is restricted to LLM-template consistency; no manual quality checks or crowdsourced label corrections are performed. The training split includes samples from 11 datasets, holding out MVTec AD and VisA for zero-shot evaluation. Category balance across defect types is uneven, with a strong industrial skew reflecting the public datasets selected.
Table 1 summarizes key AnomVerse visual dataset properties:
| Property | Value / Description |
|---|---|
| Total triplets | 12,987 |
| Image size | 512 × 512 |
| # Defect types | 131 |
| Domains | Industrial, textiles, consumer, medical, electronics |
| Caption source | Doubao-Seed-1.6-thinking (LLM) |
| Training split | 11/13 datasets (excl. MVTec AD, VisA) |
All values in the table are as explicitly reported in (Jiang et al., 13 Nov 2025).
4. Integration, Usage, and Code Example
Repository code for AnomVerse and the associated Anomagic generative model is available at https://github.com/yuxin-jiang/Anomagic. The dataset is designed for use with deep learning frameworks that accept multimodal input (e.g., HuggingFace Diffusers):
```python
from diffusers import StableDiffusionInpaintPipeline
import torch

# Load the Anomagic inpainting pipeline in half precision on GPU.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "yuxin-jiang/Anomagic", torch_dtype=torch.float16
).to("cuda")

# Triplets are indexed identically across the three directories.
images = load_images("./AnomVerse/images/")
masks = load_masks("./AnomVerse/masks/")
captions = load_texts("./AnomVerse/captions/")

# Caption-conditioned inpainting of the masked defect regions.
outputs = pipe(
    prompt=captions,
    image=images,
    mask_image=masks,
    num_inference_steps=20,
)
```
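The `load_images`, `load_masks`, and `load_texts` helpers above are placeholders; a minimal sketch, assuming PIL-readable files indexed identically across the three directories:

```python
from pathlib import Path
from PIL import Image

def load_images(directory):
    """Load RGB anomaly images, sorted so indices align across folders."""
    return [Image.open(p).convert("RGB") for p in sorted(Path(directory).glob("*"))]

def load_masks(directory):
    """Load single-channel binary masks in the same index order."""
    return [Image.open(p).convert("L") for p in sorted(Path(directory).glob("*"))]

def load_texts(directory):
    """Read UTF-8 plaintext captions."""
    return [p.read_text(encoding="utf-8").strip() for p in sorted(Path(directory).glob("*"))]
```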
Captions can be arbitrarily long during dataset creation, but CLIP tokenization imposes a sequence limit at inference and training time. License information is not detailed in the primary publication, but the repository is presumed to contain an open-source license (MIT/Apache); users are instructed to cite (Jiang et al., 13 Nov 2025), and commercial use is subject to the originating datasets' terms.
5. Compute Cluster Anomaly Detection: AnomVerse in Time-Series Analysis
A distinct dataset with the same name, "AnomVerse," is presented in "Dataset for Investigating Anomalies in Compute Clusters" (McSpadden et al., 2023). This dataset is a large-scale, multi-metric time-series corpus collected from 332 high-performance computing (HPC) cluster nodes at Jefferson Lab, targeting research in automated anomaly detection, root-cause analysis, and digital twin development.
Captured over a five-day period (May 19–23, 2023), AnomVerse includes:
- Normal regime: May 19–22 (baseline).
- Anomaly event: May 23 (major IT intervention event).
Per-node telemetry covers four metric groups: CPU (8 counters), disk (11 counters), memory (47 gauges), and Slurm job metrics (22 time series from 4 files × multiple states). Data volume exceeds 1 million records (~180 GB).
The data schema uses CSV files with the following common columns:

`index, __name__, [dimension columns], timestamp, value`
Key dimensions include instance (node or exporter), metric mode/status (e.g., CPU mode, disk device, Slurm state), and sampling timestamp (RFC3339). Granularity is 60 s for hardware metrics and 30 s for Slurm scheduler metrics.
Example table of metric schema:
| Category | Metric Count | Sample Freq | Unit(s) |
|---|---|---|---|
| CPU | 8 | 60 s | seconds (cumulative) |
| Disk | 11 | 60 s | counts, bytes, seconds |
| Memory | 47 | 60 s | bytes |
| Slurm | 22 | 30 s | counts (CPUs in state) |
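As a loading sketch, assuming pandas and the column schema above (the filename and metric name are hypothetical):

```python
import pandas as pd

# Filename and metric name are hypothetical; columns follow the schema above.
df = pd.read_csv("cpu_metrics.csv", parse_dates=["timestamp"])

# Select one CPU counter and pivot to a per-(instance, mode) wide table.
cpu = df[df["__name__"] == "node_cpu_seconds_total"]
wide = cpu.pivot_table(index="timestamp", columns=["instance", "mode"], values="value")
print(wide.resample("60s").mean().head())  # 60 s hardware-metric granularity
```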
6. Anomaly Identification Protocol and Limitations
Cluster-wide anomaly characterization marks May 23 as potentially anomalous but provides no explicit ground truth at the per-node or per-metric level. Instead, users may calculate per-metric z-scores:

$$z_{m,t} = \frac{x_{m,t} - \mu_m}{\sigma_m}$$

where $x_{m,t}$ is the value of metric $m$ at time $t$, and the reference statistics $\mu_m$ and $\sigma_m$ are computed from the May 19–22 baseline. Samples exceeding a threshold (e.g., $|z_{m,t}| > 3$) may be provisionally flagged as anomalous. A plausible implication is that the dataset is best suited for unsupervised or semi-supervised anomaly detection algorithms, since explicit instance-level anomaly annotation is absent. Data cleaning is necessary owing to expected gaps and sampling jitter; normalization (rate conversion, min-max, or z-score scaling) is recommended for comparability across node families.
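A minimal pandas sketch of this provisional flagging (the threshold of 3 standard deviations is illustrative):

```python
import pandas as pd

def flag_anomalies(series: pd.Series, baseline: pd.Series, threshold: float = 3.0):
    """Return z-scores and a per-sample boolean flag for one metric.

    `baseline` holds the May 19-22 values of the same metric; the
    default threshold is an illustrative choice, not ground truth.
    """
    mu, sigma = baseline.mean(), baseline.std()
    z = (series - mu) / sigma
    return z, z.abs() > threshold
```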
7. Applications, Preprocessing, and Research Usage
Both variants of AnomVerse have been used to benchmark methodologies aligned with their data modalities:
- Visual AnomVerse: Training and evaluating crossmodal generative models for anomaly synthesis, including zero-shot prompt-driven synthesis and diffusion-based inpainting (Jiang et al., 13 Nov 2025).
- Cluster AnomVerse: Benchmarking unsupervised anomaly detection (e.g., variational autoencoders, Gaussian mixture models), time-series prediction (LSTM, Transformer-based models), root-cause inference, and digital twin calibration (McSpadden et al., 2023).
Typical preprocessing and experimental tasks for the cluster time-series dataset include imputation (forward-fill, linear interpolation), aggregation (windowed statistics), and transformation to rates or standardized scores. Researchers have proposed extensions including network telemetry, synthetic anomaly injection, long-duration recording, and integration with job scheduling metadata.
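The sketch below illustrates these preprocessing steps with pandas, assuming a wide DataFrame indexed by timestamp; the window length and gap-fill limit are hypothetical choices:

```python
import pandas as pd

def preprocess(wide: pd.DataFrame) -> pd.DataFrame:
    """Clean and standardize cluster time series for model input."""
    wide = wide.resample("60s").mean()        # align to the 60 s grid, absorbing jitter
    wide = wide.interpolate(limit=5)          # linearly impute short gaps
    rates = wide.diff() / 60.0                # cumulative counters -> per-second rates
    window = rates.rolling("10min").mean()    # windowed aggregation
    return (window - window.mean()) / window.std()  # z-score standardization
```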
The availability of comprehensive, multi-domain, large-scale anomaly datasets under the AnomVerse name underpins ongoing advances in both visual anomaly synthesis and operational anomaly detection, and each is positioned as a community benchmark in its respective research area (Jiang et al., 13 Nov 2025, McSpadden et al., 2023).