VCUBench: Multi-Domain Benchmark Suite
- VCUBench is a set of benchmark suites that define standardized evaluation protocols across diverse domains, including cache timing vulnerability analysis, quantum circuit volumetrics, and visual concept unlearning.
- Each suite features automated test generation, clear experimental methodologies, and precise, domain-specific metrics to enable reproducible and quantitative cross-system comparisons.
- Empirical findings using VCUBench guide design improvements in hardware security, quantum device performance, and ethical machine unlearning, fostering robust, community-driven research.
VCUBench refers to multiple, independently developed benchmark suites designed for distinct research domains, each providing community-driven, reproducible standards for evaluating advanced models or systems. Published VCUBench suites address: (1) cache side-channel vulnerability benchmarking in computer architecture, (2) quantum processor volumetric capacity benchmarking, and (3) group-context visual concept unlearning in multimodal LLMs. Each instantiation defines both the technical scope and experimental methodology for quantitative, cross-system comparison.
1. VCUBench for Cache Timing Vulnerability Benchmarking
VCUBench is the first fully-automated benchmark suite to comprehensively enumerate and test all timing-based cache side- and covert-channel vulnerabilities in modern L1 data-cache designs (Deng et al., 2019). This enables architecture researchers to characterize, compare, and improve hardware security guarantees.
Overview and Motivation
Side-channel attacks such as Prime+Probe, Flush+Reload, and SpectrePrime exploit timing differences in cache operations, threatening confidentiality. Prior work chiefly cataloged attacks in an ad hoc manner without systematic enumeration or evaluation protocols. VCUBench fills this gap, deriving 88 theoretically possible “Strong” vulnerability patterns using a refined three-step attack model, and automatically emitting 1,094 distinct C benchmarks to probe them.
Three-Step Attack Model
VCUBench’s attack modeling abstracts every cache-line timing leak as a three-step sequence:
- S1: Preparation (e.g., load, store, flush, invalidate)
- S2: Interference (actions by the attacker or the victim)
- S3: Timing Observation (through load, store, etc.)
Each step can involve one of 17 microarchitectural cache-line states (spanning local/remote L1–L3, clean/dirty, and DRAM), expanding to 4,913 (17³) possible step triples; 88 of these yield strong, attacker-recoverable leaks as determined by modeling and simulation.
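As a rough illustration of the scale of this enumeration, the following minimal Python sketch counts the candidate step triples; the state labels are placeholders, not the paper's actual state definitions, and classifying a triple as "strong" requires the modeling and simulation described above.

```python
from itertools import product

# 17 abstract cache-line states per step (placeholder labels, not the
# exact state names used by Deng et al., 2019).
STATES = [f"state_{i}" for i in range(17)]

# Every S1 -> S2 -> S3 combination is a candidate vulnerability pattern.
candidates = list(product(STATES, repeat=3))
print(len(candidates))  # 4913 = 17 ** 3; modeling/simulation reduces this to 88 strong patterns
```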
Vulnerability Enumeration and Test Instantiation
The 88 “Strong” vulnerabilities are categorized by interference type (internal vs. external) and leakage granularity (address-based, set-based, or combined). Each is mapped to C code templates that control executing threads (attacker, victim), access patterns, thread binding (time-sliced or hyperthreaded), and observation primitives (load, store, coherence invalidation).
Table: Example Vulnerability Patterns
| Index | Step Triple | Category | Canonical Attack Name |
|---|---|---|---|
| 1 | A{inv}→V_u→V_a | I–A | Cache Collision |
| 87 | A_d→V_u→A_d{inv} | E–S | Prime+Probe Inv. |
| 88 | A_a→V_u→A_a{inv} | E–SA | Prime+Probe Inv. |
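To make the instantiation concrete, a minimal sketch of how a step triple might be expanded into a benchmark configuration is shown below; the field names and values are illustrative assumptions, not the suite's actual template parameters.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkConfig:
    """Illustrative parameters for one generated benchmark (the field names
    are assumptions, not the exact knobs of the VCUBench generator)."""
    step1: str            # preparation action, e.g. "attacker_access_d"
    step2: str            # interference action, e.g. "victim_access_u"
    step3: str            # timing observation, e.g. "attacker_invalidate_d"
    interference: str     # "internal" or "external"
    granularity: str      # "address", "set", or "address+set"
    binding: str          # "time_sliced" or "hyperthreaded"

# e.g. pattern 87 from the table above (Prime+Probe with invalidation):
cfg = BenchmarkConfig("attacker_access_d", "victim_access_u", "attacker_invalidate_d",
                      interference="external", granularity="set",
                      binding="hyperthreaded")
```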
Cache Timing Vulnerability Score (CTVS)
The aggregate Cache Timing Vulnerability Score is defined as the fraction of strong vulnerabilities observed in practice:

$$\mathrm{CTVS}(m) \;=\; \frac{1}{|V|} \sum_{v \in V} e_v(m),$$

where $e_v(m) = 1$ if pattern $v$ is exploitable on target $m$, else $e_v(m) = 0$, and $V$ is the set of 88 strong vulnerability patterns. Lower CTVS indicates a more secure cache. Scores are directly comparable across machines, RTL prototypes, and simulators.
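A minimal sketch of the score computation, assuming exploitability results are available as a boolean map per pattern:

```python
def ctvs(exploitable: dict[int, bool]) -> float:
    """Fraction of strong vulnerability patterns (keys 1..88) that are
    exploitable on the target; lower is more secure."""
    return sum(exploitable.values()) / len(exploitable)

# Toy example: suppose 50 of the 88 strong patterns manifest on a machine.
results = {i: (i <= 50) for i in range(1, 89)}
print(round(ctvs(results), 2))  # 0.57
```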
Empirical Findings and Guidance
- Real hardware tests on Intel and AMD platforms show that only roughly 50–80 of the 88 strong vulnerabilities manifest in practice on any given chip.
- Hyperthreaded configurations expose more channels than time-sliced.
- Write/Store-based and coherence-invalidation leakages are as prevalent as load/flush-based ones.
- CTVS varies strongly across microarchitectures: e.g., the AMD FX-8150 scores 0.57, while the Intel Xeon E5-1620 reaches 0.83.
Design recommendations include prioritizing store-timing defenses, coherence randomization, hyperthread partitioning, and multi-state latency homogenization. Iterative CTVS-guided simulation is advised for design closure (Deng et al., 2019).
2. VCUBench as a Volumetric Quantum Benchmark
In the quantum computing domain, VCUBench manifests as a volumetric benchmark—a parameterized, multidimensional extension of the quantum volume paradigm designed for broad cross-platform quantum characterization (Blume-Kohout et al., 2019).
Framework Definition
A volumetric benchmark (VB) defines a map

$$(w, d) \;\mapsto\; \mathcal{C}_{w,d},$$

with
- $w$: circuit width (qubits)
- $d$: circuit depth (logical layers)
- $\mathcal{C}_{w,d}$: an ensemble of test circuits at the given shape $(w, d)$
Each test suite specifies native compilation, per-circuit success criteria (e.g., heavy-output probability), aggregation methods, and experiment design.
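A schematic of the benchmarking loop implied by this definition is sketched below; the circuit-generation and scoring callables are placeholders for platform-specific code, not a fixed API from the paper.

```python
def run_volumetric_benchmark(widths, depths, make_circuit, score_circuit, k=20, shots=2000):
    """For each circuit shape (w, d), draw k test circuits, score each one
    (e.g., by its heavy-output probability), and mark the shape as passing
    when the average score clears the 2/3 threshold.  `make_circuit` and
    `score_circuit` are user-supplied, platform-specific callables."""
    results = {}
    for w in widths:
        for d in depths:
            scores = [score_circuit(make_circuit(w, d), shots) for _ in range(k)]
            results[(w, d)] = sum(scores) / k >= 2 / 3
    return results
```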
VCUBench Circuit Family
VCUBench adopts the quantum volume/IBM-style random circuit family:
```
for ℓ = 1 to d:
    if ℓ is odd:  apply a random permutation π_ℓ ∈ S_w to the qubits
    if ℓ is even: apply independent random SU(4) gates to ⌊w/2⌋ disjoint qubit pairs
```
A fixed number of circuits (e.g., K = 20 per (w, d) shape) is tested with 2,000 shots each. A shape passes when the average heavy-output probability exceeds the 2/3 threshold with statistical confidence.
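A minimal sketch of the heavy-output scoring step, assuming the ideal output distribution of each circuit has been computed classically (the argument names are illustrative):

```python
import numpy as np

def heavy_output_probability(ideal_probs, measured_counts, shots):
    """Fraction of measured shots landing on 'heavy' outputs: bitstrings whose
    ideal (noiseless) probability exceeds the median ideal probability."""
    width = int(np.log2(len(ideal_probs)))
    median = np.median(ideal_probs)
    heavy = {format(i, f"0{width}b") for i, p in enumerate(ideal_probs) if p > median}
    return sum(count for bits, count in measured_counts.items() if bits in heavy) / shots
```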
Pareto Frontier and Reporting
Results are plotted on the (width, depth) grid, with the Pareto frontier indicating maximal circuit depth per width for successful execution. Overlaying classical noise models enables deviation analysis.
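Given the {(w, d): passed} grid produced by a run such as the earlier sketch, the frontier can be read off as the deepest passing depth at each width:

```python
def pareto_frontier(results):
    """Map each width to the maximum depth at which the device still passes,
    given a {(w, d): passed} grid of benchmark outcomes."""
    frontier = {}
    for (w, d), passed in results.items():
        if passed and d > frontier.get(w, 0):
            frontier[w] = d
    return frontier
```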
VCUBench ensures open, cross-platform reproducibility via strict random-seed logging, public circuit generators, and clearly codified passing rules (Blume-Kohout et al., 2019).
3. VCUBench for Visual Concept Unlearning in Multimodal LLMs
A recent instantiation of VCUBench (unrelated to the aforementioned uses) targets the evaluation of machine unlearning (MU) of fine-grained visual concepts in multimodal LLMs (MLLMs) (Chen et al., 14 Nov 2025).
Motivation and Benchmark Structure
Contemporary regulatory and ethical constraints (e.g., the “right to be forgotten”) mandate mechanisms for efficient, precise erasure of sensitive visual content from trained models. Existing MU benchmarks are text-only or rely on unrealistic setups. VCUBench addresses this by curating real-world individual and multi-person images of five public figures, each annotated with 20 VQA queries probing identity, spatial relations, and commonsense context.
Tasks and Splits
- Target-Single: Solo images of each figure.
- Target-Group: Group images containing the figure alongside others.
- Non-Target-Single / Group: Controls for collateral forgetting and retention.
A held-out test set ensures evaluation of model changes without retraining artifacts.
Metrics
VCUBench defines six core quantitative metrics:
- Target Forgetting Accuracy (TFA): fraction of target-identity queries on which the erased knowledge is no longer reproduced
- Non-Target Retain Accuracy (NTRA): fraction of non-target queries still answered correctly after unlearning
- Group Retain–Forget F1 (GRF-F1): harmonic-mean balance between forgetting the target and retaining co-occurring non-targets in group images
- Efficacy on Single-Person (EFF): fraction of correct erasure in single-person images
- Generality: held-out ScienceVQA accuracy for side-effect analysis
- Perplexity (PPL): text fluency (non-penalizing for forgotten tokens)
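A minimal sketch of how such metrics could be computed from per-query correctness flags; the exact formulas in the paper may differ, so treat these as assumptions:

```python
def tfa(target_correct):
    """Target Forgetting Accuracy: share of target-identity queries the
    unlearned model now answers incorrectly (i.e., successfully forgets)."""
    return 1 - sum(target_correct) / len(target_correct)

def ntra(nontarget_correct):
    """Non-Target Retain Accuracy: share of non-target queries still answered correctly."""
    return sum(nontarget_correct) / len(nontarget_correct)

def grf_f1(group_forget, group_retain):
    """F1-style balance of target forgetting and non-target retention in group images."""
    return 2 * group_forget * group_retain / (group_forget + group_retain)
```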
Protocol
All MU methods operate on frozen base models (e.g., LLaVA-1.5) with parameter-efficient adapter tuning. Baselines include gradient ascent on targets, preference optimization for abstention, gradient ascent + KL regularization, and state-of-the-art selective unlearning (SIU). The AUVIC framework applies adversarial perturbations and Gumbel-Softmax sampling for collateral retention.
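As an illustration of the gradient-ascent + KL-regularization baseline (not the AUVIC method itself), a minimal PyTorch-style training step might look like the following; the HuggingFace-like model interface and argument names are assumptions.

```python
import torch
import torch.nn.functional as F

def unlearning_step(model, frozen_model, forget_batch, retain_batch, optimizer, kl_weight=1.0):
    """One update of the gradient-ascent + KL baseline: push the loss *up* on
    forget-set answers while keeping retain-set output distributions close to
    the frozen base model.  Assumes model(**batch) returns .loss and .logits."""
    optimizer.zero_grad()

    # Gradient ascent on the forget set (negate the usual language-modeling loss).
    forget_loss = -model(**forget_batch).loss

    # KL regularization toward the frozen model on the retain set.
    retain_logits = model(**retain_batch).logits
    with torch.no_grad():
        ref_logits = frozen_model(**retain_batch).logits
    kl = F.kl_div(F.log_softmax(retain_logits, dim=-1),
                  F.softmax(ref_logits, dim=-1), reduction="batchmean")

    (forget_loss + kl_weight * kl).backward()
    optimizer.step()
```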
Key Findings
AUVIC achieves TFA above 93%, high NTRA (above 80%), and a strongly balanced GRF-F1 (~88%). Performance degrades significantly when adversarial perturbation or anchor sampling is ablated. AUVIC preserves general VQA accuracy and text fluency, demonstrating the importance of jointly designing for precise forgetting and retention (Chen et al., 14 Nov 2025).
4. Technical and Experimental Methodology Unification
Across all manifestations, VCUBench comprehensively enumerates the relevant threat/model/task space, constructs representative and challenging real-world test instances, and provides automated evaluation harnesses for cross-platform, cross-research-group reproducibility.
Common Principles
- Enumeration: Theoretical completeness (e.g., all strong vulnerabilities, all (w,d) circuit shapes, all group-member settings).
- Automated Generation: Code/data synthesis covering the span of test patterns.
- Metric Formalization: Domain-appropriate, scalar yet decomposable performance metrics (e.g., CTVS, Pareto frontier, TFA/GRF-F1).
- Open Sourcing: Reference implementations and canonical data splits.
5. Impact and Future Directions
VCUBench benchmarks have direct impact on hardware security, quantum device evaluation, and trustworthy multimodal model design. As their domains converge, advances in model unlearning, circuit reliability, and side-channel resilience increasingly inform—and are informed by—community benchmark suites.
Cache Security: VCUBench guides practical defense design, enabling architects to quantitatively close major timing leak channels (Deng et al., 2019).
Quantum Computing: Pareto frontier visualizations and test reproducibility from VCUBench underpin fair cross-device comparison, standards compliance, and experimental reproducibility (Blume-Kohout et al., 2019).
Machine Unlearning: As regulatory mandates grow, VCUBench establishes an empirical baseline for selective, precise, and reproducible identity erasure in vision-LLMs (Chen et al., 14 Nov 2025).
Planned future developments include extending VCUBench to additional modalities, richer feature sets, more scalable anonymization, robust simulation-based protocols, and human–AI competitive evaluation (Blume-Kohout et al., 2019, Chen et al., 14 Nov 2025).