VCUBench: Multi-Domain Benchmark Suite
- VCUBench is a set of benchmark suites that define standardized evaluation protocols across diverse domains, including cache timing vulnerability analysis, quantum circuit volumetrics, and visual concept unlearning.
- Each suite features automated test generation, clear experimental methodologies, and precise, domain-specific metrics to enable reproducible and quantitative cross-system comparisons.
- Empirical findings using VCUBench guide design improvements in hardware security, quantum device performance, and ethical machine unlearning, fostering robust, community-driven research.
VCUBench refers to multiple, independently developed benchmark suites designed for distinct research domains, each providing community-driven, reproducible standards for evaluating advanced models or systems. Published VCUBench suites address: (1) cache side-channel vulnerability benchmarking in computer architecture, (2) quantum processor volumetric capacity benchmarking, and (3) group-context visual concept unlearning in multimodal LLMs. Each instantiation defines both the technical scope and experimental methodology for quantitative, cross-system comparison.
1. VCUBench for Cache Timing Vulnerability Benchmarking
VCUBench is the first fully-automated benchmark suite to comprehensively enumerate and test all timing-based cache side- and covert-channel vulnerabilities in modern L1 data-cache designs (Deng et al., 2019). This enables architecture researchers to characterize, compare, and improve hardware security guarantees.
Overview and Motivation
Side-channel attacks such as Prime+Probe, Flush+Reload, and SpectrePrime exploit timing differences in cache operations, threatening confidentiality. Prior work chiefly cataloged attacks in an ad hoc manner without systematic enumeration or evaluation protocols. VCUBench fills this gap, deriving 88 theoretically possible “Strong” vulnerability patterns using a refined three-step attack model, and automatically emitting 1,094 distinct C benchmarks to probe them.
Three-Step Attack Model
VCUBench’s attack modeling abstracts every cache-line timing leak as a three-step sequence:
- S1: Preparation (e.g., load, store, flush, invalidate)
- S2: Interference (actions by the attacker or the victim)
- S3: Timing Observation (through load, store, etc.)
Each step can involve one of 17 microarchitectural cache-line states (spanning local/remote L1–L3, clean/dirty, and DRAM), expanding to 4,913 (17³) possible step triples; 88 of these yield strong, attacker-recoverable leaks as determined by modeling and simulation.
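As a rough illustration of the scale of this enumeration, the following minimal Python sketch counts the candidate step triples; the state labels are placeholders, not the paper's actual state definitions, and classifying a triple as "strong" requires the modeling and simulation described above.

```python
from itertools import product

# 17 abstract cache-line states per step (placeholder labels, not the
# exact state names used by Deng et al., 2019).
STATES = [f"state_{i}" for i in range(17)]

# Every S1 -> S2 -> S3 combination is a candidate vulnerability pattern.
candidates = list(product(STATES, repeat=3))
print(len(candidates))  # 4913 = 17 ** 3; modeling/simulation reduces this to 88 strong patterns
```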
Vulnerability Enumeration and Test Instantiation
The 88 “Strong” vulnerabilities are categorized by interference type (internal vs. external) and leakage granularity (address-based, set-based, or combined). Each is mapped to C code templates that control executing threads (attacker, victim), access patterns, thread binding (time-sliced or hyperthreaded), and observation primitives (load, store, coherence invalidation).
Table: Example Vulnerability Patterns
| Index | Step Triple | Category | Canonical Attack Name |
|---|---|---|---|
| 1 | A{inv}→V_u→V_a | I–A | Cache Collision |
| 87 | A_d→V_u→A_d{inv} | E–S | Prime+Probe Inv. |
| 88 | A_a→V_u→A_a{inv} | E–SA | Prime+Probe Inv. |
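To make the instantiation concrete, a minimal sketch of how a step triple might be expanded into a benchmark configuration is shown below; the field names and values are illustrative assumptions, not the suite's actual template parameters.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkConfig:
    """Illustrative parameters for one generated benchmark (the field names
    are assumptions, not the exact knobs of the VCUBench generator)."""
    step1: str            # preparation action, e.g. "attacker_access_d"
    step2: str            # interference action, e.g. "victim_access_u"
    step3: str            # timing observation, e.g. "attacker_invalidate_d"
    interference: str     # "internal" or "external"
    granularity: str      # "address", "set", or "address+set"
    binding: str          # "time_sliced" or "hyperthreaded"

# e.g. pattern 87 from the table above (Prime+Probe with invalidation):
cfg = BenchmarkConfig("attacker_access_d", "victim_access_u", "attacker_invalidate_d",
                      interference="external", granularity="set",
                      binding="hyperthreaded")
```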
Cache Timing Vulnerability Score (CTVS)
The aggregate Cache Timing Vulnerability Score is defined as the fraction of strong vulnerabilities observed in practice:

$$\mathrm{CTVS}(m) \;=\; \frac{1}{|V|} \sum_{v \in V} e_v(m),$$

where $e_v(m) = 1$ if pattern $v$ is exploitable on target $m$, else $e_v(m) = 0$, and $V$ is the set of 88 strong vulnerability patterns. Lower CTVS indicates a more secure cache. Scores are directly comparable across machines, RTL prototypes, and simulators.
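A minimal sketch of the score computation, assuming exploitability results are available as a boolean map per pattern:

```python
def ctvs(exploitable: dict[int, bool]) -> float:
    """Fraction of strong vulnerability patterns (keys 1..88) that are
    exploitable on the target; lower is more secure."""
    return sum(exploitable.values()) / len(exploitable)

# Toy example: suppose 50 of the 88 strong patterns manifest on a machine.
results = {i: (i <= 50) for i in range(1, 89)}
print(round(ctvs(results), 2))  # 0.57
```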
Empirical Findings and Guidance
- Real hardware tests on Intel and AMD platforms show that only roughly 50–80 of the 88 strong vulnerabilities manifest in practice on any given chip.
- Hyperthreaded configurations expose more channels than time-sliced.
- Write/Store-based and coherence-invalidation leakages are as prevalent as load/flush-based ones.
- CTVS varies strongly across microarchitectures: e.g., the AMD FX-8150 scores 0.57, while the Intel Xeon E5-1620 reaches 0.83.
Design recommendations include prioritizing store-timing defenses, coherence randomization, hyperthread partitioning, and multi-state latency homogenization. Iterative CTVS-guided simulation is advised for design closure (Deng et al., 2019).
2. VCUBench as a Volumetric Quantum Benchmark
In the quantum computing domain, VCUBench manifests as a volumetric benchmark—a parameterized, multidimensional extension of the quantum volume paradigm designed for broad cross-platform quantum characterization (Blume-Kohout et al., 2019).
Framework Definition
A volumetric benchmark (VB) defines a map

$$(w, d) \;\mapsto\; \mathcal{C}_{w,d},$$

with
- $w$: circuit width (qubits)
- $d$: circuit depth (logical layers)
- $\mathcal{C}_{w,d}$: an ensemble of test circuits at the given shape $(w, d)$
Each test suite specifies native compilation, per-circuit success criteria (e.g., heavy-output probability), aggregation methods, and experiment design.
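A schematic of the benchmarking loop implied by this definition is sketched below; the circuit-generation and scoring callables are placeholders for platform-specific code, not a fixed API from the paper.

```python
def run_volumetric_benchmark(widths, depths, make_circuit, score_circuit, k=20, shots=2000):
    """For each circuit shape (w, d), draw k test circuits, score each one
    (e.g., by its heavy-output probability), and mark the shape as passing
    when the average score clears the 2/3 threshold.  `make_circuit` and
    `score_circuit` are user-supplied, platform-specific callables."""
    results = {}
    for w in widths:
        for d in depths:
            scores = [score_circuit(make_circuit(w, d), shots) for _ in range(k)]
            results[(w, d)] = sum(scores) / k >= 2 / 3
    return results
```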
VCUBench Circuit Family
VCUBench adopts the quantum volume/IBM-style random circuit family:
```
for ℓ = 1 to d:
    if ℓ is odd:  apply a random permutation π_ℓ ∈ S_w to the qubits
    if ℓ is even: apply independent random SU(4) gates to ⌊w/2⌋ disjoint qubit pairs
```
A fixed number of circuits (e.g., K = 20 per (w, d) shape) is tested with 2,000 shots each. A shape passes when the average heavy-output probability exceeds the 2/3 threshold with statistical confidence.
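A minimal sketch of the heavy-output scoring step, assuming the ideal output distribution of each circuit has been computed classically (the argument names are illustrative):

```python
import numpy as np

def heavy_output_probability(ideal_probs, measured_counts, shots):
    """Fraction of measured shots landing on 'heavy' outputs: bitstrings whose
    ideal (noiseless) probability exceeds the median ideal probability."""
    width = int(np.log2(len(ideal_probs)))
    median = np.median(ideal_probs)
    heavy = {format(i, f"0{width}b") for i, p in enumerate(ideal_probs) if p > median}
    return sum(count for bits, count in measured_counts.items() if bits in heavy) / shots
```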
Pareto Frontier and Reporting
Results are plotted on the (width, depth) grid, with the Pareto frontier indicating maximal circuit depth per width for successful execution. Overlaying classical noise models enables deviation analysis.
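Given the {(w, d): passed} grid produced by a run such as the earlier sketch, the frontier can be read off as the deepest passing depth at each width:

```python
def pareto_frontier(results):
    """Map each width to the maximum depth at which the device still passes,
    given a {(w, d): passed} grid of benchmark outcomes."""
    frontier = {}
    for (w, d), passed in results.items():
        if passed and d > frontier.get(w, 0):
            frontier[w] = d
    return frontier
```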
VCUBench ensures open, cross-platform reproducibility via strict random-seed logging, public circuit generators, and clearly codified passing rules (Blume-Kohout et al., 2019).
3. VCUBench for Visual Concept Unlearning in Multimodal LLMs
A recent instantiation of VCUBench (unrelated to the aforementioned uses) targets the evaluation of machine unlearning (MU) of fine-grained visual concepts in multimodal LLMs (MLLMs) (Chen et al., 14 Nov 2025).
Motivation and Benchmark Structure
Contemporary regulatory and ethical constraints (e.g., the “right to be forgotten”) mandate mechanisms for efficient, precise erasure of sensitive visual content from trained models. Existing MU benchmarks are text-only or rely on unrealistic setups. VCUBench addresses this by curating real-world individual and multi-person images of five public figures, each annotated with 20 VQA queries probing identity, spatial relations, and commonsense context.
Tasks and Splits
- Target-Single: Solo images of each figure.
- Target-Group: Group images containing the figure alongside others.
- Non-Target-Single / Group: Controls for collateral forgetting and retention.
A held-out test set ensures evaluation of model changes without retraining artifacts.
Metrics
VCUBench defines six core quantitative metrics:
- Target Forgetting Accuracy (TFA): fraction of target-identity queries on which the erased knowledge is no longer reproduced
- Non-Target Retain Accuracy (NTRA): fraction of non-target queries still answered correctly after unlearning
- Group Retain–Forget F1 (GRF-F1): harmonic-mean balance between forgetting the target and retaining co-occurring non-targets in group images
- Efficacy on Single-Person (EFF): fraction of correct erasure in single-person images
- Generality: held-out ScienceVQA accuracy for side-effect analysis
- Perplexity (PPL): text fluency (non-penalizing for forgotten tokens)
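A minimal sketch of how such metrics could be computed from per-query correctness flags; the exact formulas in the paper may differ, so treat these as assumptions:

```python
def tfa(target_correct):
    """Target Forgetting Accuracy: share of target-identity queries the
    unlearned model now answers incorrectly (i.e., successfully forgets)."""
    return 1 - sum(target_correct) / len(target_correct)

def ntra(nontarget_correct):
    """Non-Target Retain Accuracy: share of non-target queries still answered correctly."""
    return sum(nontarget_correct) / len(nontarget_correct)

def grf_f1(group_forget, group_retain):
    """F1-style balance of target forgetting and non-target retention in group images."""
    return 2 * group_forget * group_retain / (group_forget + group_retain)
```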
Protocol
All MU methods operate on frozen base models (e.g., LLaVA-1.5) with parameter-efficient adapter tuning. Baselines include gradient ascent on targets, preference optimization for abstention, gradient ascent + KL regularization, and state-of-the-art selective unlearning (SIU). The AUVIC framework applies adversarial perturbations and Gumbel-Softmax sampling for collateral retention.
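As an illustration of the gradient-ascent + KL-regularization baseline (not the AUVIC method itself), a minimal PyTorch-style training step might look like the following; the HuggingFace-like model interface and argument names are assumptions.

```python
import torch
import torch.nn.functional as F

def unlearning_step(model, frozen_model, forget_batch, retain_batch, optimizer, kl_weight=1.0):
    """One update of the gradient-ascent + KL baseline: push the loss *up* on
    forget-set answers while keeping retain-set output distributions close to
    the frozen base model.  Assumes model(**batch) returns .loss and .logits."""
    optimizer.zero_grad()

    # Gradient ascent on the forget set (negate the usual language-modeling loss).
    forget_loss = -model(**forget_batch).loss

    # KL regularization toward the frozen model on the retain set.
    retain_logits = model(**retain_batch).logits
    with torch.no_grad():
        ref_logits = frozen_model(**retain_batch).logits
    kl = F.kl_div(F.log_softmax(retain_logits, dim=-1),
                  F.softmax(ref_logits, dim=-1), reduction="batchmean")

    (forget_loss + kl_weight * kl).backward()
    optimizer.step()
```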
Key Findings
AUVIC achieves TFA above 93%, high NTRA (above 80%), and a strongly balanced GRF-F1 (~88%). Performance degrades significantly when adversarial perturbation or anchor sampling is ablated. AUVIC preserves general VQA accuracy and text fluency, demonstrating the importance of jointly designing for precise forgetting and retention (Chen et al., 14 Nov 2025).
4. Technical and Experimental Methodology Unification
Across all manifestations, VCUBench comprehensively enumerates the relevant threat/model/task space, constructs representative and challenging real-world test instances, and provides automated evaluation harnesses for cross-platform, cross-research-group reproducibility.
Common Principles
- Enumeration: Theoretical completeness (e.g., all strong vulnerabilities, all (w,d) circuit shapes, all group-member settings).
- Automated Generation: Code/data synthesis covering the span of test patterns.
- Metric Formalization: Domain-appropriate, scalar yet decomposable performance metrics (e.g., CTVS, Pareto frontier, TFA/GRF-F1).
- Open Sourcing: Reference implementations and canonical data splits.
5. Impact and Future Directions
VCUBench benchmarks have direct impact on hardware security, quantum device evaluation, and trustworthy multimodal model design. As their domains converge, advances in model unlearning, circuit reliability, and side-channel resilience increasingly inform—and are informed by—community benchmark suites.
Cache Security: VCUBench guides practical defense design, enabling architects to quantitatively close major timing leak channels (Deng et al., 2019).
Quantum Computing: Pareto frontier visualizations and test reproducibility from VCUBench underpin fair cross-device comparison, standards compliance, and experimental reproducibility (Blume-Kohout et al., 2019).
Machine Unlearning: As regulatory mandates grow, VCUBench establishes an empirical baseline for selective, precise, and reproducible identity erasure in vision-LLMs (Chen et al., 14 Nov 2025).
Planned future developments include extending VCUBench to additional modalities, richer feature sets, more scalable anonymization, robust simulation-based protocols, and human–AI competitive evaluation (Blume-Kohout et al., 2019, Chen et al., 14 Nov 2025).