
CovertComBench: CC & Microarchitectural Tests

Updated 2 February 2026
  • CovertComBench is a domain-specific benchmarking framework that evaluates both wireless covert communication using LLMs and microarchitectural cache vulnerabilities.
  • It employs rigorous test pipelines including MCQs, derivation questions, and code generation tasks to assess model reasoning and coding precision under security constraints.
  • The framework provides quantifiable metrics like covert communication rate and Cache Timing Vulnerability Score to guide enhancements in LLM performance and processor defenses.

CovertComBench is a comprehensive, domain-specific benchmarking framework targeting two distinct but conceptually aligned fronts: (1) the systematic evaluation of LLMs for wireless covert communication under strict information-theoretic constraints, and (2) the exhaustive measurement and classification of microarchitectural vulnerabilities to cache-timing side and covert channels in processor designs. Originally introduced in the context of LLM benchmarking for detection-constrained wireless links and subsequently extended to processor-channel security assessment, CovertComBench unifies formal modeling, automated test generation, and reproducible scoring mechanisms to quantify and compare the capabilities and weaknesses of both machine learning models and microarchitectural defenses under adversarial or steganographic constraints (Liu et al., 26 Jan 2026, Deng et al., 2019).

1. Detection-Theoretic Covert Communication: Problem Formulation

In wireless covert communication (CC), the central objective is to hide not only the content but the very existence of a transmission, rendering the communication statistically indistinguishable from noise to a warden (W). Unlike traditional wireless links focused on maximizing throughput or reliability under resource constraints, CC design is fundamentally a detection-theoretic constrained optimization problem. The transmitter (S) and receiver (R) communicate over a noisy channel, with a warden observing through an independently noisy channel. The system is governed by a covertness constraint on the statistical divergence between the warden's observations under two hypotheses: $P_0$ (no covert signal) and $P_1$ (covert transmission). The constraint is typically expressed as

$$D(P_0 \Vert P_1) \leq \epsilon,$$

where $D$ denotes Kullback–Leibler divergence and $\epsilon$ is a small threshold ensuring the warden's detection probability stays below a fixed false-alarm or detection level $\alpha$. Optimization seeks to maximize legitimate system utility (e.g., covert rate, SNR at R) subject to $D(P_0 \Vert P_1) \leq \epsilon$. In common additive white Gaussian noise (AWGN) CC models, $P_0$ and $P_1$ are Gaussian distributions with variances $\sigma_W^2$ and $\sigma_W^2 + P$, leading to the closed-form divergence

$$D = \frac{1}{2}\left[\log\left(1 + \frac{P}{\sigma_W^2}\right) - \frac{P}{P + \sigma_W^2}\right],$$

and the design maximizes the CC rate

$$R_{\mathrm{covert}} = \frac{1}{2}\log\left(1 + \frac{P}{\sigma_R^2}\right)$$

with the KLD constraint enforced (Liu et al., 26 Jan 2026). Standard throughput maximization is thus inadequate; CC demands rigorous statistical reasoning and multi-step optimization centered on covertness.
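
A minimal numeric sketch of the AWGN model above: since $D$ is monotonically increasing in $P$, the largest admissible covert power under the KLD constraint can be found by bisection. Function names are illustrative, not part of the benchmark.

```python
import math

def kld_awgn(P, sigma_w2):
    """KL divergence D(P0 || P1) between the warden's Gaussian observations
    without a covert signal (variance sigma_w2) and with one (variance
    sigma_w2 + P), per the closed form above."""
    return 0.5 * (math.log(1 + P / sigma_w2) - P / (P + sigma_w2))

def covert_rate(P, sigma_r2):
    """Legitimate-link covert rate 0.5 * log(1 + P / sigma_r2) at R."""
    return 0.5 * math.log(1 + P / sigma_r2)

def max_covert_power(sigma_w2, eps, hi=1e6, iters=100):
    """Largest power P with D(P0 || P1) <= eps.  D is increasing in P
    (dD/dP = 0.5 * P / (P + sigma_w2)^2 >= 0), so bisection suffices."""
    lo = 0.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kld_awgn(mid, sigma_w2) <= eps:
            lo = mid
        else:
            hi = mid
    return lo
```

Maximizing $R_{\mathrm{covert}}$ under the constraint then reduces to evaluating the rate at this boundary power, since the rate is also increasing in $P$.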

2. CovertComBench for LLM Evaluation: Pipeline and Methodology

CovertComBench decomposes the CC pipeline into three rigorous stages, each designed to probe a different axis of machine learning model capability:

A. Conceptual Understanding (MCQs):

Multiple-choice questions (237 items; four distractors each) test the model's grasp of CC concepts, including covertness-rate trade-offs, detection limits, warden strategy adaptation, covert power allocation, and statistical signal distributions. Each MCQ is scored binary (utility $U = 1$ if correct).

B. Optimization Derivation Questions (ODQs):

Open-ended derivations (148 items) require models to perform accurate stepwise mathematical reasoning, such as deriving closed-form optimal covert power, maximizing throughput under KLD constraints, or articulating Lagrangian/hypothesis-testing procedures. Each response is graded on a multi-checkpoint rubric, yielding a process score $S_{\mathrm{proc}} = \sum_k w_k l_k$ over key steps, with an overall score

$$S_{\mathrm{ODQ}} = \lambda S_{\mathrm{proc}} + (1 - \lambda) I_{\mathrm{ans}},$$

where $I_{\mathrm{ans}}$ indicates final-answer correctness and $\lambda$ (typically 0.7) weights process versus endpoint accuracy.
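
The two-term rubric score maps directly to code. In this sketch the checkpoint weights are assumed to be pre-normalized to sum to 1; the source does not spell out the weighting scheme, so that normalization is an assumption.

```python
def odq_score(checkpoint_weights, checkpoint_hits, answer_correct, lam=0.7):
    """Combined ODQ score S = lam * S_proc + (1 - lam) * I_ans, where
    S_proc is the weighted sum of rubric checkpoints the response hit.
    checkpoint_weights are assumed normalized to sum to 1."""
    s_proc = sum(w * int(hit)
                 for w, hit in zip(checkpoint_weights, checkpoint_hits))
    return lam * s_proc + (1 - lam) * int(answer_correct)
```

With $\lambda = 0.7$, a response that hits checkpoints worth 0.8 of the process weight and reaches the correct final answer scores $0.7 \cdot 0.8 + 0.3 = 0.86$.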

C. Code Generation Questions (CGQs):

Programming tasks (132 items) demand translation of CC theory into executable code—tasks include KL divergence calculation, simulation of detection tests, and numerical optimization under constraints. Each code submission is graded automatically with a maximum of three rounds (one initial and up to two error-feedback iterations), receiving 10, 7, 4, or 0 points according to the round in which unit tests are passed.
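
The round-based grading scheme can be sketched as a simple loop; `run_tests` and `attempt_fix` are hypothetical callbacks standing in for the benchmark's unit-test harness and the model under test, not names from the paper.

```python
def grade_cgq(run_tests, attempt_fix, submission, max_rounds=3):
    """Grade a CGQ submission over at most three rounds (one initial
    attempt plus up to two error-feedback repairs), awarding 10, 7, or 4
    points by the round in which the unit tests pass, else 0.

    run_tests(code) -> (passed: bool, error_log: str)
    attempt_fix(code, error_log) -> revised code
    """
    points_by_round = {1: 10, 2: 7, 3: 4}
    code = submission
    for rnd in range(1, max_rounds + 1):
        passed, error_log = run_tests(code)
        if passed:
            return points_by_round[rnd]
        if rnd < max_rounds:
            code = attempt_fix(code, error_log)
    return 0
```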

3. Formalism, Task Design, and Automated Scoring

CovertComBench imposes a rigorously unified formalism across all stages. The canonical CC optimization problem is posed as

$$\text{maximize} \quad U(a \mid x) \quad \text{subject to} \quad \Pr\{\mathrm{detected}(a, x)\} \leq \alpha$$

maps directly to model tasks—whether selecting correct trade-offs in MCQs, performing stepwise derivations in ODQs, or implementing the KLD constraint in CGQs.

Scoring integrates both automatic and expert evaluation. MCQs and CGQs are fully automated via gold-answer and unit-test comparison. ODQs are primarily graded by human experts using detailed rubrics, but CovertComBench explores "LLM-as-Judge" (LAJ) scoring, in which a separate trusted LLM is prompted with the multi-checkpoint rubric to generate a partial credit score $S_{\mathrm{LAJ}}$. The mean absolute error (MAE) between $S_{\mathrm{LAJ}}$ and human grading quantifies the reliability of LLM-based automated assessment, revealing critical limits of automated grading in statistically sensitive domains (Liu et al., 26 Jan 2026).
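
The reliability metric itself is a plain mean absolute error over matched responses; a minimal sketch with an illustrative function name:

```python
def laj_mae(laj_scores, human_scores):
    """Mean absolute error between LLM-as-Judge scores and human rubric
    scores on the same set of ODQ responses (lower = more reliable)."""
    assert len(laj_scores) == len(human_scores) and laj_scores
    return sum(abs(a - b)
               for a, b in zip(laj_scores, human_scores)) / len(laj_scores)
```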

4. Empirical Results and Benchmark Insights

Benchmarking across 14 LLMs (API-based and locally hosted, 7B–671B parameters) demonstrates pronounced performance stratification:

| Task | Top API Model  | Accuracy | Best Local Accuracy | Weakest Model (Accuracy) |
|------|----------------|----------|---------------------|--------------------------|
| MCQ  | OpenAI-o3      | 81.9%    | 64%                 | Wizard-math-70B (16.5%)  |
| ODQ  | OpenAI-o3      | 55.4%    | 44%                 | Wizard-math-70B (18%)    |
| CGQ  | Gemini-3.0-Pro | 83.3%    | 67%                 |                          |

Top-tier API models demonstrate strong conceptual recognition (MCQ F1 up to 0.91) and code synthesis (83.3% one-shot CGQ success), but optimization derivation performance remains limited (18–55% accuracy). Local models show substantial deficits, highlighting gaps in mathematical reasoning for covertness-constrained optimization.

The LLM-as-Judge framework yields process scores differing by 3.2–4.1 points (out of 10) on average from human rubrics, reflecting polarization tendencies and lack of human-like granularity in nuanced assessment.

Key reported issues:

  • LLMs frequently neglect KLD constraints in multistep derivations, instead reverting to unconstrained maximization.
  • Symbolic integrations and statistical expectation steps are inconsistently applied.
  • Code hallucinations persist even after iterative error feedback, often invoking non-existent libraries or misusing function semantics.

A plausible implication is that present LLMs are most reliable as high-level "implementation assistants"—effective at concept identification and code generation—rather than as autonomous solvers for mathematically rigorous, security-constrained optimization required by CC (Liu et al., 26 Jan 2026).

5. CovertComBench for Microarchitectural Channel Assessment

In microarchitectural security, CovertComBench implements an exhaustive benchmarking methodology for processor-side cache-timing covert channels (Deng et al., 2019). The framework:

  • Extends the canonical three-step attack model (Preparation, Trigger, Observe), incorporating both local and remote core access, explicit invalidation (CLFLUSH), read/write at all stages, and both time-sliced and hyperthreaded co-scheduling.
  • Formalizes abstract state transition triples $(S_1 \rightarrow S_2 \rightarrow S_3)$ over 17 cache-line states, yielding 17³ = 4,913 possible transitions. Of these, 88 represent provably distinct "Strong" vulnerabilities: triples where the observed timing distribution unambiguously partitions secret address states.
  • Classifies vulnerabilities into six categories: Internal/External and Address-based/Set-based/Both (I-A, I-S, I-SA, E-A, E-S, E-SA).
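
The transition-space size follows directly from enumerating ordered triples over the 17 abstract states. A minimal sketch; the state names are illustrative placeholders, not the paper's labels:

```python
from itertools import product

# 17 abstract cache-line states from the three-step attack model;
# placeholder names stand in for the paper's actual state labels.
STATES = [f"S{i}" for i in range(17)]

# Every ordered (S1 -> S2 -> S3) triple is a candidate attack pattern.
triples = list(product(STATES, repeat=3))
assert len(triples) == 17 ** 3 == 4913
```

The 88 "Strong" vulnerabilities are the subset of these triples whose timing behavior provably distinguishes secret states; the remainder are either benign or only weakly distinguishing.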

A Python-based generator instantiates 1,094 C benchmarks for all variant implementations, employing statistical tests (Welch's t-test, $p < 0.0005$) on the timing data for distinct secret values. Results are summarized via the Cache Timing Vulnerability Score (CTVS):

$$\mathrm{CTVS} = \frac{V}{N}, \qquad 0 \leq \mathrm{CTVS} \leq 1,$$

where $V$ counts vulnerable abstract attack patterns (out of $N = 88$). Lower CTVS implies better microarchitectural protection.
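
Under these definitions, the score and the per-benchmark distinguishability test can be sketched as follows. `welch_t` computes only the Welch t statistic; converting it to the paper's $p < 0.0005$ cutoff requires a t-distribution CDF (e.g., from scipy), omitted here. Both function names are illustrative.

```python
import math
from statistics import mean, variance

def ctvs(vulnerable_flags, n=88):
    """Cache Timing Vulnerability Score: fraction of the N = 88 strong
    abstract attack patterns whose benchmarks show a distinguishable
    timing difference on the processor under test."""
    assert len(vulnerable_flags) == n
    return sum(1 for flag in vulnerable_flags if flag) / n

def welch_t(xs, ys):
    """Welch's t statistic for two timing samples with unequal
    variances (sample variance in numerator-free Welch form)."""
    vx, vy = variance(xs), variance(ys)
    return (mean(xs) - mean(ys)) / math.sqrt(vx / len(xs) + vy / len(ys))
```

A processor on which, say, 22 of the 88 patterns exhibit distinguishable timings scores CTVS = 22/88 = 0.25.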

6. Recommendations and Future Directions

The current landscape presents clear directions for both LLM- and hardware-related covert security research:

  • LLM Tool Augmentation: Incorporating symbolic computation engines (e.g., SymPy, Mathematica) through function-calling APIs is recommended for LLMs, alleviating deficits in symbolic integration and algebraic rigor critical for constrained wireless optimization.
  • Negative Sample Training: Including plausible incorrect derivations in LLM training data can enhance discriminative reasoning on tasks demanding precise mathematical correctness.
  • Closed-Loop Feedback Agents: Development of autonomous agents capable of structured error-trace parsing and methodical debugging is necessary to overcome persistent code hallucination and misconception.
  • Benchmark Extension: CovertComBench will expand to cover advanced channel models, including IRS-aided and MIMO covert links, as well as next-generation LLMs with improved tool integration and domain-specific pretraining.
  • Microarchitectural Guidance: CTVS categorization allows microarchitecture designers to localize defensive shortcomings—e.g., high external SA scores prompting cache set and address randomization, or persistent internal vulnerabilities indicating required mitigation at the MSHR, store buffer, or writeback logic.

In summary, CovertComBench establishes the first systematic, reproducible, domain-specific testbed both for covertness-constrained wireless LLM evaluation and for exhaustive processor-timing vulnerability assessment. It exposes prevailing system and modeling limitations, setting the stage for more trustworthy, tool-augmented AI and hardware defenses against covert and side-channel attacks (Liu et al., 26 Jan 2026, Deng et al., 2019).
