APTBench Benchmark Suite
- APTBench names two independently introduced benchmark efforts: a library of parametric timed automata for evaluating real-time verification tools, and a suite for measuring the agentic capabilities of base LLMs.
- It features over 80 PTA models grouped into academic, industrial, and challenging unsolvable cases to benchmark verification techniques.
- The suite also transforms multi-turn agent trajectories into single-turn evaluation tasks to measure planning, action, and micro-skills in LLMs with high predictive power.
APTBench denotes two independently introduced benchmark methodologies and datasets that share a name but address different domains: (1) parametric timed model checking in real-time and concurrent systems, and (2) benchmarking the agentic potential of base LLMs during pre-training. This article details both meanings: APTBench as a repository of parametric timed automata (PTA) benchmarks (Étienne, 2018), and APTBench as the agentic-potential benchmark for base LLMs (Qin et al., 28 Oct 2025).
1. APTBench for Parametric Timed Model Checking
APTBench, in the context of real-time verification, is a curated library of parametric timed automata benchmarks designed to evaluate and compare parametric timed model-checking tools and algorithms. It comprises over 80 models grouped into 34 benchmark families, covering 120 verification problems sourced from both academic and industrial domains. The models range from small "toy" PTAs to industrial case studies, and include instances known to be unsolvable by contemporary techniques (Étienne, 2018).
Formalism
Each APTBench model defines a network of PTAs, most following the formalism of Alur, Henzinger & Vardi (1993), with extensions for stopwatches, shared rational variables, and parametric linear expressions as supported by the IMITATOR tool. A PTA is specified as a tuple $(\Sigma, L, \ell_0, X, P, E, I)$, consisting of:
- $\Sigma$: a finite alphabet of synchronization actions,
- $L$: a finite set of locations, with initial location $\ell_0 \in L$,
- $X$: a finite set of real-valued clocks,
- $P$: a finite set of unknown constants (parameters),
- $E$: a set of edges $(\ell, g, a, R, \ell')$, with guards $g$ and clock resets $R \subseteq X$,
- $I$: an invariant assigned to each location.
Guards and invariants are conjunctions of atomic constraints of the form $x \bowtie c$, $x \bowtie p$, or $x \bowtie \sum_i \alpha_i p_i + d$, where $x \in X$, $p, p_i \in P$, $c, \alpha_i, d \in \mathbb{Q}$, and $\bowtie \in \{<, \leq, =, \geq, >\}$. The L/U-PTA discipline, which partitions parameters into lower-bound and upper-bound roles, yields subclasses with improved decidability.
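To make the tuple concrete, the following is a minimal, purely illustrative Python encoding of the components listed above; the class and field names are assumptions chosen for exposition and do not reflect IMITATOR's internal representation or its .imi input syntax.

```python
from dataclasses import dataclass

# Illustrative encoding of a PTA tuple (Sigma, L, l0, X, P, E, I).
# A guard or invariant is a conjunction (tuple) of atomic constraints
# of the form: clock  op  sum(coeff * parameter) + const.

@dataclass(frozen=True)
class Constraint:
    clock: str
    op: str                # one of '<', '<=', '=', '>=', '>'
    param_coeffs: dict     # parameter name -> rational coefficient
    const: float = 0.0

@dataclass(frozen=True)
class Edge:
    source: str            # source location
    guard: tuple           # conjunction of Constraint
    action: str            # synchronization action from Sigma
    resets: frozenset      # clocks reset to 0 on this edge
    target: str            # target location

@dataclass
class PTA:
    actions: set           # Sigma
    locations: set         # L
    initial: str           # l0
    clocks: set            # X
    parameters: set        # P
    edges: list            # E
    invariants: dict       # I: location -> conjunction of Constraint

# A one-clock, one-parameter toy automaton with guard x >= p and a reset of x.
toy = PTA(
    actions={"press"},
    locations={"idle", "active"},
    initial="idle",
    clocks={"x"},
    parameters={"p"},
    edges=[Edge("idle", (Constraint("x", ">=", {"p": 1}),), "press",
                frozenset({"x"}), "active")],
    invariants={"idle": (), "active": ()},
)
```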
Benchmark Categories
APTBench benchmarks are divided as follows:
| Category | Examples/Systems | Characteristics |
|---|---|---|
| Academic | Fischer protocols, CSMA/CD, flip-flop circuits, scheduling | Models up to 16 parameters |
| Industrial | Automotive (Hoxha-Abbas-Fainekos), Thales FMTV, Bluetooth BRP | Unknown/imprecise timing |
| Hard/Unsolvable | 1/n toy, large protocol variants, complex resets | Intractable for current tools |
Each model is provided in the .imi input format for IMITATOR, with properties (reachability, unavoidability, optimization, robustness) defined in separate configuration files.
Toolchain and Integration
APTBench supports direct usage within IMITATOR (recommended), with translation scripts facilitating compatibility with HyTech, PHAVer, SpaceEx, Romeo, and Symrob. Tasks include EF-synthesis, optimization, and reachability analyses.
Solvability Insights
Empirical findings reveal that models with fewer than five parameters/clocks and without advanced features (stopwatches, shared variables) are solvable near-instantaneously. Greater complexity—large process counts, complex parameter interactions, or use of invariants and stopwatches—escalates computational difficulty, driving many instances into timeout or unsolvability categories. L/U-PTA and invariant-free variants are significantly easier for state-of-the-art tools (Étienne, 2018).
2. APTBench: Benchmarking the Agentic Potential of Base LLMs
In contrast, APTBench in the LLM paradigm defines a benchmark suite and methodology to quantitatively measure agentic abilities—planning, action selection, and atomic skill execution—of base (pre-instruction, pre-RL) LLMs. This approach responds to the inadequacy of traditional static benchmarks (MMLU, GSM8K, HumanEval), which underrepresent the Plan-Act-Observe loop crucial to agentic capabilities, and the inability of full agent-centric benchmarks (e.g., SWE-bench Verified, InfoDeepSeek) to operate on “base” models that lack post-training (Qin et al., 28 Oct 2025).
Methodology
APTBench systematically converts real agent multi-turn trajectories—collected from curated human or agent Plan-Action-Feedback logs—into single-turn items suitable for base LLM evaluation. The conversion process comprises:
- Task Collection: Gather and filter only fully resolved agent interaction trajectories, validated via rejection sampling and human checks.
- Item Generation: each trajectory yields items targeting three core abilities:
  - Planning (overall and stepwise),
  - Action (tool invocation, command generation),
  - Atomic abilities (domain micro-skills: bug localization, citation extraction).
  Items are constructed as either:
  - Multiple-Choice Questions (MCQ): the correct answer plus 3–5 distractors drawn from LLM-degraded variants, or
  - Text Completion (TC): the precise next action or answer.
- Answer Generation: distractors are created by syntactic/semantic perturbation, and all answers undergo a lightweight human check (see the conversion sketch below).
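The conversion can be sketched as follows. The trajectory schema (thought/action/observation keys), the helper names, and the prompt layout are hypothetical illustrations of the described procedure rather than the authors' released tooling; the distractor generator is passed in as an opaque callable (e.g., an LLM-based perturbation).

```python
import random

def make_action_mcq(trajectory, step_idx, degrade, n_distractors=3, seed=0):
    """Turn one Plan-Action-Feedback step into a single-turn MCQ item.

    `trajectory` is a list of dicts with 'thought', 'action', 'observation'
    keys; `degrade` produces a plausible-but-wrong variant of an action string.
    """
    rng = random.Random(seed)
    context = trajectory[:step_idx]          # everything the agent saw so far
    gold = trajectory[step_idx]["action"]    # the next action actually taken

    # Single-turn prompt built from the truncated trajectory.
    prompt_lines = ["You are resolving a task. History so far:"]
    for step in context:
        prompt_lines.append(f"THOUGHT: {step['thought']}")
        prompt_lines.append(f"ACTION: {step['action']}")
        prompt_lines.append(f"OBSERVATION: {step['observation']}")
    prompt_lines.append("Which action should be taken next?")

    # Distractors: syntactic/semantic perturbations of the gold action.
    options = [degrade(gold) for _ in range(n_distractors)] + [gold]
    rng.shuffle(options)

    return {"prompt": "\n".join(prompt_lines), "options": options,
            "answer_index": options.index(gold), "format": "MCQ"}

def make_action_tc(trajectory, step_idx):
    """Text-completion variant: the model must produce the next action itself."""
    context = trajectory[:step_idx]
    prompt = "\n".join(
        ["You are resolving a task. History so far:"]
        + [f"THOUGHT: {s['thought']}\nACTION: {s['action']}\nOBSERVATION: {s['observation']}"
           for s in context]
        + ["Produce the next action."]
    )
    return {"prompt": prompt, "reference": trajectory[step_idx]["action"], "format": "TC"}
```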
Task Classes and Examples
APTBench currently addresses two high-level domains:
| Domain | Subtask | Example Task Formulation |
|---|---|---|
| SWE (software)* | Environment Setup, Issue Fixing, Error Handling | Next bash command, debugging plan MCQ |
| Deep Research | Closed-Ended QA, Open-Ended QA, Citation Linking | Multi-hop query planning, select best report MCQ |
*Data sources for SWE tasks range over hundreds of GitHub repositories, high-quality agent logs, and established code/issue-fixing benchmarks.
Evaluation Metrics
Key metrics:
- Accuracy (ACC) for MCQ items: $\mathrm{ACC} = \frac{\#\,\text{correctly answered items}}{\#\,\text{items}}$,
- Exact Match (EM) for rigid completions: $\mathrm{EM} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\!\left[\hat{y}_i = y_i\right]$,
- ROUGE-1 for summarization/comparison: $\mathrm{ROUGE\text{-}1} = \frac{\sum_{u \in \text{overlap}} \mathrm{count}(u)}{\sum_{u \in \text{ref}} \mathrm{count}(u)}$,
- Predictivity by Pearson correlation between APTBench (pre-training) and downstream agent evaluation results.
The suite also supports (though not yet reported in the literature) calibration metrics such as Expected Calibration Error (ECE).
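These metrics are standard; a straightforward reference implementation is sketched below. Whitespace tokenization for ROUGE-1 and the normalization used for EM are assumptions where the source does not pin down details.

```python
from collections import Counter
from math import sqrt

def accuracy(predictions, gold):
    """MCQ accuracy: fraction of items whose chosen option matches the answer key."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

def exact_match(predictions, references):
    """EM for rigid completions: whitespace-normalized strings must be identical."""
    norm = lambda s: " ".join(s.strip().split())
    return sum(norm(p) == norm(r) for p, r in zip(predictions, references)) / len(references)

def rouge1(prediction, reference):
    """ROUGE-1 recall: overlapping unigram count over reference unigram count."""
    pred, ref = Counter(prediction.split()), Counter(reference.split())
    overlap = sum(min(pred[u], ref[u]) for u in ref)
    return overlap / max(sum(ref.values()), 1)

def pearson_r(aptbench_scores, agent_scores):
    """Pearson correlation between APTBench scores and downstream agent results."""
    n = len(aptbench_scores)
    mx = sum(aptbench_scores) / n
    my = sum(agent_scores) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(aptbench_scores, agent_scores))
    sx = sqrt(sum((x - mx) ** 2 for x in aptbench_scores))
    sy = sqrt(sum((y - my) ** 2 for y in agent_scores))
    return cov / (sx * sy)
```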
Empirical Results and Predictivity
Large-scale evaluation over numerous open-source base LLMs (Qwen3, Llama, Seed-OSS, GLM4.5, DeepSeek, Kimi-K2) demonstrates that APTBench performance is strongly predictive of downstream agent capabilities (e.g., on SWE-bench), with substantially higher correlation than traditional static benchmarks such as MMLU. Results further reveal:
- Emergent agentic abilities are observed above the 2–4B parameter scale.
- Agent-oriented pre-training data increases agentic benchmark scores regardless of model size.
- Mid-sized models (~30–40B) can match or surpass much larger models when given targeted data.
Computational and economic advantages over full agent benchmarks are substantial: APTBench requires only single-turn inference and costs roughly $0.5–2 per model, versus $10–1,000 for agent rollouts (Qin et al., 28 Oct 2025).
3. Workflow, Integration, and Usage Recommendations
PTA Model Checking
Recommended workflow (a batch-run sketch follows this list):
- Selecting a desired .imi benchmark from the APTBench library.
- Running IMITATOR with reachability or synthesis flags (e.g., `imitator --ef-synth model.imi property.cfg`).
- Translating models for other analyzers as needed.
- Adjusting parameter bounds or enabling L/U options for tractability.
- Using annotated difficulty metrics to select test cases for tools under development.
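The workflow above can be scripted. The sketch below batch-runs IMITATOR over a local checkout of the library using the command form quoted earlier; the directory layout, the .cfg pairing convention, and the timeout policy are assumptions, and the exact options accepted depend on the installed IMITATOR version.

```python
import subprocess
from pathlib import Path

BENCHMARK_DIR = Path("aptbench-pta")   # hypothetical local checkout of the library
TIMEOUT_S = 300                        # per-instance wall-clock budget

def run_benchmarks():
    """Run each .imi model against its paired property file and record the outcome."""
    results = {}
    for model in sorted(BENCHMARK_DIR.glob("**/*.imi")):
        prop = model.with_suffix(".cfg")          # assumed model/property naming convention
        if not prop.exists():
            continue
        try:
            proc = subprocess.run(
                ["imitator", "--ef-synth", str(model), str(prop)],
                capture_output=True, text=True, timeout=TIMEOUT_S,
            )
            results[model.name] = "ok" if proc.returncode == 0 else "error"
        except subprocess.TimeoutExpired:
            results[model.name] = "timeout"       # matches the benchmark's difficulty annotations
    return results

if __name__ == "__main__":
    for name, status in run_benchmarks().items():
        print(f"{name}: {status}")
```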
LLM Agentic Benchmarking
For LLM development, APTBench is intended for routine model evaluation during large-scale pre-training. Integration workflow involves:
- Applying the suite of MCQ/TC items to the base model using few-shot prompting.
- Averaging raw metrics (ACC, EM, ROUGE) across subdomains (SWE, Deep Research).
- Tracking pre-training improvements, agentic skill emergence, and identifying curriculum/data mix weaknesses for future training.
A suite run requires no environment APIs or container orchestration.
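A minimal sketch of such an evaluation loop follows; `generate(prompt) -> str` stands in for whatever inference stack serves the base model (with any few-shot examples already composed into the prompt), the item schema mirrors the conversion sketch above, and the lettered-answer convention for MCQ items is an assumption.

```python
from collections import defaultdict

LETTERS = "ABCDEFGH"

def render_prompt(item):
    """Render an item into a single prompt string."""
    if item["format"] == "MCQ":
        option_lines = [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(item["options"])]
        return "\n".join([item["prompt"], *option_lines, "Answer with a single letter."])
    return item["prompt"]

def score_item(item, generate):
    """Score one MCQ (accuracy) or TC (exact-match) item."""
    output = generate(render_prompt(item)).strip()
    if item["format"] == "MCQ":
        predicted = LETTERS.find(output[:1].upper()) if output else -1
        return float(predicted == item["answer_index"])
    return float(" ".join(output.split()) == " ".join(item["reference"].split()))

def evaluate(items, generate):
    """Average scores per subdomain (e.g., SWE, Deep Research), then macro-average overall."""
    per_domain = defaultdict(list)
    for item in items:
        per_domain[item.get("domain", "unknown")].append(score_item(item, generate))
    report = {domain: sum(scores) / len(scores) for domain, scores in per_domain.items()}
    report["overall"] = sum(report.values()) / len(report)
    return report
```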
4. Comparative and Predictive Features
| Benchmark Type | Coverage of Plan-Act Loop | Cost per Model Eval | Predictive Power ($r$ w/ Agent Perf.) |
|---|---|---|---|
| General static benchmarks | No | $0.1–1 | $\sim 0.1$ |
| Full agent-centric benchmarks | Yes | $10–1,000 | $\gg 0.8$ |
| APTBench | Yes (single-turn proxy) | $0.5–2 | $\sim 0.9$ |
APTBench thus achieves a unique balance between practical efficiency and diagnostic power, facilitating frequent checks and guiding data or architecture selection in both verification and agentic LLM workflows (Étienne, 2018, Qin et al., 28 Oct 2025).
5. Extensions and Open Challenges
- Model Checking: Expanding the set of L/U-PTA models, modeling hybrid systems, and improving tool scalability on high-parameter instances remain open avenues.
- Agentic LLM Benchmarking: Prospective directions include automated task synthesis, incorporation of calibration metrics (ECE, Brier scores), adaptive training curricula based on feedback, and multi-agent scenario extensions.
- Cross-Domain Synthesis: There is potential to further automate the generation of agent trajectories and to enrich PTA benchmarks with instances drawn from novel application domains or unexplored parametrizations.
This coverage is grounded exclusively in the published definitions and reported methods from (Étienne, 2018) and (Qin et al., 28 Oct 2025).