
APTBench Benchmark Suite

Updated 23 December 2025
  • APTBench is a name shared by two independently introduced benchmarks: a library of parametric timed automata (PTA) models for real-time verification, and a suite for measuring the agentic capabilities of base LLMs.
  • The PTA library features over 80 models, spanning academic, industrial, and challenging unsolvable cases, for benchmarking parametric timed model-checking techniques.
  • The LLM suite transforms multi-turn agent trajectories into single-turn evaluation tasks that measure planning, action, and micro-skills in base models with high predictive power for downstream agent performance.

APTBench denotes two benchmark methodologies and datasets, independently introduced in the literature for distinct domains: (1) parametric timed model checking in real-time and concurrent systems, and (2) benchmarking the agentic potential of base LLMs during pre-training. This article details both meanings: APTBench as a repository of parametric timed automata (PTA) benchmarks (Étienne, 2018), and APTBench as the agentic-potential benchmark for base LLMs (Qin et al., 28 Oct 2025).

1. APTBench for Parametric Timed Model Checking

APTBench, in the context of real-time verification, is a curated library of parametric timed automata benchmarks, designed to evaluate and compare parametric timed model-checking tools and algorithms. It comprises over 80 models, grouped into 34 benchmark families and covering 120 verification problems, sourced from both academic and industrial domains. These range from small "toy" PTAs to industrial case studies and instances known to be unsolvable by contemporary techniques (Étienne, 2018).

Formalism

Each APTBench model defines a network of PTAs, most following the specification of Alur, Henzinger & Vardi (1993), with extensions for stopwatches, shared rational variables, and parametric linear expressions as supported by the IMITATOR tool. A PTA is specified as a tuple $(\Sigma, Q, q_0, X, P, E, \mathrm{Inv})$, consisting of:

  • $\Sigma$: finite alphabet of synchronization actions,
  • $Q$: finite set of locations, with initial location $q_0$,
  • $X$: real-valued clocks ($\forall x: \frac{dx}{dt} = 1$),
  • $P$: finite set of unknown constants (parameters),
  • $E \subseteq Q \times G \times \Sigma \times 2^X \times Q$: set of edges, with guards $g$ and resets $r$,
  • $\mathrm{Inv}: Q \to G$: invariants per location.

Guards and invariants use atomic constraints $x \triangleright c$, $x \triangleright p$, or $x - y \triangleright p$, where $x, y \in X$, $c \in \mathbb{Q}$, $p \in P$, and $\triangleright \in \{<, \leq, =, \geq, >\}$. The L/U-PTA discipline, which partitions parameters into lower-bound or upper-bound roles, yields subclasses with improved decidability.
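
To make the formalism concrete, the following Python sketch encodes the PTA tuple and the L/U parameter classification as plain data structures. All names (Constraint, Edge, PTA, lu_partition) are illustrative and do not correspond to IMITATOR's internal representation or the .imi format.

```python
# Illustrative only: a PTA (Sigma, Q, q0, X, P, E, Inv) as plain Python data,
# plus a helper that classifies parameters into lower-/upper-bound roles.
from dataclasses import dataclass
from fractions import Fraction
from typing import FrozenSet, List, Optional, Tuple

@dataclass(frozen=True)
class Constraint:
    """Atomic constraint x ▷ c, x ▷ p, or x - y ▷ p."""
    clock: str                        # x ∈ X
    minus_clock: Optional[str]        # y ∈ X for diagonal constraints, else None
    op: str                           # one of '<', '<=', '=', '>=', '>'
    constant: Optional[Fraction]      # c ∈ Q, if the bound is a rational constant
    parameter: Optional[str]          # p ∈ P, if the bound is a parameter

@dataclass(frozen=True)
class Edge:
    source: str                       # location in Q
    guard: Tuple[Constraint, ...]     # conjunction of atomic constraints
    action: str                       # synchronization action in Sigma
    resets: FrozenSet[str]            # clocks reset to 0
    target: str                       # location in Q

@dataclass
class PTA:
    actions: set                      # Sigma
    locations: set                    # Q
    initial: str                      # q0
    clocks: set                       # X (all clocks evolve at rate dx/dt = 1)
    parameters: set                   # P
    edges: List[Edge]                 # E
    invariants: dict                  # Inv: Q -> tuple of Constraint

def lu_partition(pta: PTA):
    """Split parameters into lower-bound, upper-bound, and mixed roles,
    based on the comparison operators they appear with in guards/invariants."""
    lower, upper = set(), set()
    constraints = [c for e in pta.edges for c in e.guard]
    constraints += [c for cs in pta.invariants.values() for c in cs]
    for c in constraints:
        if c.parameter is None:
            continue
        if c.op in ('<', '<='):       # parameter bounds the clock from above
            upper.add(c.parameter)
        elif c.op in ('>', '>='):     # parameter bounds the clock from below
            lower.add(c.parameter)
        else:                         # equality puts the parameter in both roles
            lower.add(c.parameter)
            upper.add(c.parameter)
    mixed = lower & upper             # a non-empty mixed set breaks the L/U discipline
    return lower - mixed, upper - mixed, mixed
```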

Benchmark Categories

APTBench benchmarks are divided as follows:

| Category | Examples/Systems | Properties |
| --- | --- | --- |
| Academic | Fischer protocols, CSMA/CD, flip-flop circuits, scheduling | Models with up to 16 parameters |
| Industrial | Automotive (Hoxha-Abbas-Fainekos), Thales FMTV, Bluetooth BRP | Unknown/imprecise timing |
| Hard/Unsolvable | 1/n toy, large protocol variants, complex resets | Intractable for current tools |

Each model is provided in the .imi input format for IMITATOR, with properties (reachability, unavoidability, optimization, robustness) defined in separate configuration files.

Toolchain and Integration

APTBench supports direct usage within IMITATOR (recommended), with translation scripts facilitating compatibility with HyTech, PHAVer, SpaceEx, Romeo, and Symrob. Tasks include EF-synthesis, optimization, and reachability analyses.

Solvability Insights

Empirical findings reveal that models with fewer than five parameters/clocks and without advanced features (stopwatches, shared variables) are solvable near-instantaneously. Greater complexity—large process counts, complex parameter interactions, or use of invariants and stopwatches—escalates computational difficulty, driving many instances into timeout or unsolvability categories. L/U-PTA and invariant-free variants are significantly easier for state-of-the-art tools (Étienne, 2018).
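
As a rough illustration of these findings, a triage heuristic might look like the sketch below; the feature flags and thresholds paraphrase the observations above and are not part of any APTBench metadata format.

```python
# Rough triage heuristic paraphrasing the empirical findings; not an APTBench API.
def triage_benchmark(n_params: int, n_clocks: int,
                     has_stopwatches: bool, has_shared_vars: bool,
                     has_invariants: bool, is_lu: bool) -> str:
    """Return a coarse expected-difficulty label for a PTA benchmark instance."""
    if n_params < 5 and n_clocks < 5 and not (has_stopwatches or has_shared_vars):
        return "fast"        # reported as near-instantaneous for current tools
    if is_lu or not has_invariants:
        return "moderate"    # L/U or invariant-free variants are markedly easier
    return "hard"            # candidate for the timeout/unsolvable categories
```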

2. APTBench: Benchmarking the Agentic Potential of Base LLMs

In contrast, APTBench in the LLM paradigm defines a benchmark suite and methodology to quantitatively measure agentic abilities—planning, action selection, and atomic skill execution—of base (pre-instruction, pre-RL) LLMs. This approach responds to the inadequacy of traditional static benchmarks (MMLU, GSM8K, HumanEval), which underrepresent the Plan-Act-Observe loop crucial to agentic capabilities, and the inability of full agent-centric benchmarks (e.g., SWE-bench Verified, InfoDeepSeek) to operate on “base” models that lack post-training (Qin et al., 28 Oct 2025).

Methodology

APTBench systematically converts real multi-turn agent trajectories, collected from curated human or agent Plan-Action-Feedback logs, into single-turn items suitable for base-LLM evaluation. The conversion process, sketched in code after this list, comprises:

  1. Task Collection: Gather and filter only fully resolved agent interaction trajectories, validated via rejection sampling and human checks.
  2. Item Generation:
    • Core abilities per trajectory:
      • Planning (overall and stepwise),
      • Action (tool invocation, command generation),
      • Atomic Abilities (domain micro-skills: bug localization, citation extraction).
    • Items constructed as:
      • Multiple Choice Questions (MCQ) (correct answer + 3–5 distractors from LLM-degraded variants),
      • Text Completion (TC) (precise next action or answer).
  3. Answer Generation: Distractors created by syntactic/semantic perturbations, all answers subject to lightweight human check.
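
As referenced above, the conversion can be pictured as wrapping each step of a resolved trajectory into a single-turn item. The data shapes and helper functions below are hypothetical; the actual pipeline uses LLM-degraded distractor generation followed by human verification.

```python
# Schematic sketch of trajectory-to-item conversion; data shapes are hypothetical.
from dataclasses import dataclass
import random

@dataclass
class TrajectoryStep:
    context: str        # task description plus prior Plan-Action-Feedback turns
    gold_action: str    # the action the successful agent actually took next
    ability: str        # "planning", "action", or an atomic-skill tag

@dataclass
class Item:
    prompt: str
    choices: list       # shuffled options (MCQ only)
    answer: str         # correct letter (MCQ) or gold action text (TC)
    ability: str
    fmt: str            # "MCQ" or "TC"

def make_mcq(step: TrajectoryStep, distractors: list, rng=random) -> Item:
    """Wrap one trajectory step as a single-turn multiple-choice item.
    `distractors` should hold 3-5 degraded variants of the gold action."""
    options = [step.gold_action] + list(distractors)
    rng.shuffle(options)
    letters = "ABCDEF"[:len(options)]
    prompt = (step.context
              + "\n\nWhich action should the agent take next?\n"
              + "\n".join(f"({l}) {o}" for l, o in zip(letters, options)))
    return Item(prompt, options, letters[options.index(step.gold_action)],
                step.ability, "MCQ")

def make_tc(step: TrajectoryStep) -> Item:
    """Text-completion variant: the model must produce the exact next action."""
    return Item(step.context + "\n\nNext action:", [], step.gold_action,
                step.ability, "TC")
```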

Task Classes and Examples

APTBench currently addresses two high-level domains:

| Domain | Subtask | Example Task Formulation |
| --- | --- | --- |
| SWE (software)* | Environment Setup, Issue Fixing, Error Handling | Next bash command, debugging-plan MCQ |
| Deep Research | Closed-Ended QA, Open-Ended QA, Citation Linking | Multi-hop query planning, select-best-report MCQ |

*Data sources for SWE tasks range over hundreds of GitHub repositories, high-quality agent logs, and established code/issue-fixing benchmarks.

Evaluation Metrics

Key metrics (a computational sketch follows this list):

  • Accuracy (ACC) for MCQ: $\mathrm{ACC} = \frac{1}{N}\sum_{i=1}^N \mathbb{1}(\hat{y}_i = y_i)$,
  • Exact Match (EM) for rigid completions: $\mathrm{EM} = \frac{1}{N}\sum_{i=1}^N \mathbb{1}(\mathrm{output}_i \equiv \mathrm{gold}_i)$,
  • ROUGE-1 for summarization/comparison: $\mathrm{ROUGE}\text{-}1 = \frac{\sum_{u \in \text{overlap}} \text{count}(u)}{\sum_{u \in \text{ref}} \text{count}(u)}$,
  • Predictivity, measured as the Pearson correlation $\rho(X, Y)$ between APTBench (pre-training) scores and downstream agent evaluation results.
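
A minimal computational sketch of these metrics is shown below (assuming NumPy); this is not the official APTBench scoring code, and normalization details such as whitespace stripping in EM are assumptions.

```python
# Minimal metric sketch; not the official APTBench scoring implementation.
import numpy as np

def accuracy(preds, golds):
    """MCQ accuracy: fraction of items whose predicted choice equals the gold choice."""
    return float(np.mean([p == g for p, g in zip(preds, golds)]))

def exact_match(outputs, golds):
    """EM for rigid completions; whitespace stripping is an assumed normalization."""
    return float(np.mean([o.strip() == g.strip() for o, g in zip(outputs, golds)]))

def rouge1_recall(candidate: str, reference: str) -> float:
    """Unigram overlap with the reference, normalized by reference length."""
    cand, ref = candidate.split(), reference.split()
    overlap = sum(min(cand.count(u), ref.count(u)) for u in set(ref))
    return overlap / max(len(ref), 1)

def predictivity(aptbench_scores, downstream_scores) -> float:
    """Pearson correlation between APTBench scores and downstream agent results."""
    return float(np.corrcoef(aptbench_scores, downstream_scores)[0, 1])
```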

The suite also supports (though not yet reported in the literature) calibration metrics such as Expected Calibration Error (ECE).

Empirical Results and Predictivity

Large-scale evaluation over numerous open-source base LLMs (Qwen3, Llama, Seed-OSS, GLM4.5, DeepSeek, Kimi-K2) demonstrates that APTBench performance is strongly predictive of downstream agent capabilities (e.g., $\rho \approx 0.93$ on SWE-bench), greatly exceeding traditional benchmarks ($\rho \approx 0.12$ for MMLU). Results further reveal:

  • Emergent agentic abilities are observed above 2–4B parameter scale.
  • Agent-oriented pre-training data increases agentic benchmark scores regardless of size.
  • Mid-sized models (~30–40B) can match or surpass much larger models if provided targeted data.

Computational and economic advantages over full agent benchmarks are substantial; APTBench requires only single-turn inference and costs roughly $0.5–2 per model, versus $10–1,000 for full agent rollouts (Qin et al., 28 Oct 2025).

3. Workflow, Integration, and Usage Recommendations

PTA Model Checking

The recommended workflow, with a batch-run sketch after this list, includes:

  • Selecting a desired .imi benchmark from the APTBench library.
  • Running IMITATOR with reachability or synthesis flags (e.g., imitator --ef-synth model.imi property.cfg).
  • Translating models for other analyzers as needed.
  • Adjusting parameter bounds or enabling L/U options for tractability.
  • Using annotated difficulty metrics to select test cases for tools under development.
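
As referenced above, a batch run over the library can be scripted. The sketch below assumes a local directory layout, pairs each model with a same-named property file, and echoes the command line quoted above; flags and file conventions may differ across IMITATOR versions.

```python
# Batch-run sketch; paths, the .imi/.cfg pairing, and the flags are assumptions
# that mirror the example command above and may vary by IMITATOR version.
import glob
import subprocess

TIMEOUT_S = 300  # illustrative per-instance time budget

for model in sorted(glob.glob("aptbench/**/*.imi", recursive=True)):
    prop = model.replace(".imi", ".cfg")  # assumed pairing of model and property files
    try:
        result = subprocess.run(
            ["imitator", "--ef-synth", model, prop],  # command line as quoted above
            capture_output=True, text=True, timeout=TIMEOUT_S)
        status = "ok" if result.returncode == 0 else f"exit {result.returncode}"
    except subprocess.TimeoutExpired:
        status = "timeout"
    print(f"{model}: {status}")
```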

LLM Agentic Benchmarking

For LLM development, APTBench is intended for routine model evaluation during large-scale pre-training. The integration workflow, with an aggregation sketch at the end of this subsection, involves:

  • Applying the suite of MCQ/TC items to the base model using few-shot prompting.
  • Averaging raw metrics (ACC, EM, ROUGE) across subdomains (SWE, Deep Research).
  • Tracking pre-training improvements, agentic skill emergence, and identifying curriculum/data mix weaknesses for future training.

A suite run requires no environment APIs or container orchestration.
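
The aggregation step can be pictured as below; the exact weighting across subdomains is not specified here, so an unweighted mean is assumed.

```python
# Aggregation sketch: per-subdomain means, then an unweighted mean across
# subdomains (the actual APTBench weighting is an assumption here).
from collections import defaultdict
from statistics import mean

def aptbench_score(item_results):
    """item_results: iterable of (subdomain, metric_value) pairs, e.g.
    ("SWE", 1.0) for a correct MCQ or ("DeepResearch", 0.42) for a ROUGE-1 score."""
    per_domain = defaultdict(list)
    for domain, value in item_results:
        per_domain[domain].append(value)
    domain_means = {d: mean(vals) for d, vals in per_domain.items()}
    return mean(domain_means.values()), domain_means

# Example: one aggregate score per pre-training checkpoint.
# score, by_domain = aptbench_score([("SWE", 1.0), ("SWE", 0.0), ("DeepResearch", 0.7)])
```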

4. Comparative and Predictive Features

| Benchmark Type | Coverage of Plan-Act Loop | Cost per Model Eval | Predictive Power ($\rho$ w/ Agent Perf.) |
| --- | --- | --- | --- |
| General static benchmarks | No | $0.1–1 | ~0.1 |
| Full agentic evaluations | Yes (multi-turn) | $10–1,000 | ≫0.8 |
| **APTBench** | Partial (single-turn, agent-driven MCQ/TC) | $0.5–2 | ~0.9 |

APTBench thus achieves a unique balance between practical efficiency and diagnostic power, facilitating frequent checks and guiding data or architecture selection in both verification and agentic LLM workflows (Étienne, 2018; Qin et al., 28 Oct 2025).

5. Extensions and Open Challenges

  • Model Checking: Expanding the set of L/U-PTA models, modeling hybrid systems, and improving tool scalability on high-parameter instances remain open avenues.
  • Agentic LLM Benchmarking: Prospective directions include automated task synthesis, incorporation of calibration metrics (ECE, Brier scores), adaptive training curricula based on feedback, and multi-agent scenario extensions.
  • Cross-Domain Synthesis: There is potential to further automate the generation of agent trajectories, and to enrich PTA benchmarks with instances drawn from novel application domains or unexplored parametrizations.

This coverage is grounded exclusively in the published definitions and reported methods from (Étienne, 2018) and (Qin et al., 28 Oct 2025).
