ParaBench: Unified Benchmark Suites
- ParaBench is a comprehensive collection of benchmark suites covering multimodal generative modeling, parametric timed automata, and parity games, enabling transparent and reproducible evaluation.
- It emphasizes methodological rigor by incorporating diverse case studies, detailed diagnostic metrics, and standardized evaluation protocols across academic and industrial applications.
- Designed to address error propagation and represent complex unsolvable cases, ParaBench facilitates fair comparisons and drives advances in algorithmic verification and synthesis.
ParaBench refers to multiple benchmark suites in computer science, serving as critical infrastructure for comparative evaluation in diverse areas such as multimodal generative modeling, parametric timed automata, and parity games. Notably, the term is used for: (1) a unified multimodal benchmark for reasoning-aware generation and editing (“MMaDA-Parallel” paradigm), (2) a comprehensive library of benchmarks for parametric timed model checking, and (3) an extensive suite for parity games used in formal verification, equivalence checking, and algorithm analysis. Each instantiation of ParaBench is designed with methodological rigor for transparent, reproducible, and representative evaluation of algorithms and tools.
1. Motivation and Benchmark Design Principles
ParaBench variants address deficiencies in existing benchmarks by emphasizing coverage, diversity, and diagnostic depth. In multimodal generation, the motivation is to expose and quantify the error-propagation failure mode in sequential, chain-of-thought pipelines—wherein poor or vague reasoning traces degrade the fidelity of the final output. In parametric timed verification, the objective is to standardize tool evaluation across academic, industrial, and "unsolvable" models, enabling fair comparison and methodological progress. For parity games, ParaBench integrates all published benchmarks and injects synthetic hard cases, ensuring representativity and practical relevance (Tian et al., 12 Nov 2025, André et al., 2021, Keiren, 2014, Étienne, 2018).
2. Dataset Construction and Benchmarks
2.1 Multimodal Reasoning Benchmark (ParaBench)
- Composition: 300 evaluation-only prompts: 200 image-editing (5 balanced categories—Spatial, Temporal, Causal, World-Knowledge, General) and 100 open-ended text/image generation prompts.
- Sources: Editing categories are mined from KRIS-Bench, Envisioning, Geneval, and curated ShareGPT4o samples. Generation prompts emphasize multi-object, compositional, and abstract instructions (Tian et al., 12 Nov 2025).
- Modality: Each sample involves joint output—a reasoning trace (text) and a final image. Editing tasks require both an input image and a text instruction.
2.2 Parametric Timed Automata Benchmark (ParaBench)
- Scale: In current releases, up to 56 benchmarks, 119 models, and 216 property-model pairs, grouped by logical "benchmark families" (e.g., Fischer, Gear, BRP, FMTV), often with parametric scaling (number of processes/gears, etc.).
- Categories: Models are classified into overlapping categories such as Academic (45%), Automotive (17%), Industrial (28%), Real-Time System (39%), Protocol (29%), Toy (29%), and Unsolvable (15%).
- Features: Includes support for liveness, reachability, stopwatches, multi-rate clocks, global rational-valued discrete variables, urgent locations, and ε-transitions.
- Unsolvable Corner Cases: Contains 18 models explicitly constructed to lie outside the scope of existing synthesis algorithms, e.g., models whose parametric solution sets cannot be expressed as finite unions of convex polyhedra over the parameters (André et al., 2021, Étienne, 2018).
2.3 Parity Game Benchmark (ParaBench)
- Scale: 1,037 games across four major classes: model checking encodings, equivalence checking, decision-procedure reductions, and hard/randomly generated instances.
- Graph Metrics: Vertex counts range from as few as 2 up to several orders of magnitude larger, with comprehensive reporting of edge counts, alternation depths, SCC structure, diameters, and clustering.
- Generation: Produced via mCRL2 (for LTS and µ-calculus encodings), MLSolver, and PGSolver (for random and adversarial games) (Keiren, 2014).
3. Tasks, Evaluation Methodology, and Formalisms
3.1 Multimodal Generation/Evaluation
- Task Definition: Given a query Q (a prompt, or a prompt plus an input image), a model outputs a reasoning trace T and a final image I.
- Metrics: Six axes—Text Quality (TQ), Text Alignment (TA), Image Quality (IQ), Image Alignment (IA), Image Consistency (IC; editing only), and Output Alignment (OA, cross-modal).
- OA is computed with both a CLIP-based proxy and GPT-4.1 judge scores (a Python sketch of the proxy appears at the end of this subsection).
- Judging: All scores are reference-free and assigned by GPT-4.1 over all 300 prompts. Cross-modal correlations (OA vs. IA, TA vs. OA) are used diagnostically, with observed Pearson r and Spearman ρ both exceeding 0.7 (p < 0.01).
| Metric | Symbol | Evaluator | Modality Pair |
|---|---|---|---|
| Text Q. | TQ | GPT-4.1 | – |
| Text Align | TA(Q,T) | GPT-4.1 | Q_text → T_text |
| Img. Q. | IQ | GPT-4.1 | – |
| Img. Align | IA(Q,I) | GPT-4.1 | Q_text → I_img |
| Img. Cons. | IC(I₁,I₂) | GPT-4.1 | I_in → I_out (editing) |
| Output Al. | OA(T,I) | GPT-4.1, CLIP | T_text → I_img |
- Analysis: Sequential models show degraded performance on spatial and causal edits when chain-of-thought is enforced (e.g., Bagel's OA/IA scores drop on these tasks), directly linking reasoning-trace quality to output visual quality.
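The CLIP-based OA proxy and the correlation diagnostics above can be illustrated with a short Python sketch. The checkpoint choice, the cosine-similarity formulation, and the truncation of long reasoning traces to CLIP's token limit are assumptions made here for illustration, not necessarily the benchmark's exact implementation.

```python
# Hedged sketch: a CLIP-based Output Alignment (OA) proxy plus the
# Pearson/Spearman diagnostics used to compare metric axes.
# Assumptions (illustrative only): the CLIP checkpoint, cosine similarity,
# and truncation of long reasoning traces to CLIP's 77-token limit.
import torch
from PIL import Image
from scipy.stats import pearsonr, spearmanr
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_oa_proxy(reasoning_trace: str, image_path: str) -> float:
    """Cosine similarity between the reasoning trace and the final image."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[reasoning_trace], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    return torch.nn.functional.cosine_similarity(text_emb, img_emb).item()

def cross_modal_diagnostics(oa_scores, ia_scores):
    """Correlate OA with IA across prompts (the paper reports values > 0.7)."""
    r, p_r = pearsonr(oa_scores, ia_scores)
    rho, p_rho = spearmanr(oa_scores, ia_scores)
    return {"pearson_r": r, "pearson_p": p_r,
            "spearman_rho": rho, "spearman_p": p_rho}
```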
3.2 Parametric Timed Automata
- Model Formalism: Each benchmark specifies PTAs (clocks, parameters, discrete variables, locations, invariants, flow rates) and properties (safety/reachability: EF/AG, liveness: cycles, deadlock-freedom).
- Tooling: Benchmarks are distributed in IMITATOR .imi/.imiprop format. Automated batch scripts and result summaries record computation time, symbolic-state counts, and synthesized constraints (a batch-runner sketch in Python follows this subsection).
- Coverage: Benchmarks stress classical decidable fragments (L/U-PTAs), expressiveness features (stopwatches, multi-rate clocks, global variables), and symbolic-state-space scaling.
| Category | #Models | % of Library |
|---|---|---|
| Academic | 54 | 45% |
| Automotive | 20 | 17% |
| Industrial | 33 | 28% |
| Toy | 34 | 29% |
| Unsolvable | 18 | 15% |
- Performance Benchmarks: Median time for successful runs is ≈ 2.82 s, but variance is significant, with outliers on larger or unsolvable problems.
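For concreteness, a minimal Python batch runner in the spirit of the library's batch scripts could look as follows. The imitator command-line invocation, the model/property file-pairing convention, and the timeout value are assumptions for illustration; the repository's own scripts remain the authoritative workflow.

```python
# Hedged sketch: batch-running IMITATOR over .imi/.imiprop pairs and
# recording wall-clock times, mimicking the library's batch scripts.
# Assumptions: `imitator model.imi property.imiprop` invocation and a
# layout where model and property files share a filename stem.
import csv
import subprocess
import time
from pathlib import Path

BENCH_DIR = Path("ParaBench")   # hypothetical local checkout of the library
TIMEOUT_S = 300                 # per-run timeout, an arbitrary choice

def run_imitator(model: Path, prop: Path) -> dict:
    start = time.perf_counter()
    try:
        proc = subprocess.run(["imitator", str(model), str(prop)],
                              capture_output=True, text=True, timeout=TIMEOUT_S)
        status = "ok" if proc.returncode == 0 else f"exit {proc.returncode}"
    except subprocess.TimeoutExpired:
        status = "timeout"
    return {"model": model.name, "property": prop.name,
            "status": status, "time_s": round(time.perf_counter() - start, 2)}

results = []
for model in sorted(BENCH_DIR.rglob("*.imi")):
    for prop in sorted(model.parent.glob(model.stem + "*.imiprop")):
        results.append(run_imitator(model, prop))

with open("parabench_pta_times.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=["model", "property", "status", "time_s"])
    writer.writeheader()
    writer.writerows(results)
```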
3.3 Parity Game Suite
- Formalism: a parity game is a finite directed graph whose vertices are partitioned between players Even and Odd and labeled with natural-number priorities; priorities and ownership provide the setting for model checking, equivalence algorithms, and game-solving.
- Metrics and Analysis: Precomputed statistics include degree distributions, SCCs, diameters, alternation depths, and local clustering. Alternation depth is highly diagnostic of instance hardness.
- Reproducibility: All games are in PGSolver plaintext format with recommended solver pipelines; repositories offer bzip2-compressed downloads and tools for metadata extraction.
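A minimal Python sketch of loading a game in PGSolver plaintext format and computing a few of these statistics is shown below; the helper names and the particular subset of statistics are illustrative choices.

```python
# Hedged sketch: parse a PGSolver-format parity game and compute simple
# structural statistics (vertex/edge counts, priority range, ownership
# split, out-degree distribution). Assumes the standard line format:
#   <id> <priority> <owner> <succ1>,<succ2>,... ["name"];
from collections import Counter

def parse_pgsolver(path: str):
    vertices = {}  # id -> (priority, owner, successors)
    with open(path) as fh:
        for line in fh:
            line = line.strip().rstrip(";")
            if not line or line.startswith("parity"):
                continue  # skip blank lines and the optional "parity <n>;" header
            if '"' in line:
                line = line.split('"')[0].strip()  # drop optional vertex name
            ident, priority, owner, succs = line.split(None, 3)
            vertices[int(ident)] = (int(priority), int(owner),
                                    [int(s) for s in succs.split(",")])
    return vertices

def statistics(vertices):
    priorities = [p for p, _, _ in vertices.values()]
    owners = Counter(o for _, o, _ in vertices.values())
    out_degrees = Counter(len(s) for _, _, s in vertices.values())
    return {
        "vertices": len(vertices),
        "edges": sum(len(s) for _, _, s in vertices.values()),
        "distinct_priorities": len(set(priorities)),
        "max_priority": max(priorities),
        "owned_by_even": owners.get(0, 0),
        "owned_by_odd": owners.get(1, 0),
        "out_degree_distribution": dict(sorted(out_degrees.items())),
    }

# Example usage: print(statistics(parse_pgsolver("game.pg")))
```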
4. Repository Organization, Access, and Usage
- Parametric Timed Automata: Hosted at https://data.loria.fr/ParaBench/ with subdirectories for IMITATOR and JANI models, result files, and theoretical constraints for unsolvable cases.
- Parity Games: GitHub repository at https://github.com/jkeiren/paritygame-generator, organizing games by class and scale, with all tools and scripts for loading, solving, and collecting statistics.
- Multimodal Generation: Open-sourced benchmark and code at https://github.com/tyfeld/MMaDA-Parallel, prompts and evaluation routines fully available.
| Suite | Format | Tool Support | Public Access |
|---|---|---|---|
| PTA/IMITATOR | .imi/.imiprop | IMITATOR, batch scripts | https://data.loria.fr/ParaBench/ |
| Parity Games | .pg (PGSolver) | PGSolver, mCRL2, MLSolver | https://github.com/jkeiren/paritygame-generator |
| Multimodal | Prompt, txt, img | Python, torch, CLIP | https://github.com/tyfeld/MMaDA-Parallel |
- Usage Recommendations: For fair evaluation, researchers should include diverse class representatives, record CPU/memory usage, and report scatter plots of performance metrics versus instance structure.
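A hedged Python sketch of this reporting practice follows; the run_and_measure and scatter_performance helpers are hypothetical names, and the resource-based accounting applies only to Unix-like systems.

```python
# Hedged sketch of the recommended reporting practice: record CPU time and
# peak memory per solver run, then plot performance against an
# instance-structure metric (any structural statistic can serve as the x-axis).
import resource
import subprocess
import matplotlib.pyplot as plt

def run_and_measure(cmd: list[str]) -> dict:
    """Run a tool invocation and report children's CPU time and peak RSS."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    subprocess.run(cmd, capture_output=True)
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return {
        "cpu_s": (after.ru_utime + after.ru_stime)
                 - (before.ru_utime + before.ru_stime),
        "max_rss": after.ru_maxrss,  # kilobytes on Linux, bytes on macOS
    }

def scatter_performance(instance_sizes, cpu_times, out_path="perf_scatter.png"):
    """Scatter plot of CPU time versus instance size on log-log axes."""
    plt.figure()
    plt.scatter(instance_sizes, cpu_times)
    plt.xscale("log")
    plt.yscale("log")
    plt.xlabel("instance size (e.g., vertices or symbolic states)")
    plt.ylabel("CPU time (s)")
    plt.title("Performance vs. instance structure")
    plt.savefig(out_path, dpi=150)
```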
5. Limitations and Open Challenges
- PTA Benchmarks: No support for general hybrid automata (clock rates are constant). Several industrial models fall outside parameter-decidable subclasses. The subset of "unsolvable" instances is limited, calling for broader community contributions (André et al., 2021).
- Parity Game Suite: Focused primarily on games relevant to formal verification; while structurally broad, it may not capture all emerging application settings. Some synthetic hard cases reach high alternation depth, which may not reflect typical practical workloads (Keiren, 2014).
- Multimodal Reasoning: Current ParaBench is confined to text and images. The authors recommend extension to video/audio and richer process-level reward models (Tian et al., 12 Nov 2025).
- Tooling: Bidirectional conversion between formats (e.g., JANI ↔ IMITATOR) remains manual or incomplete in some suites.
6. Impact, Insights, and Research Directions
ParaBench benchmarks have contributed foundationally to the empirical evaluation of model checking, synthesis, and generative models:
- Diagnostics: By explicitly measuring not just outcome quality but intermediate reasoning (as in multimodal ParaBench), researchers can directly attribute degradation to specific pipeline weaknesses.
- Algorithmic Development: "Unsolvable" cases in PTA benchmarks act as stress-tests, promoting advances in parameter synthesis algorithms (e.g., beyond finite-convex-polyhedra solution sets).
- Cross-Domain Applicability: The breadth of instance types (protocols, scheduling, process networks, creative generations) supports fair comparison of tools with diverse internal architectures.
- Reproducibility: Unified formats, exhaustive metadata, and recommended evaluation practice enable rigorous replication and benchmarking.
- Research Directions: Extending ParaBench to more modalities (video, audio), enhancing stepwise alignment metrics, expanding hard-case libraries, and organizing competitive evaluation events are all identified as next steps.
ParaBench, across each subdomain, serves as a reference framework enabling both robust tool development and deep algorithmic analysis (Tian et al., 12 Nov 2025, André et al., 2021, Keiren, 2014, Étienne, 2018).