
PS-Bench: Multi-Domain Benchmarking

Updated 1 February 2026
  • PS-Bench is a collection of benchmark initiatives in computational biology, robotics, and program synthesis that standardize performance evaluation with quantitative metrics.
  • It employs rigorous methodologies, including protein model quality metrics, multibody kinematics for exoskeletons, and diverse challenges in program synthesis, ensuring reproducible comparisons.
  • The suite drives research advancements by facilitating method development, enabling objective evaluations, and promoting community-driven dataset expansion across varied fields.

PS-Bench refers to several distinct benchmark initiatives across computational biology, robotics, and program synthesis, each designed as a standardized, quantitative framework for performance evaluation, replicable benchmarking, and method comparison.

1. PSBench for Protein Complex Model Quality Assessment

PSBench is a large-scale benchmark suite for objective evaluation of protein complex structural model accuracy. It incorporates four labeled datasets produced during the 15th and 16th Critical Assessment of Protein Structure Prediction (CASP) experiments:

  • CASP15_inhouse_dataset (7,885 models, 31 targets)
  • CASP15_community_dataset (10,942 models, 40 targets)
  • CASP16_inhouse_dataset (1,009,050 models, 36 targets)
  • CASP16_community_dataset (12,904 models, 39 targets)

Targets span a wide range of sequence lengths (96–8,460 residues), 25 stoichiometries, and 21 functional protein classes. All models result from blind protein structure prediction protocols and are annotated via an automated pipeline employing OpenStructure and USalign for chain matching and residue renumbering. Each structural decoy is labeled with complementary quality metrics at three levels:

  • Global (e.g., RMSD of Cα atoms, four TM-score variants)
  • Local (lDDT)
  • Interface-level (ICS, ICS_precision, ICS_recall, IPS, QS_global, QS_best, DockQ_wave)

Thresholds on DockQ_wave (>0.49 = "good", 0.23–0.49 = "acceptable", <0.23 = "bad") enable categorical quality classification. PSBench provides evaluation scripts for standard metrics: Pearson and Spearman correlations, AUROC, and ranking loss, facilitating rigorous cross-target comparisons.
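The categorical thresholds and the ranking-loss metric above can be sketched as follows. The function names are illustrative, and the ranking-loss definition used here (true quality of the best model minus true quality of the model the predictor ranks first) follows common EMA practice; it is an assumption, not code from PSBench itself.

```python
import numpy as np

def dockq_wave_class(score):
    """Categorize a model by its DockQ_wave score (thresholds from the text)."""
    if score > 0.49:
        return "good"
    if score >= 0.23:
        return "acceptable"
    return "bad"

def ranking_loss(true_quality, predicted_quality):
    """True quality of the best model minus true quality of the model
    ranked first by the EMA predictor; 0 means a perfect top-1 pick."""
    t = np.asarray(true_quality, dtype=float)
    p = np.asarray(predicted_quality, dtype=float)
    return float(t.max() - t[p.argmax()])
```

Per-target Pearson/Spearman correlations and AUROC can then be computed over each target's pool of decoys and averaged across targets.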

Baseline Estimation of Model Accuracy (EMA) methods include AFM Confidence (AlphaFold2-Multimer output), DProQA (Gated Graph Transformer), VoroIF-GNN, VoroMQA-dark, GCPNet-EMA, and PSS. Single-model scores are normalized for incomplete decoys.

To demonstrate utility, the GATE (Graph trAnsformer for esTimation of model accuracy) method was trained on PSBench data and evaluated blindly in CASP16, achieving top rankings versus 37 competing EMA methods. PSBench reduces barriers for method development and standardizes comparison, permitting community-driven dataset expansion via its annotation pipeline (Neupane et al., 13 May 2025).

2. PS-Bench for Upper Limb Exoskeleton Evaluation

The Pronation–Supination test bench ("PS-Bench") is a reference-standard quantitative evaluation system for upper-limb exoskeletons developed by Nguiadem et al. It is built around a biofidelic prosthetic upper limb (3.5 kg) that reproduces radioulnar forearm mechanics and integrates four Dynamixel AX-18A servomotors providing absolute position sensing (0–300°, encoded as 0–1023 units) and load sensing (0–2047 units). The servos are interfaced through an Arduino Uno controller over half-duplex UART, with data acquisition at ~200 Hz and real-time feedback streaming.

A multibody kinematic model with 23 degrees of freedom is symbolically generated via constrained Lagrangian formalism in ROBOTRAN, encompassing thorax, shoulder girdle joints, humeroulnar flexion-extension, closed-loop PS mechanism, and wrist axes. Joint torque estimation utilizes both inverse dynamics (simulation) and load-to-torque conversion:

  • For a raw servo load reading X (0–2047):
    • If X < 1024: K = X × 0.1 %
    • If X > 1023: K = (X − 1024) × 0.1 %
  • Physical torque: T = 1.8 K (N·m)
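The conversion above maps the servo's raw load register to a physical torque scaled by the AX-18A's ~1.8 N·m stall torque. A minimal sketch, assuming K enters T = 1.8 K as a fraction (percent / 100), which is consistent with the ~0.28 N·m experimental torques reported below:

```python
def servo_load_to_torque(x):
    """Convert a raw Dynamixel load reading (0-2047) to torque in N*m.
    Readings below 1024 and from 1024 upward encode the two load
    directions; treating K as a fraction of full load in T = 1.8 K is
    an assumption, not stated explicitly in the source."""
    if not 0 <= x <= 2047:
        raise ValueError("load reading out of range")
    k_percent = (x if x < 1024 else x - 1024) * 0.1  # K = X * 0.1 %
    return 1.8 * k_percent / 100.0                   # T = 1.8 K (N*m)
```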

Experimental protocols involve three test sessions per bench (intra/inter-session repeatability), recording peak deflections and torques during CCW/CW-driven PS cycles.

Quantitative ROM and torque findings include:

  • Simulated ROM: 146.84 ± 14.32°
  • Experimental ROM: 156.26 ± 4.71°
  • Simulated torque range: 0.20 ± 0.05 N·m (max 0.29 N·m)
  • Experimental torque range: 0.28 ± 0.06 N·m

Intraclass correlation coefficients indicate high reliability: ICC = 0.96–0.98 intra-session and ICC = 0.81–0.93 inter-session. The overlap in ROM and the correlation in torque validate the fidelity of both the physical bench and the simulated model. The systematic overestimation of torque in experiment relative to simulation underscores the importance of sizing actuators for peak rather than mean demand.
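The reliability figures above are intraclass correlation coefficients; the exact ICC form is not stated in this summary. A minimal sketch of ICC(2,1) (two-way random effects, absolute agreement), one common choice for session repeatability, computed from the standard ANOVA mean squares:

```python
import numpy as np

def icc2_1(ratings):
    """ICC(2,1) for an n_subjects x k_sessions matrix of measurements.
    The choice of the (2,1) form is an assumption for illustration."""
    y = np.asarray(ratings, dtype=float)
    n, k = y.shape
    grand = y.mean()
    ss_total = ((y - grand) ** 2).sum()
    ss_rows = k * ((y.mean(axis=1) - grand) ** 2).sum()  # between subjects
    ss_cols = n * ((y.mean(axis=0) - grand) ** 2).sum()  # between sessions
    ss_err = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )
```

Perfectly repeatable sessions yield ICC = 1; session-to-session disagreement pulls the coefficient down toward 0.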

PS-Bench functions as a reproducible platform for benchmarking upper-limb exoskeleton kinematics and torques, actuator sizing, and non-invasive muscle force estimation. Planned developments include broader joint coverage and clinical-standard expansion (Nguiadem et al., 2021).

3. Historical Context and Motivation

Each PS-Bench initiative arises from the need to address field-specific benchmarking deficiencies:

  • In structural bioinformatics, reliable EMA assessment of complex protein models was hindered by lack of large, annotated datasets covering the full spectrum of prediction difficulty and protein classes; PSBench is designed to serve as a protein complex modeling "ImageNet".
  • For rehabilitation robotics, quantification and standardization of exoskeleton performance—especially joint ROM and torque—were recognized as critical challenges, motivating physical PS-Bench development.
  • In program synthesis, legacy benchmarks (PSB1) lost discriminatory power as techniques improved, spurring creation of PSB2 (the Second Program Synthesis Benchmark Suite) with more challenging, diverse problems (Helmuth et al., 2021).

4. Benchmark Methodologies and Evaluation Protocols

PSBench protocols enforce rigorous evaluation methodologies:

  • Protein complex EMA: Models are evaluated on complementary orthogonal quality scores, using standardized annotation and normalization pipelines. Metrics are reported both per-target and as cross-target averages, with explicit best practices (e.g., reporting tmscore_usalign_aligned and DockQ_wave).
  • Upper-limb exoskeletons: ROM and torque are computed using both multibody simulation and direct actuator telemetry, aligning experimental and modeled outputs, and quantifying protocol repeatability with statistical reliability measures (ICC).
  • Program synthesis: PSB2 problems feature high input-space cardinality and hand-curated edge-case datasets, discouraging overfitting and encouraging generalization. Success Rate (SR), Generalization Rate (GR), and solution complexity serve as the principal evaluation metrics (Helmuth et al., 2021).
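Given per-run outcomes, SR and GR can be computed directly. A minimal sketch, assuming SR counts runs whose evolved program passes the unseen test set and GR is the fraction of training-set successes that also generalize (the exact definitions should be checked against the PSB2 paper):

```python
def success_and_generalization_rates(runs):
    """runs: list of (solved_training, solved_test) booleans, one per GP run.
    Returns (SR, GR) under the assumed definitions in the lead-in."""
    n = len(runs)
    train_solved = sum(1 for train, _ in runs if train)
    test_solved = sum(1 for train, test in runs if train and test)
    sr = test_solved / n if n else 0.0
    gr = test_solved / train_solved if train_solved else 0.0
    return sr, gr
```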

5. Impact and Significance

PS-Bench benchmarks catalyze progress by setting objective standards and facilitating comparable evaluations:

  • In EMA for protein complexes, PSBench's scale and diversity support the development and validation of deep learning methods (e.g., GATE, MULTICOM_GATE) and enable fair, multi-method comparisons.
  • For exoskeletons, PS-Bench delivers precise mechanical and kinematic feedback, fostering replicable hardware and control system advancements across clinical and engineering domains.
  • In program synthesis, PSB2 raises the bar for automatic software creation, highlighting persistent challenges and guiding methodological innovation.

A plausible implication is that continued, community-driven PS-Bench extensions will be pivotal for the reproducibility and acceleration of progress in their respective fields.

6. Future Directions

The PS-Bench framework across domains is characterized by openness to extension:

  • PSBench (protein): Incorporation of community-submitted decoys, expansion of functional classes, and increasing heterogeneity in prediction difficulty. Annotation pipeline allows perpetual dataset growth.
  • PS-Bench (robotics): Extension to include elbow and wrist axes, integration of human subject trials, and clinical benchmarking.
  • Program synthesis: Multi-language benchmarks, parametrized problem difficulty, and real-time community problem contribution to maintain challenge relevance (Helmuth et al., 2021).

Each PS-Bench variant is positioned to evolve into a de facto standard, systematically reducing the evaluation bottleneck for new algorithmic and hardware innovations.
