ATBench: Advanced Benchmarking for ML
- ATBench is a suite of domain-specific benchmarks that rigorously evaluates state-of-the-art machine learning systems across federated learning, AI training, assistive vision-language, adversarial robustness, atomic modeling, and agent safety.
- Each benchmark employs specialized protocols and metrics—such as measuring adaptation, trust, and reasoning in FL or imperceptibility in adversarial attacks—to ensure reproducible and comprehensive evaluations.
- By addressing real-world challenges like distribution shifts, fairness issues, and privacy risks, ATBench drives innovation, reliable performance comparisons, and methodical advancement in machine learning research.
ATBench
ATBench refers to several distinct, prominent benchmarks in machine learning, each of which defines state-of-the-art protocols and metrics for systematic and reproducible evaluation within its specialized domain. Notably, ATBench denotes: (1) ATR-Bench, a comprehensive framework for federated learning (FL) systematically assessing Adaptation, Trust, and Reasoning; (2) the AIBench Training suite, an industry-standard AI training benchmark; (3) @Bench for multi-task human-centered assistive vision-LLMs; (4) TabAttackBench for adversarial robustness of tabular models; (5) AtomBench for generative atomic structure modeling; and (6) ATBench as the Agent Trajectory Safety and Security Benchmark in the domain of autonomous agent safety evaluation. Each instantiation is domain-specific but shares a common goal: establishing rigorous, multi-faceted, and generalizable reference points for development, comparison, and analysis of advanced machine learning systems.
1. ATR-Bench: Federated Learning Benchmark for Adaptation, Trust, and Reasoning
ATR-Bench is a unified evaluation suite for horizontal federated learning (HFL), addressing the inherent limitations of classical FL benchmarks—typically focused on IID or mildly non-IID accuracy—by explicitly measuring three orthogonal performance dimensions: Adaptation (generalization and personalization under heterogeneity), Trust (robustness to adversaries, fairness), and Reasoning (distributed interpretability and structured inference) (Ashraf et al., 22 May 2025). This multidimensional perspective is motivated by real-world FL deployments where cross-client and out-of-client distribution shifts, malicious participants, and the need for transparent, interpretable decision making fundamentally constrain utility and reliability.
Foundational Pillars
- Adaptation: Distinguishes between cross-client and out-of-client shifts (e.g., label skew, domain skew), formalized either via standard federated aggregation objectives or augmented by client-local regularizers, control variates, or alignment terms.
- Trust: Subdivides into Byzantine tolerance, backdoor defense, and fairness in both collaboration (proxy Shapley-value contributions) and outcome (accuracy variance).
- Reasoning: Outlines initial protocols for federated interpretability, including distributed attention/saliency sharing, explanation-guided aggregation, and symbolic–neural hybrid approaches; standardized reasoning metrics remain undeveloped.
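As a concrete illustration of the Adaptation pillar, the sketch below pairs a FedProx-style local objective (a client-local proximal regularizer, one of the alignment mechanisms mentioned above) with FedAvg aggregation. It is a minimal NumPy toy on a least-squares model, not ATR-Bench code; all data, layer sizes, and hyperparameters are hypothetical.

```python
import numpy as np

def local_update(w_global, X, y, mu=0.1, lr=0.05, steps=100):
    """FedProx-style local update: least-squares loss plus a proximal
    term (mu/2)*||w - w_global||^2 that limits client drift."""
    w = w_global.copy()
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)   # data-fit gradient
        grad += mu * (w - w_global)         # proximal (alignment) term
        w -= lr * grad
    return w

def fedavg(updates, sizes):
    """Server-side sample-weighted averaging (FedAvg)."""
    return np.average(np.stack(updates), axis=0, weights=np.asarray(sizes, float))

rng = np.random.default_rng(0)
w_global = np.zeros(3)
updates, sizes = [], []
for i in range(4):                                   # four heterogeneous clients
    X = rng.normal(size=(40, 3))
    w_true = np.array([1.0, -1.0, 0.5]) + 0.2 * i    # client-specific shift (skew)
    y = X @ w_true
    updates.append(local_update(w_global, X, y))
    sizes.append(len(y))
w_global = fedavg(updates, sizes)
print(w_global.round(2))
```

The proximal coefficient `mu` trades local fit against alignment with the global model, which is precisely the tension the Adaptation metrics quantify.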
Key Metrics
| Dimension | Core Metrics |
|---|---|
| Adaptation | Cross-client accuracy; out-of-client accuracy |
| Trust | Robustness (accuracy under Byzantine/backdoor attack), backdoor success rate, fairness (contribution parity, accuracy stddev) |
| Reasoning | No consensus metric; trace coherence and privacy leakage proposed |
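The outcome-fairness metrics in the Trust dimension reduce to simple statistics over per-client accuracies. A minimal sketch with hypothetical values:

```python
import numpy as np

# Hypothetical per-client test accuracies after federated training
client_acc = np.array([0.81, 0.76, 0.84, 0.62, 0.79])

cross_client_acc = client_acc.mean()  # Adaptation: average in-federation accuracy
fairness_std = client_acc.std()       # Trust (outcome fairness): accuracy dispersion
worst_client = client_acc.min()       # often reported alongside the stddev

print(f"mean={cross_client_acc:.3f} std={fairness_std:.3f} worst={worst_client:.2f}")
```

A fairness-aware method would aim to shrink `fairness_std` and raise `worst_client`, possibly at some cost to the mean, which is the trade-off reported in the experimental results below.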
Datasets and Protocols
ATR-Bench spans canonical datasets (CIFAR-10/100, MNIST, Tiny-ImageNet, Office-Caltech, PACS) and implements standardized communication, optimization, and evaluation protocols. Baselines include FedAvg, FedProx, SCAFFOLD, FedDyn, prototype-based, robust aggregation, and fairness-aware FL methods. Experimental results reveal, for example, that robust aggregation (DnC, RFA) is most resilient to Byzantine attacks, MOON improves adaptation in moderate but not extreme skew, and fairness-oriented approaches trade-off mean accuracy for reduced variance.
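The resilience of robust aggregation can be illustrated with the simplest such rule, the coordinate-wise median (DnC and RFA are more sophisticated variants). The example below uses synthetic honest and Byzantine updates, not benchmark data:

```python
import numpy as np

def aggregate(updates, rule="mean"):
    """Aggregate client updates; the median resists Byzantine outliers."""
    U = np.stack(updates)
    return U.mean(axis=0) if rule == "mean" else np.median(U, axis=0)

rng = np.random.default_rng(1)
honest = [np.ones(4) + 0.05 * rng.normal(size=4) for _ in range(8)]
byzantine = [np.full(4, 50.0), np.full(4, 60.0)]   # malicious, inflated updates

updates = honest + byzantine
print("mean  :", aggregate(updates, "mean").round(2))    # dragged toward attackers
print("median:", aggregate(updates, "median").round(2))  # stays near honest value 1.0
```

With two attackers among ten clients, the mean is pulled far from the honest consensus while the median barely moves, which mirrors the Byzantine-tolerance findings reported for robust aggregation.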
Open Challenges
- No consensus benchmark for federated reasoning exists.
- Reproducibility is hampered by non-uniform codebases and missing hyperparameters.
- Advanced regularizers carry efficiency gaps.
- Solutions remain fragmented along individual axes.
- Generalization to non-horizontal FL and integration of federated pretrained models are still needed.
2. AIBench Training (ATBench): Comprehensive AI Training Benchmark Suite
AIBench Training, sometimes denoted ATBench, is the most comprehensive, industry-standard training benchmark suite, supporting reproducible and cost-sensitive evaluation across a spectrum of AI workloads (Tang et al., 2020). It covers nineteen end-to-end tasks across image, text, speech, recommendation, and 3D domains, with testbeds including both full and minimal subsets for robust workload ranking and microarchitectural characterization.
Features and Metrics
- Exhaustive representation of machine learning workloads, engaging convolutional, recurrent, attention-based, and transformer paradigms.
- Systematic coverage of learning dynamics factors down to microarchitectural events (GPU occupancy, DRAM utilization).
- Benchmarks are categorized for full-suite evaluation, repeatable performance ranking (e.g., Image Classification, Learning-to-Rank), and workload characterization (e.g., Spatial Transformer, Speech Recognition).
- Key metrics encompass parameter count, computation per sample, convergence rate, and memory/compute access characteristics.
- Performance results confirm ATBench’s greater diversity, representativeness, and cost efficiency versus MLPerf 0.7, including superior hotspot coverage and architectural insight at a significantly reduced runtime.
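The per-workload cost metrics can be illustrated with back-of-the-envelope accounting for a dense MLP. The layer sizes are hypothetical, and this sketch ignores activations, normalization, and convergence-rate measurement, which the full suite also tracks:

```python
def mlp_cost(layer_sizes):
    """Parameter count and approximate multiply-adds per sample for a
    dense MLP (weights + biases; one multiply-add per weight)."""
    params, macs = 0, 0
    for d_in, d_out in zip(layer_sizes, layer_sizes[1:]):
        params += d_in * d_out + d_out   # weight matrix plus bias vector
        macs += d_in * d_out             # forward-pass multiply-adds
    return params, macs

p, m = mlp_cost([784, 256, 128, 10])     # hypothetical MNIST-sized classifier
print(p, m)
```

Real benchmark characterization pairs such model-level counts with measured microarchitectural events (GPU occupancy, DRAM utilization) rather than analytic estimates alone.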
3. @Bench: Benchmarking Vision-LLMs for Assistive Technologies
@Bench targets the human-centered evaluation of vision-LLMs in assistive-technology (AT) contexts for people with visual impairments (PVIs) (Jiang et al., 2024). It is constructed on foundational user studies, defining five core tasks universally relevant for PVIs: Panoptic Segmentation, Depth Estimation, Optical Character Recognition (OCR), Image Captioning, and Visual Question Answering.
Benchmark Scope and Protocols
- Datasets: ADE20K, NYU-v2, extensive synthetic and real-world OCR, VizWiz_Cap, VizWiz_VQA.
- Task-specific evaluation metrics: Panoptic Quality (PQ), RMSE for depth, accuracy for OCR/VQA, BLEU/CIDEr for captioning.
- Emphasis on multi-task learning and real-world device constraints (model size, latency).
- Baselines include state-of-the-art generalist VLMs (e.g., Unified-IO, X-Decoder) and a unified model (@Model) using a prompt-conditioned I/O interface.
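The Panoptic Quality metric for the segmentation task is defined as PQ = Σ_TP IoU / (|TP| + ½|FP| + ½|FN|), with predicted and ground-truth segments counted as matched when their IoU exceeds 0.5. A minimal sketch with hypothetical IoU values:

```python
def panoptic_quality(matched_ious, n_fp, n_fn):
    """PQ from the IoUs of matched (true-positive) segment pairs and the
    counts of unmatched predicted (FP) and ground-truth (FN) segments."""
    tp = len(matched_ious)
    denom = tp + 0.5 * n_fp + 0.5 * n_fn
    return sum(matched_ious) / denom if denom else 0.0

# Three matched segments, one spurious prediction, one missed segment
pq = panoptic_quality([0.9, 0.8, 0.7], n_fp=1, n_fn=1)
print(round(pq, 3))
```

PQ jointly penalizes poor mask overlap (low IoUs) and recognition errors (FP/FN counts), which is why it is preferred over plain pixel accuracy for panoptic tasks.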
Empirical Insights
Joint training boosts cross-task generalization (notably VQA and captioning), character-level tokenization is critical for OCR, and hardware-aware model compression remains challenging for future assistive deployments.
4. TabAttackBench: Benchmark for Adversarial Attacks on Tabular Data
TabAttackBench addresses evaluation gaps in adversarial robustness for models operating on tabular data, emphasizing both attack success and imperceptibility, which is uniquely nuanced in tabular domains (He et al., 27 May 2025). The benchmark defines a multi-metric imperceptibility assessment covering proximity (ℓ2 distance), sparsity, Mahalanobis deviation, and feature sensitivity, rather than relying on ℓp-norms alone.
Benchmark Design
- Datasets: Eleven standard tabular datasets, including both mixed-type (numerical + categorical, e.g., Adult, COMPAS) and numerical-only.
- Models: Logistic Regression, MLP, TabTransformer, FTTransformer.
- Attacks: FGSM, BIM, PGD, DeepFool, Carlini–Wagner (CW), plus Gaussian noise baseline.
| Attack Family | Main Strength | Key Weakness |
|---|---|---|
| ℓ∞-based (FGSM/BIM/PGD) | High attack success rate (ASR) | Low imperceptibility (large ℓ2 shift, high outlier rate) |
| ℓ2-based (DeepFool, CW) | Better imperceptibility/stealth | Some loss in ASR |
Findings
Transformer-based tabular models exhibit comparatively higher robustness; ℓ2-based attacks such as DeepFool and CW offer superior trade-offs for stealth. Attacks often fail to perturb categorical (one-hot) features, suggesting the need for attack algorithms that model feature dependencies and mixed-type constraints.
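A one-step FGSM perturbation restricted to numerical columns illustrates both the attack mechanics and the categorical-feature limitation noted above. The feature values, gradient, and mask are all hypothetical:

```python
import numpy as np

def fgsm_tabular(x, grad, eps, num_mask):
    """One-step FGSM (x + eps * sign(grad)) applied only to numerical
    features; one-hot categorical columns are left untouched, since
    naively perturbing them produces invalid rows."""
    return x + eps * np.sign(grad) * num_mask

x = np.array([0.3, 1.2, 0.0, 1.0])        # last two dims: a one-hot category
grad = np.array([0.5, -0.2, 0.1, -0.1])   # hypothetical loss gradient w.r.t. x
num_mask = np.array([1.0, 1.0, 0.0, 0.0]) # perturb numerical features only

x_adv = fgsm_tabular(x, grad, eps=0.1, num_mask=num_mask)
l2 = np.linalg.norm(x_adv - x)            # proximity / imperceptibility proxy
print(x_adv, round(l2, 3))
```

Measuring the ℓ2 shift of the crafted row against the original is the simplest of the benchmark's imperceptibility criteria; sparsity and Mahalanobis deviation refine it further.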
5. AtomBench: Generative Modeling of Atomic Structures
AtomBench is a reproducible benchmark for the evaluation of inverse-design generative models in computational materials science, focusing on crystalline structure generation conditioned on compositional and property constraints (Campbell et al., 17 Oct 2025). The benchmark covers: (i) AtomGPT (transformer-based), (ii) Crystal Diffusion VAE (CDVAE), and (iii) FlowMM (Riemannian flow matching), trained/tested on JARVIS Supercon-3D and Alexandria DS-A/B datasets.
Evaluation Metrics
- Distributional Kullback–Leibler divergence (KLD) between predicted and reference lattice parameter histograms,
- Mean Absolute Error (MAE) of lattice constants.
| Model | JARVIS Avg. KLD | JARVIS MAE (Å) | Alexandria Avg. KLD | Alexandria MAE (Å) |
|---|---|---|---|---|
| CDVAE | 0.08 | 0.07 | 0.05 | 0.05 |
| AtomGPT | 0.12 | 0.10 | 0.11 | 0.14 |
| FlowMM | 0.18 | 0.22 | 0.25 | 0.27 |
CDVAE demonstrates the lowest error across datasets, primarily attributable to its denoising diffusion decoder and joint optimization of lattice, atom, and composition spaces. AtomGPT is effective in atomic coordinate prediction due to transformer-driven long-range dependency modeling.
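The distributional KLD metric can be sketched as a divergence between binned lattice-parameter histograms. The samples below are synthetic Gaussians standing in for a lattice constant, not JARVIS or Alexandria data:

```python
import numpy as np

def histogram_kld(pred, ref, bins=20, eps=1e-10):
    """KL divergence between binned distributions of a lattice parameter.
    Both histograms share the reference bin edges; eps avoids log(0)."""
    edges = np.histogram_bin_edges(ref, bins=bins)
    p, _ = np.histogram(pred, bins=edges)
    q, _ = np.histogram(ref, bins=edges)
    p = p / p.sum() + eps
    q = q / q.sum() + eps
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
ref_a = rng.normal(4.0, 0.30, 2000)    # reference lattice constant a (Å)
pred_a = rng.normal(4.05, 0.35, 2000)  # generated samples, slight shift
print(round(histogram_kld(pred_a, ref_a), 3))
```

A perfect generator would drive this divergence toward zero; larger shifts or width mismatches in the generated distribution inflate it, matching the ordering in the table above.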
6. ATBench: Agent Trajectory Safety and Security Benchmark
ATBench in the context of agent safety is the Agent Trajectory Safety and Security Benchmark, introduced to quantify and diagnose safety and security risks of autonomous AI agents over long-horizon, tool-augmented trajectories (Liu et al., 26 Jan 2026). It employs a structured three-way taxonomy: Risk Source (8 classes; e.g., user input, environment, tools, agent logic), Failure Mode (14 classes; e.g., flawed reasoning, insecure execution, harmful content), and Real-World Harm (10 classes; e.g., privacy, integrity, health, equity).
Benchmark Corpus and Labeling
- 500 synthetic agent trajectories (250 safe, 250 unsafe; ~9-turn average length; 1,575 unique tools).
- Each trajectory is labeled for binary safety and for the fine-grained (source, failure, harm) taxonomy, with labels assigned by majority voting among multiple agentic LLMs and human review as needed.
| Dimension | Categories (examples) |
|---|---|
| Risk Source | Malicious instruction, prompt injection, tool feedback, LLM error |
| Failure Mode | Improper tool use, insecure execution, harmful output, information leakage |
| Real-World Harm | Privacy, system integrity, health, economic, fairness |
Evaluation Protocol
Binary trajectory-level safety classification (accuracy, precision, recall, F1), with fine-grained taxonomy accuracy per dimension. Closed-source LLMs, open guard models, and dedicated diagnostic models (AgentDoG variants) are compared; the AgentDoG variants achieve the strongest safety-classification F1 and the highest fine-grained risk-source accuracy among the compared models.
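The trajectory-level classification metrics can be computed directly from binary verdicts; the labels below are hypothetical (1 = unsafe):

```python
def safety_metrics(y_true, y_pred):
    """Trajectory-level binary safety metrics (positive class = unsafe)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(y_true),
            "precision": precision, "recall": recall, "f1": f1}

# Hypothetical guard-model verdicts on 10 balanced trajectories
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
print(safety_metrics(y_true, y_pred))
```

Because the corpus is balanced (250 safe / 250 unsafe), accuracy is informative, but precision and recall expose whether a guard model over-flags benign trajectories or misses unsafe ones.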
7. Comparative Significance and Future Directions
While each ATBench instance targets distinct methodological and application domains, all share a structural commitment to: (i) comprehensive, multi-dimensional evaluation, (ii) rigorous metric development, (iii) transparent and reproducible protocols, and (iv) direct alignment with real-world challenges (data heterogeneity, safety, interpretability, or hardware). Implications for future research include expanded reasoning evaluation in FL, unified imperceptibility metrics for other structured data, protocol extensions to new application domains (e.g., safe multimodal agents), and broader adoption of benchmark-driven development lifecycles.
The ATBench identifier thus codifies benchmark design excellence, domain-specific rigor, and reproducibility in contemporary machine learning research.