Sequential Diagnosis Benchmarks
- Sequential diagnosis benchmarks are systematic frameworks that guide adaptive test sequencing, balancing diagnostic accuracy with cost and resource constraints.
- They integrate probabilistic, decision-theoretic, and data-driven methods to optimize information acquisition in domains from industrial fault localization to clinical diagnostics.
- Key datasets and quality metrics enable scalable evaluations, driving advancements in efficient diagnostic policies and robust sequential reasoning.
Sequential diagnosis benchmarks provide systematic frameworks, datasets, and algorithms for evaluating sequential decision-making strategies that identify system states or pathologies through a series of adaptive tests or observations. Unlike static diagnostic evaluations, these benchmarks emphasize stepwise data acquisition, optimizing not only for diagnostic accuracy but also for resource or cost efficiency. Both model-based and data-driven approaches are represented, with applications ranging from industrial fault localization and medical diagnosis to cognitive assessment and complex reasoning in LLMs.
1. Foundational Models and Methodologies
Early sequential diagnosis benchmarks are anchored in probabilistic and decision-theoretic models. The Bayesian framework for sequential change diagnosis (0710.4847) formalizes the joint detection and identification of regime changes in random sequences. This formulation unifies optimal stopping theory and Bayesian multi-hypothesis testing, with a risk functional that combines per-period delay costs and misclassification penalties:
$$R(\tau, d) \;=\; \mathbb{E}\!\left[\, c\,(\tau - \theta)^{+} \;+\; a_{\mu d}\,\mathbf{1}\{d \neq \mu\} \,\right],$$
where $\tau$ is the stopping time, $d$ the diagnosis, $\theta$ the unknown change time, and $\mu$ the post-change regime. The sufficient statistic is a recursively updated posterior probability vector, reducing the high-dimensional observation history to a tractable Bayesian filter.
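As a concrete toy illustration of such a filter, the sketch below implements a recursive posterior update under a geometric change-time prior and Gaussian observation densities. The variable names (`pi0`, `rho`, `nu`) and the specific densities are illustrative assumptions, not details from the cited paper.

```python
import numpy as np

def posterior_update(pi0, pi, x, rho, nu, f0, fs):
    """One step of the recursive Bayesian filter for sequential change
    diagnosis: geometric change time with hazard `rho`, M post-change
    regimes with prior `nu`, pre-change density f0, regime densities fs."""
    pre = pi0 * (1.0 - rho) * f0(x)                    # still pre-change
    post = (pi + pi0 * rho * nu) * np.array([f(x) for f in fs])
    z = pre + post.sum()                               # normalizing constant
    return pre / z, post / z

# Toy example: standard-normal observations shift to mean +2 (regime 0)
rng = np.random.default_rng(0)
f0 = lambda x: np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)
fs = [lambda x, m=m: np.exp(-0.5 * (x - m) ** 2) / np.sqrt(2 * np.pi)
      for m in (2.0, -2.0)]
rho, nu = 0.05, np.array([0.5, 0.5])

pi0, pi = 1.0, np.zeros(2)
for n in range(200):
    x = rng.normal(2.0 if n >= 50 else 0.0, 1.0)       # change occurs at n = 50
    pi0, pi = posterior_update(pi0, pi, x, rho, nu, f0, fs)

print(pi0, pi)  # posterior mass concentrates on regime 0
```

A stopping rule would then declare a change once the largest regime posterior crosses a threshold, trading delay cost against misclassification risk.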
Another canonical approach is the decision-analytic test sequencing rule, which, given known test costs $c_i$ and failure probabilities $p_i$, selects the next diagnostic action by minimizing the ratio $c_i / p_i$ (Kalagnanam et al., 2013). Running tests in increasing order of this ratio minimizes the expected diagnostic cost
$$\mathbb{E}[C] \;=\; \sum_{i} c_{\pi(i)} \Big(1 - \sum_{j<i} p_{\pi(j)}\Big),$$
where $\pi$ denotes the chosen test ordering.
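The index rule can be exercised directly. The following sketch, with made-up costs and probabilities, orders tests by their cost-to-probability ratio and evaluates the expected cost under the standard single-fault, stop-on-detection model:

```python
def order_tests(costs, probs):
    """Index rule: run tests in increasing order of c_i / p_i."""
    return sorted(range(len(costs)), key=lambda i: costs[i] / probs[i])

def expected_cost(order, costs, probs):
    """Expected cost under a single-fault model: test i is paid for only
    if none of the earlier tests has already located the fault."""
    total, still_searching = 0.0, 1.0
    for i in order:
        total += still_searching * costs[i]
        still_searching -= probs[i]
    return total

costs = [5.0, 2.0, 10.0]   # hypothetical per-test costs
probs = [0.5, 0.3, 0.2]    # hypothetical fault-location probabilities
best = order_tests(costs, probs)
print(best, expected_cost(best, costs, probs))  # [1, 0, 2] 7.5
```

For comparison, the naive order [0, 1, 2] on the same inputs costs 8.0 in expectation, illustrating the savings the index rule buys.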
For system-level model-based benchmarks, the FRACTAL framework (Feldman et al., 2014) integrates automated test pattern generation (ATPG) with active, greedy, or probing-based reduction of the diagnosis space. These approaches formalize diagnosis as an expected-uncertainty minimization in the space of minimal-cardinality diagnoses, applying geometric decay models to benchmark convergence rates.
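A minimal sketch of greedy, expected-uncertainty-minimizing test selection follows; the toy outcome tables and data structures are illustrative assumptions, not FRACTAL's actual interface:

```python
import math

def entropy(ps):
    return -sum(p * math.log2(p) for p in ps if p > 0)

def expected_entropy(outcomes, priors):
    """Expected entropy over candidate diagnoses after observing a test.
    outcomes[d] is the outcome the test would produce under diagnosis d."""
    groups = {}
    for d, out in enumerate(outcomes):
        groups.setdefault(out, []).append(priors[d])
    total = sum(priors)
    h = 0.0
    for mass in groups.values():
        w = sum(mass) / total                       # probability of this outcome
        h += w * entropy([m / sum(mass) for m in mass])
    return h

def pick_test(tests, priors):
    """Greedy selection: the test minimizing expected remaining entropy."""
    return min(range(len(tests)), key=lambda t: expected_entropy(tests[t], priors))

priors = [0.4, 0.3, 0.2, 0.1]        # four candidate diagnoses
tests = [
    [0, 0, 1, 1],                    # separates {d0, d1} from {d2, d3}
    [0, 1, 1, 1],                    # isolates d0 on outcome 0
]
print(pick_test(tests, priors))
```

With these priors the test that can cleanly isolate the most probable diagnosis wins, even though the other test splits the candidate set evenly.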
2. Algorithmic Advances and Scalability
With increasing system complexity and data dimensionality, computationally efficient and scalable methods have emerged. For instance, the sequential Gibbs sampling algorithm for cognitive diagnosis models (Wang et al., 2020) scales Bayesian inference for high-dimensional binary attribute vectors from exponential ($2^K$) to linear ($K$) complexity in the number of attributes via coordinate-wise sampling of attributes, maintaining estimation fidelity while enabling benchmarking in large-scale educational and psychometric CDMs.
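The coordinate-wise idea can be sketched generically: resample each binary attribute from its full conditional, so a sweep costs two likelihood evaluations per attribute rather than an enumeration of all $2^K$ profiles. The toy target below (independent Bernoulli attributes) is an assumption for demonstration, not the CDM likelihood of the cited paper.

```python
import numpy as np

def gibbs_sweep(alpha, loglik, rng):
    """One coordinate-wise Gibbs sweep over a binary attribute vector:
    each attribute is resampled from its full conditional using just two
    likelihood evaluations, instead of enumerating all 2^K profiles."""
    for k in range(len(alpha)):
        lp = np.empty(2)
        for v in (0, 1):
            alpha[k] = v
            lp[v] = loglik(alpha)
        p1 = 1.0 / (1.0 + np.exp(lp[0] - lp[1]))  # P(alpha_k = 1 | rest)
        alpha[k] = int(rng.random() < p1)
    return alpha

# Toy target: K = 5 independent Bernoulli(0.8) attributes
def loglik(a):
    return float(np.sum(a * np.log(0.8) + (1 - a) * np.log(0.2)))

rng = np.random.default_rng(1)
alpha = np.zeros(5, dtype=int)
samples = []
for it in range(2000):
    alpha = gibbs_sweep(alpha, loglik, rng)
    if it >= 500:                       # discard burn-in
        samples.append(alpha.copy())
print(np.mean(samples, axis=0))         # each coordinate ≈ 0.8
```

In a real CDM the `loglik` would couple attributes through the item response model, but the per-sweep cost stays linear in $K$.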
Abstraction and hierarchy play a significant role in scaling sequential diagnosis to large engineered systems. The d-DNNF-based abstraction method (Siddiqi et al., 2014) leverages structural decomposition (cones) and component cloning to greatly reduce diagnostic search space. A novel, structure-driven cost estimation function—combining recursive logarithmic isolation and abstraction penalties—guides optimal abstraction selection, tightly correlating with observed benchmark costs on standard ISCAS-85 circuits.
Dynamic stateful tree-based diagnostic search (DynamicHS (Rodler, 2020, Rodler, 2019)) further optimizes the diagnosis process by maintaining minimal hitting set trees across iterative query steps, drastically reducing redundant recomputation and reasoner calls in knowledge bases and ontology debugging scenarios. Memory overhead is mitigated by efficient duplicate/pruning policies.
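The underlying computation these trees organize, minimal hitting sets of conflict sets, can be sketched with a plain breadth-first search (omitting the stateful reuse, duplicate handling, and pruning policies that DynamicHS adds):

```python
def minimal_hitting_sets(conflicts, max_size=4):
    """Breadth-first construction of minimal hitting sets of a family of
    conflict sets -- the computation that HS-tree-style algorithms organize
    as a reusable labeled tree (stateful reuse and pruning omitted here)."""
    conflicts = [frozenset(c) for c in conflicts]
    results, frontier = [], [frozenset()]
    for _ in range(max_size + 1):
        nxt = []
        for hs in frontier:
            unhit = next((c for c in conflicts if not (c & hs)), None)
            if unhit is None:                         # hs hits every conflict
                if not any(r < hs for r in results):  # keep only minimal sets
                    results.append(hs)
            else:                                     # branch on the unhit conflict
                nxt.extend(hs | {e} for e in unhit)
        frontier = nxt
    return results

# Components {1, 2, 3}; two failed checks yield two conflict sets.
confs = [{1, 2}, {2, 3}]
diagnoses = sorted(sorted(h) for h in minimal_hitting_sets(confs))
print(diagnoses)  # [[1, 3], [2]]
```

Each minimal hitting set is a candidate diagnosis: replacing those components explains every observed conflict.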
3. Benchmark Datasets and Quality Metrics
A representative spectrum of benchmarks now covers domains ranging from engineered circuits and software to rich medical and cognitive data. For mechanical fault diagnosis, motorcycle engine test sets (Kalagnanam et al., 2013) encapsulate test-action costs validated by expert mechanics, while the ISCAS and 74XXX combinational logic circuits (Feldman et al., 2014, Siddiqi et al., 2014) provide model-checking and test selection testbeds for active diagnosis algorithms.
In the biomedical domain, benchmarks include sequential dermoscopic imaging datasets for melanoma (Yu et al., 2020), the MedlinePlus and SymCat knowledge bases for simulated clinical interviews (Yuan et al., 2021), and large-scale EHR records (e.g., MIMIC-III/IV (Peng et al., 2021, Koo, 28 Jul 2024)) for sequential diagnosis prediction, with provisions for simulating incomplete or imbalanced data modalities.
Language-model-based benchmarks push diagnostic reasoning beyond static questions. DiagnosisArena (Zhu et al., 20 May 2025) synthesizes 1,113 multi-specialty case reports from top medical journals, requiring stepwise synthesis of examination, testing, and diagnosis. The Sequential Diagnosis Benchmark (Nori et al., 27 Jun 2025) transforms NEJM clinicopathological cases into interactive encounters with a simulated "Gatekeeper" for progressive disclosure and cost accounting.
Benchmark quality is now directly evaluated through data quality metrics. The Data Quality Index (DQI) (Mishra et al., 2020) systematically quantifies spurious bias and diversity via multi-component metrics such as vocabulary richness, intra-/inter-sample similarity, and n-gram distribution, supporting robust model–agnostic dataset assessment for sequential diagnostic datasets.
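A few DQI-style components can be computed in a handful of lines; the published index uses different sub-metrics and weightings, so treat these as illustrative proxies for vocabulary richness, inter-sample similarity, and n-gram skew:

```python
from collections import Counter
from itertools import combinations

def vocab_richness(samples):
    """Type-token ratio over the corpus (a vocabulary-diversity proxy)."""
    tokens = [t for s in samples for t in s.lower().split()]
    return len(set(tokens)) / len(tokens)

def mean_jaccard(samples):
    """Mean pairwise token overlap; high values suggest redundant or
    templated items, a source of spurious bias."""
    sets = [set(s.lower().split()) for s in samples]
    pairs = list(combinations(sets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

def top_ngrams(samples, n=2, k=3):
    """Most frequent n-grams; a heavily skewed head indicates templating."""
    grams = Counter()
    for s in samples:
        toks = s.lower().split()
        grams.update(zip(*(toks[i:] for i in range(n))))
    return grams.most_common(k)

cases = [
    "patient reports chest pain and dyspnea",
    "patient reports abdominal pain and nausea",
    "child presents with fever and rash",
]
print(vocab_richness(cases), mean_jaccard(cases), top_ngrams(cases))
```

Run over a full benchmark, metrics like these flag datasets whose sequential items share enough surface structure for models to shortcut the intended reasoning.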
4. Cost, Accuracy, and Policy Evaluation
Modern benchmarks explicitly account for not only diagnostic accuracy but also the resource efficiency of the sequential process. Cost functionals typically include per-test or per-visit financial costs, test-dependent acquisition costs, or entropy-based measures of uncertainty reduction. For example, the Sequential Diagnosis Benchmark (Nori et al., 27 Jun 2025) aggregates the cumulative cost of ordered tests and provider encounters:
$\text{Total Cost} = \sum_i \text{NumTests}_i \times C_i + N_\text{physician} \times \$300$
with diagnostic accuracy adjudicated on a multi-point scale that jointly considers anatomic, etiologic, and specificity criteria.
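The cost aggregation above is straightforward to implement; in the sketch below the test prices are invented placeholders, and only the $300 per-visit fee follows the benchmark's stated formula:

```python
def total_cost(test_orders, test_prices, physician_visits, visit_fee=300.0):
    """Cumulative episode cost: per-test charges plus a flat fee for each
    physician encounter ($300 per visit, per the benchmark's convention)."""
    tests = sum(n * test_prices[name] for name, n in test_orders.items())
    return tests + physician_visits * visit_fee

# Illustrative (invented) prices for three test types
prices = {"cbc": 25.0, "ct_chest": 400.0, "biopsy": 1200.0}
episode = {"cbc": 2, "ct_chest": 1}          # tests ordered during the episode
print(total_cost(episode, prices, physician_visits=3))  # 50 + 400 + 900 = 1350.0
```

Tracking cost this way lets a benchmark penalize shotgun test ordering even when the final diagnosis is correct.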
Policy evaluation may analyze the convergence of diagnosis as a function of sample efficiency (e.g., number of queries (Rodler et al., 2018)), exponential decay rates in remaining hypotheses (Feldman et al., 2014), or robustness under noise, error, or misleading prior information (Kalagnanam et al., 2013, Rodler et al., 2018). Active learning heuristics (e.g., entropy, split-in-half, risk-optimizing) are benchmarked over multiple real-world domains to quantify dependence on fault probability bias, sample size, and oracle quality.
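Policy evaluation of this kind can be simulated in a few lines. The sketch below compares a split-in-half policy against a naive one-hypothesis-at-a-time baseline on a toy yes/no query model; the setup is an illustrative assumption, not any cited benchmark's protocol:

```python
import random

def run_policy(pick, n_hyp=16, seed=0):
    """Count yes/no queries until a single hypothesis remains.
    Each query asks whether the true hypothesis lies in a chosen subset."""
    rng = random.Random(seed)
    truth = rng.randrange(n_hyp)
    alive = set(range(n_hyp))
    queries = 0
    while len(alive) > 1:
        subset = pick(alive, rng)
        alive = alive & subset if truth in subset else alive - subset
        queries += 1
    return queries

def split_in_half(alive, rng):
    """Query a subset holding half the remaining hypotheses (binary search)."""
    return set(sorted(alive)[: len(alive) // 2])

def one_at_a_time(alive, rng):
    """Naive baseline: ask about a single random hypothesis."""
    return {rng.choice(sorted(alive))}

halves = [run_policy(split_in_half, seed=s) for s in range(50)]
singles = [run_policy(one_at_a_time, seed=s) for s in range(50)]
print(sum(halves) / 50, sum(singles) / 50)  # 4.0 (= log2 16) vs. roughly 8-9
```

The gap between the two averages is exactly the kind of sample-efficiency statistic these benchmarks report, before adding noisy oracles or biased priors.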
5. Advanced Reasoning, Coordination, and Multimodal Extensions
Newer benchmarks interrogate advanced reasoning strategies, coordination among multiple expert agents or AI "personas," and multimodal data fusion. The MAI Diagnostic Orchestrator (MAI-DxO) (Nori et al., 27 Jun 2025) exemplifies orchestrated, panel-based diagnostic reasoning by simulating collaborative physician roles (hypothesis tracking, test selection, stewardship) for stepwise information acquisition and cost management. In experiments, MAI-DxO achieved diagnostic accuracy of 80%–85.5%—substantially exceeding human generalists and reducing diagnostic costs by 20–70% compared to non-orchestrated LLMs.
Multimodal benchmarks now address real-world challenges such as visit sequence missingness and modality dominance. NECHO v2 (Koo, 28 Jul 2024) applies systematic knowledge distillation from a completeness-sensitive teacher to a missing-data-aware student, combining curriculum-guided data erasing, cross-modal transformer distillation, and hierarchical dual-logit alignment to achieve robust sequential diagnosis even under uncertain and imbalanced data incompleteness.
6. Applications and Impact Domains
Sequential diagnosis benchmarks form the backbone of rigorous evaluation in diverse fields. Industrial process monitoring and target identification (0710.4847, Feldman et al., 2014), quality control (0710.4847), personalized medicine and fault triage (Wang, 2015), differential dementia diagnosis from lab-test sequences (Xing et al., 21 Feb 2025), point-of-care MRI task selection (Du et al., 7 May 2025), and clinical symptom inquiry and adaptive diagnosis (Yuan et al., 2021) are all domains represented by dedicated benchmarks. In each, sequential approaches enable earlier, more cost-effective, and more granular detection than static or naive alternatives.
The breadth of these benchmarks has revealed enduring challenges: the curse of dimensionality in belief space (Wang, 2015), the sensitivity to data quality and bias (Mishra et al., 2020), and the limitations of heuristic or black-box performance metrics for future-facing model development.
7. Reflections and Future Directions
Sequential diagnosis benchmarks have evolved significantly, integrating advances in optimal control, approximate inference, reinforcement learning, deep sequence modeling, and meta-evaluation. Contemporary benchmarks push beyond accuracy to demand resource efficiency, robustness to missingness and bias, and transparency of reasoning.
Key open directions include scalable optimal policy computation for high-dimensional/multimodal sequences, rigorous quality metric application, synthesis of richer agent collaboration paradigms, and fully open-ended clinical reasoning. As diagnostic AI permeates real-world settings, the ongoing refinement of systematic, high-fidelity sequential diagnosis benchmarks remains critical for the development and safe deployment of clinically and industrially relevant systems.