Sequential Diagnosis Benchmark

Updated 1 July 2025
  • Sequential diagnosis benchmarks are rigorously constructed datasets and protocols that assess stepwise diagnostic reasoning by iteratively gathering evidence under cost and order constraints.
  • They integrate statistical, combinatorial, abstraction-based, and heuristic AI approaches to optimize measurement selection and reduce diagnostic ambiguity.
  • Applications span clinical, cognitive, and engineered systems, enabling comparative evaluation of diagnostic accuracy, efficiency, and cost-effectiveness.

Sequential diagnosis benchmarks are rigorous, systematically constructed datasets and protocols for evaluating the capability of algorithms or AI systems to perform stepwise, evidence-driven reasoning in domains where the correct outcome depends on incremental, adaptive information gathering. These benchmarks enable comparative assessment of diagnostic strategies, particularly under realistic constraints reflecting domain-specific challenges such as cost, ambiguity, sequence order, and the need to request specific information. The approaches and datasets summarized here span foundational statistical theory, formal frameworks for combinatorial systems, and advanced interactive evaluations for clinical and long-context reasoning.

1. Foundational Bayesian and Model-Based Benchmarks

Sequential diagnosis was historically formalized in the Bayesian sequential change diagnosis framework, which considers a sequence of i.i.d. random variables whose distribution changes at an unknown time and to an unknown new regime. The key objective is to optimally detect and identify this regime change as quickly and accurately as possible, trading off the cost of delay, false alarms, and false identifications (0710.4847). The optimal approach reduces the problem to a Markov optimal stopping problem on the sufficient statistic (posterior probability vector) and yields a decision strategy characterized by convex stopping regions in the probability simplex:

  • At each time, the posterior $\Pi_n$ is updated recursively via

$$\Pi_{n+1}^{(i)} = \frac{D_i(\Pi_n, X_{n+1})}{D(\Pi_n, X_{n+1})},$$

with efficient numerical solution via value iteration on a discretized probability simplex (a minimal sketch of this update follows the list below).

  • The strategy is geometrically characterized, with stopping and continuation regions corresponding to convex areas of the simplex.
  • This framework subsumes classical change detection (Shiryaev's problem) and sequential multi-hypothesis testing.
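
The recursion is straightforward to implement once the model is pinned down. Below is a minimal Python sketch assuming one common formulation (geometric change-time prior with parameter $p$ and a prior $\nu$ over post-change regimes; the exact form of the $D_i$ terms depends on these modeling choices):

```python
import numpy as np

def posterior_update(pi, x, p, nu, f0, fs):
    """One step of the recursion Pi_{n+1}^(i) = D_i(Pi_n, x) / D(Pi_n, x).

    pi : posterior vector; pi[0] = P(no change yet), pi[i] = P(regime i active)
    x  : the new observation X_{n+1}
    p  : geometric change-time prior parameter (an assumption of this sketch)
    nu : prior over post-change regimes, nu[i-1] = P(regime = i)
    f0 : pre-change density; fs[i-1] : density under post-change regime i
    """
    M = len(fs)
    D = np.empty(M + 1)
    D[0] = (1 - p) * pi[0] * f0(x)          # no change through time n+1
    for i in range(1, M + 1):
        # regime i was already active, or the change to i happens now
        D[i] = (pi[i] + p * pi[0] * nu[i - 1]) * fs[i - 1](x)
    return D / D.sum()                       # normalize by D(Pi_n, x)
```

A stopping rule then declares a regime once $\Pi_n$ enters the corresponding convex stopping region; the regions themselves are computed offline by value iteration over a grid on the simplex.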

In the domain of engineered systems, especially digital circuits, active testing and model-based diagnosis frameworks—such as FRACTAL—define benchmarks built upon combinatorial circuit suites like ISCAS85/74XXX (1401.3850). The diagnostic process involves iteratively selecting optimal control assignments or probes to reduce the number of remaining minimal-cardinality diagnoses, emphasizing:

  • Measurement selection heuristics to maximize information gain per test (a concrete scoring function is sketched after this list).
  • Empirical metrics such as geometric decay rate in the number of diagnoses and diagnostic cost.
  • Practical, large-scale experiments on standard circuit benchmarks, supporting rigorous algorithm comparison.
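
As a concrete illustration of the first bullet, the sketch below scores a candidate probe by its expected entropy reduction over the current weighted set of diagnoses; `predict` is a hypothetical helper returning the outcome each diagnosis implies for the probe (in practice derived from the circuit model):

```python
import math
from collections import defaultdict

def expected_information_gain(diagnoses, probs, probe, predict):
    """Expected entropy reduction over the candidate diagnoses from observing
    `probe`; predict(d, probe) gives the outcome diagnosis d implies
    (assumed deterministic here for simplicity)."""
    def H(ws):
        z = sum(ws)
        return -sum(w / z * math.log2(w / z) for w in ws if w > 0)

    by_outcome = defaultdict(list)           # group diagnosis weights by outcome
    for d, w in zip(diagnoses, probs):
        by_outcome[predict(d, probe)].append(w)

    total = sum(probs)
    expected_posterior = sum(sum(ws) / total * H(ws)
                             for ws in by_outcome.values())
    return H(probs) - expected_posterior
```

The probe maximizing this score is measured next; the geometric decay metric in the second bullet then tracks how quickly the diagnosis set shrinks across such measurements.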

2. Abstraction and Scalability Techniques

For larger systems prone to combinatorial explosion in possible diagnoses, the Sequential Diagnosis by Abstraction (SDA) framework introduces scalable abstractions to enable efficient diagnosis (1401.3892):

  • Structural techniques—probabilistic hierarchical diagnosis, component cloning, and cost estimation—are combined to handle large system sizes.
  • Efficient candidate measurement selection is achieved by compiling system models into d-DNNF forms, enabling rapid calculation of posterior probabilities and entropies for variable selection.
  • Benchmarks rely on ISCAS-85 circuits, with empirical analysis demonstrating scalability to systems containing thousands of components, except in pathologically flat topologies.

Cost estimation functions for abstraction selection are explicitly derived, allowing for optimal tradeoff between abstraction size, diagnostic tractability, and expected measurement cost.
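
To make the abstraction idea concrete, the toy sketch below (not the SDA algorithm itself) first localizes a fault at the block level and only then descends into components, so the number of tests scales with the number of blocks plus one block's size rather than with the total component count:

```python
def diagnose_hierarchically(blocks, block_is_faulty, component_is_faulty):
    """Toy two-level diagnosis by abstraction under a single-fault assumption:
    find the implicated block first, then test only its members. For B blocks
    of s components each this needs O(B + s) tests instead of the O(B * s)
    a flat search over all components would require."""
    faulty_block = next(b for b in blocks if block_is_faulty(b))
    return next(c for c in faulty_block if component_is_faulty(c))
```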

3. Sequential Multi-class and Cognitive Diagnosis Benchmarks

In scenarios of noisy, multi-class sequential hypothesis testing, benchmarks are set by the optimal sequential multi-class diagnosis framework (1506.08915). This approach transforms the classic high-dimensional POMDP (belief space) into a low-dimensional problem by exploiting exponential tilting relationships among class distributions:

  • For common observation models (e.g., normal, binomial, Poisson), reachable beliefs lie on $r$-dimensional manifolds, with $r \ll N$ (the number of classes); the Gaussian case is sketched after this list.
  • The optimal diagnostic strategy is computable via dynamic programming in $r$ dimensions, with posterior beliefs reconstructed from low-dimensional diagnostic statistics.
  • This yields substantial performance improvements (up to 66% lower total cost) compared to multi-hypothesis sequential probability ratio tests (MSPRT), particularly in quick/noisy diagnosis settings.
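
The Gaussian case makes the dimensionality reduction explicit: with $N$ classes $\mathcal{N}(\mu_i, \sigma^2)$, the full belief vector is a deterministic function of the scalar statistic $S_n = \sum_{t \le n} X_t$ together with $n$. The sketch below reconstructs it (a standard exponential-family identity, not code from the paper; `mus` and `prior` are NumPy arrays):

```python
import numpy as np

def belief_from_statistic(S, n, mus, sigma, prior):
    """Reconstruct the N-class posterior from the 1-D sufficient statistic
    S = x_1 + ... + x_n under Gaussian observation models N(mu_i, sigma^2):
    the reachable beliefs form a 1-dimensional manifold however large N is."""
    # log-likelihood terms that differ across classes: (mu_i * S - n * mu_i^2 / 2) / sigma^2
    log_w = np.log(prior) + (mus * S - n * mus**2 / 2.0) / sigma**2
    log_w -= log_w.max()              # stabilize before exponentiating
    w = np.exp(log_w)
    return w / w.sum()
```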

For educational and psychological assessment, the sequential Gibbs sampling algorithm serves as a benchmark for Bayesian estimation in cognitive diagnosis models where the number of latent attributes may be large (2006.13790). This algorithm iteratively samples individual attributes for each subject, dramatically reducing complexity from $O(2^K)$ to $O(K)$, while retaining full dependence among attributes.
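
A minimal sketch of one sampling sweep is below; `cond_prob` stands in for the model-specific full conditional $P(\alpha_k = 1 \mid \alpha_{-k}, \text{data})$, which is the only quantity the sampler needs per attribute:

```python
import numpy as np

def sequential_gibbs_sweep(alpha, cond_prob, rng):
    """One sweep of sequential Gibbs sampling over a binary attribute profile:
    each of the K attributes is resampled from its full conditional given the
    others, so a sweep costs O(K) conditional evaluations instead of
    enumerating all 2^K candidate profiles."""
    for k in range(len(alpha)):
        alpha[k] = 1 if rng.random() < cond_prob(k, alpha) else 0
    return alpha

# usage: rng = np.random.default_rng(0); sequential_gibbs_sweep(alpha, cond_prob, rng)
```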

4. Heuristic and AI-driven Benchmarks

Benchmarks that critically explore heuristic selection highlight the impact of measurement selection strategies in real-world systems with numerous diagnostic ambiguities (1807.03083). Key features include:

  • A spectrum of active learning heuristics (entropy, split-in-half, KL divergence, most probable singleton, risk-optimization); the split-in-half criterion is sketched after this list.
  • Empirical evaluation on knowledge base diagnosis tasks, with metrics such as query count to unequivocal identification and scenario-dependent performance overhead.
  • The finding that no single strategy dominates across all scenarios and that a poor heuristic choice can increase diagnostic cost by over 250%.
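
Since no single heuristic dominates, implementations typically expose several scoring functions behind a common interface. The sketch below shows the split-in-half criterion for binary queries (the entropy-based information-gain score was sketched in Section 1); `predict` is again a hypothetical helper giving the answer a diagnosis entails:

```python
def split_in_half_score(diagnoses, query, predict):
    """Split-in-half heuristic: prefer queries whose yes/no answer partitions
    the current diagnoses as evenly as possible. A score of 0 is a perfect
    split, halving the candidate set whichever answer comes back."""
    yes = sum(1 for d in diagnoses if predict(d, query))
    return -abs(2 * yes - len(diagnoses))    # higher (closer to 0) is better
```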

Automated search-based approaches, such as DynamicHS, optimize classical model-based diagnostic search trees by incrementally maintaining state across diagnostic sessions (1907.12130, 2012.11078):

  • DynamicHS avoids redundant recomputation in sequential diagnosis by maintaining a persistent hitting set tree, optimizing both computational efficiency and the number of expensive reasoner calls (a toy illustration of the state-retention idea follows this list).
  • Benchmarks built on real-world, expressive knowledge bases and ontologies confirm median computational savings of 52% and maximal savings up to 89% over stateless approaches.
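
The sketch below illustrates the state-retention idea at toy scale; it is not DynamicHS itself, which maintains a full hitting-set tree, but it shows how candidate diagnoses (minimal hitting sets of the known conflicts) can be kept across rounds, with a newly discovered conflict repairing only the candidates it breaks:

```python
class PersistentDiagnosisState:
    """Keep minimal hitting sets (candidate diagnoses) alive across sequential
    diagnosis rounds; a new conflict repairs only the candidates it
    invalidates instead of recomputing everything from scratch."""

    def __init__(self):
        self.conflicts = []
        self.candidates = {frozenset()}     # the empty set hits zero conflicts

    def add_conflict(self, conflict):
        conflict = frozenset(conflict)
        self.conflicts.append(conflict)
        repaired = set()
        for cand in self.candidates:
            if cand & conflict:             # still a hitting set: reuse as-is
                repaired.add(cand)
            else:                           # repair by adding one conflict element
                repaired.update(cand | {c} for c in conflict)
        # retain only subset-minimal candidates
        self.candidates = {c for c in repaired
                           if not any(o < c for o in repaired)}
```

For conflicts {g1, g2} and then {g2, g3}, the retained candidates are {g2} and {g1, g3}, obtained without rebuilding the search from scratch.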

5. Clinical Sequential Diagnosis Benchmarks

Recent work elevates clinical diagnostic evaluation beyond static vignettes using benchmarks that enforce sequential evidence gathering, cost sensitivity, and iterative action selection.

The Sequential Diagnosis Benchmark (SDBench) (2506.22405) transforms 304 NEJM Clinicopathological Conference cases into stepwise encounters, where agents (physicians or AI) must:

  • Begin with a minimal case vignette and iteratively request further details or tests via explicit queries to a gatekeeper model.
  • Pay real (simulated) monetary costs for each physician visit and ordered test, with cost mappings based on US hospital prices.
  • Make final diagnoses, adjudicated by a Judge model against a detailed rubric, with correctness defined as a score of $\geq 4$ on a 5-point Likert scale.
  • Benchmark performance via both diagnostic accuracy and total diagnostic cost.
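
The interaction protocol above can be pictured as the loop below; the function and action names are illustrative rather than SDBench's actual interface, and the budget cap is added here only to bound the sketch:

```python
def run_sequential_encounter(agent, gatekeeper, judge, vignette, budget):
    """Schematic SDBench-style episode: the agent alternates between querying
    the gatekeeper (questions or test orders, each carrying a dollar cost) and
    eventually committing to a diagnosis, which the judge scores on a 5-point
    rubric (correct iff score >= 4)."""
    history, cost = [vignette], 0.0
    while cost < budget:
        action = agent(history)            # e.g. {"ask": ...} / {"test": ...} / {"diagnose": ...}
        if "diagnose" in action:
            score = judge(action["diagnose"])
            return action["diagnose"], score >= 4, cost
        finding, fee = gatekeeper(action)  # returned finding plus visit/test price
        history.append(finding)
        cost += fee
    return None, False, cost               # budget exhausted without a diagnosis
```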

A model-agnostic orchestrator, MAI-DxO, structures AI reasoning by simulating a virtual panel of physicians, with distinct personas focused on hypothesis ranking, test value selection, cost minimization, and adversarial reasoning. MAI-DxO, when paired with state-of-the-art LLMs, achieves diagnostic accuracy of 80–85.5% while reducing cost by 20–70% compared to both unaided physicians and off-the-shelf LLMs. These gains are robust across a wide range of model families (OpenAI, Gemini, Claude, Grok, DeepSeek, Llama), with greatest absolute improvements for weaker models.

In contrast, DiagnosisArena (2505.14107) focuses on open-ended diagnostic reasoning with 1,113 cases from 28 specialties, requiring generation of up to five diagnoses per case. The evaluation avoids multiple-choice formats—which can artificially boost scores—and relies on GPT-4o scoring for correctness. Even the best models, such as o3-mini, achieve only 45.8% top-1 accuracy in this setting, well below thresholds for clinical safety.

6. Order, Retrieval, and Sequence Reasoning Benchmarks

Benchmarks such as STEPS (2306.04441) and Sequential-NIAH (2504.04713) extend the scope of sequential diagnosis by evaluating models' order reasoning and information retrieval from long contexts:

  • STEPS assesses an agent's (or model's) ability to recognize correct action order in recipes and multi-step instructions, using both classification and multi-choice formulations. Results show that zero-shot and in-context learning are inadequate for robust order reasoning; fine-tuning is required to achieve competitive results.
  • Sequential-NIAH challenges LLMs to extract, in correct order, multiple items ("needles") from extremely long textual contexts (up to 128K tokens) in synthetic, real, and open-domain QA scenarios. Current LLMs peak at only 63% accuracy, revealing major limitations in long-context sequential extraction and ordering capabilities.
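
A natural way to score such outputs is an in-order subsequence check, as in the sketch below (illustrative; the benchmark's own grader may use more elaborate matching):

```python
def needles_in_order(extracted, gold):
    """True iff every gold 'needle' appears in the model's extracted list and
    in the original order, i.e. gold is a subsequence of extracted. The `in`
    test on an iterator consumes it, which enforces the ordering."""
    it = iter(extracted)
    return all(needle in it for needle in gold)

# needles_in_order(["a", "x", "b", "c"], ["a", "b", "c"]) -> True
# needles_in_order(["a", "x", "b", "c"], ["c", "b", "a"]) -> False
```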

7. Robust Sequential Diagnosis under Incomplete and Multimodal Data

Contemporary benchmarks are increasingly focused on real-world data imperfections such as missingness and multimodality. In NECHO v2 (2407.19540), the diagnosis process is benchmarked under conditions where patient visit sequences are incomplete and modalities (demographics, clinical notes, codes) may be missing in unbalanced ways:

  • The framework employs systematic knowledge distillation between a teacher model trained on complete data and a student exposed to incomplete data, using contrastive, hierarchical, transformer-level, and dual logit distillation losses.
  • Randomized erasing of individual datapoints during training and distillation aligns the data distributions between teacher and student, simulating realistic missingness.
  • Benchmark results on MIMIC-III show robust accuracy even under substantial or imbalanced missingness patterns, outperforming prior single-modality and knowledge distillation baselines.
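
The distillation objective combines several terms; the sketch below shows only a generic logit-distillation component of the kind listed above (a standard formulation, not NECHO v2's exact losses), where the student's logits come from randomly erased inputs and the teacher's from complete data:

```python
import torch
import torch.nn.functional as F

def logit_distillation_loss(student_logits, teacher_logits, labels,
                            T=2.0, alpha=0.5):
    """Generic logit distillation: temperature-softened KL toward the teacher
    (trained on complete data) plus hard cross-entropy on the true diagnosis
    labels. T and alpha are conventional hyperparameters, not values from
    the paper."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)   # T^2 rescales gradients
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```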

Conclusion

Sequential diagnosis benchmarks define rigorous standards for evaluating the ability of algorithms and AI systems to reason iteratively, adaptively, and efficiently in the presence of partial evidence, cost constraints, and uncertainty. Across statistical, combinatorial, heuristic, and clinical domains, the benchmarks enumerated here have established both conceptual and computational advances, from Bayesian optimal stopping in stochastic systems to stepwise, cost-constrained, physician-inspired orchestration in clinical care. Progress is measured not only by accuracy but also by domain-relevant cost, sample-efficiency, and interpretability metrics, and the benchmarks themselves have exposed key limitations at the frontier of current AI, particularly with respect to generalization, real-world order reasoning, long-context retrieval, and resilience to incomplete data. As such, these benchmarks continue to shape and challenge the trajectory of sequential reasoning research.