CURE-Bench: Therapeutic Reasoning Benchmark
- CURE-Bench is a benchmark for therapeutic reasoning systems that integrates multi-dimensional evaluation of answer accuracy, tool utilization, and expert-reviewed reasoning quality.
- It simulates real-world clinical challenges by incorporating drug recommendation, treatment planning, and adverse-effect prediction through diverse question formats.
- Its design combines automated metrics with expert human review to ensure procedural correctness and robust retrieval, emphasizing clinical safety in therapeutic decision-making.
CURE-Bench is a NeurIPS 2025 challenge benchmark for therapeutic-reasoning systems under clinically relevant conditions. It is designed for settings in which therapeutic decision-making is not merely a question-answering problem, but a high-stakes process requiring safe, multi-step reasoning over patient characteristics, disease processes, pharmacological agents, contraindications, side effects, and current biomedical knowledge. Its defining feature is a multi-dimensional evaluation that treats not only final answer correctness, but also tool utilization and reasoning quality, as first-class targets of assessment, with expert human review in the evaluation loop (Cofala et al., 12 Dec 2025).
1. Clinical problem setting
CURE-Bench is motivated by the structure of clinical therapeutics itself. Drug recommendation, treatment planning, and adverse-effect prediction require models to reason across interacting patient state, disease context, and drug properties, while remaining grounded in reliable and up-to-date biomedical evidence. In this setting, failures are not merely factual errors; they can become clinical safety risks. This is why the benchmark is framed around therapeutic reliability, rather than around generic biomedical QA performance (Cofala et al., 12 Dec 2025).
The benchmark therefore operationalizes a stricter notion of medical capability than answer-only evaluation. The underlying premise is that a therapeutic agent must reason coherently, retrieve relevant evidence, invoke the correct tools, and use the returned information appropriately. This suggests a benchmark philosophy in which safe medical AI is inseparable from procedural correctness: the route taken to an answer matters because flawed retrieval or malformed tool use can propagate into harmful recommendations.
A common misconception is to view CURE-Bench as a static knowledge test. The reported framing does not support that interpretation. The benchmark is instead positioned as an evaluation of whether a system can function as a tool-using therapeutic assistant in the presence of dynamic biomedical information and stringent safety constraints (Cofala et al., 12 Dec 2025).
2. Task formats and benchmark composition
The task suite spans several therapeutic reasoning modes, explicitly including drug recommendation, treatment planning, and adverse-effect prediction. The benchmark also probes multiple response formats, which are intended to separate generative reasoning from discriminative selection and from explanation-choice alignment (Cofala et al., 12 Dec 2025).
| Question style | Definition |
|---|---|
| OE | free-text answer expected |
| MC | four options are provided, and the model must choose one |
| OE-MC | the model first generates an open-ended answer and then uses that as context to select the correct multiple-choice option |
The reported dataset sizes are 459 validation questions and two test sets of 2,097 and 2,491 questions. A later competition-oriented analysis refers to phase1 as the public leaderboard or testset_phase1, and phase2 as the private leaderboard or testset_phase2, indicating a benchmark workflow with both public and private evaluation surfaces (Cofala et al., 12 Dec 2025).
The three question styles are methodologically important. OE stresses free-form therapeutic generation; MC isolates constrained selection; OE-MC tests whether a model can align an internally generated answer with a final option choice. A plausible implication is that CURE-Bench is designed to expose inconsistencies between latent reasoning and externally committed answers, rather than to reward label selection alone.
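The difference between the three styles can be made concrete with a minimal sketch. It assumes a generic `generate(prompt) -> str` model interface; the function names and prompt wording are hypothetical illustrations, not the benchmark's actual prompts.

```python
from typing import Callable, Dict, Tuple

# Hypothetical model interface: prompt text in, generated text out.
Generate = Callable[[str], str]


def format_options(options: Dict[str, str]) -> str:
    """Render options as 'A: text' lines, ordered by letter."""
    return "\n".join(f"{key}: {text}" for key, text in sorted(options.items()))


def answer_oe(generate: Generate, question: str) -> str:
    """OE: the model produces a free-text answer."""
    return generate(f"Question: {question}\nAnswer in free text.")


def answer_mc(generate: Generate, question: str, options: Dict[str, str]) -> str:
    """MC: the model selects one of the provided options."""
    prompt = f"Question: {question}\n{format_options(options)}\nReply with one letter."
    return generate(prompt).strip()[:1].upper()


def answer_oe_mc(
    generate: Generate, question: str, options: Dict[str, str]
) -> Tuple[str, str]:
    """OE-MC: generate free text first, then select an option with it as context."""
    draft = answer_oe(generate, question)
    prompt = (
        f"Question: {question}\nDraft answer: {draft}\n"
        f"{format_options(options)}\nReply with one letter."
    )
    return draft, generate(prompt).strip()[:1].upper()
```

Returning both the draft and the selected letter from `answer_oe_mc` makes it possible to check whether the committed option actually agrees with the free-text answer that preceded it.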
3. Evaluation dimensions
CURE-Bench does not evaluate only final answer correctness. The reported framing states that the challenge combines “metrics for answer accuracy, tool utilization, and reasoning validity with expert human review,” and that, in medical settings, both the reasoning trace and the sequence of tool invocations are critical because failures in either can lead to unsafe downstream recommendations (Cofala et al., 12 Dec 2025).
This evaluation design treats token-level reasoning and tool-usage behaviors as explicit supervision signals. The benchmark’s significance lies in the fact that it scores not just whether a system arrives at the correct answer, but whether it does so through clinically defensible tool use and coherent reasoning. That makes CURE-Bench structurally different from benchmarks that collapse all model behavior into a single end-label metric.
The emphasis on expert human review is also consequential. Although the available description does not provide a full formal metric table, it is explicit that correctness, tool utilization, and reasoning validity are all part of the evaluation picture. This suggests that benchmark success is defined by a hybrid of automated and expert-mediated assessment, reflecting the benchmark’s orientation toward clinical safety rather than leaderboard minimalism.
4. Tool-augmented agentic reasoning
The benchmark is closely associated with agentic therapeutic reasoning, especially as analyzed through TxAgent. TxAgent is described as an agentic therapeutic-reasoning system built on a fine-tuned Llama-3.1-8B backbone, paired with a smaller Qwen2-1.5B model, and connected to ToolUniverse, a unified biomedical tool suite integrating resources such as FDA drug data, OpenTargets, and Monarch. Its iterative retrieval-augmented workflow is termed ToolRAG (Cofala et al., 12 Dec 2025).
The reported workflow is sequential and explicit. Given a therapeutic question, the LLM first rewrites the query to make the intention explicit. That rewritten query is compared against ToolUniverse function descriptions. The Qwen2-1.5B component returns the most promising function calls, after which Llama-3.1-8B performs tool selection. TxAgent then decides which tools to call, how many calls to make, and how to fill in parameters. The calls are generated in JSON format, parsed by the framework, and executed by ToolUniverse in an external loop. Retrieved information is fed back to the LLM, which decides whether another retrieval round is needed. The loop repeats until enough information has been gathered to answer the question (Cofala et al., 12 Dec 2025).
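The call, execute, and feed-back part of this loop can be sketched as follows. This is a simplified illustration rather than TxAgent's implementation: `propose_calls` stands in for the LLM, `tools` for ToolUniverse, the `{"name": ..., "arguments": ...}` call layout is assumed, and the `max_rounds` cap is an added safeguard that the source does not mention. Query rewriting and tool retrieval are omitted.

```python
import json
from typing import Callable, Dict, List

# Hypothetical stand-in for the LLM: given the question and the evidence so
# far, return a JSON list of tool calls; "[]" means enough has been gathered.
ProposeCalls = Callable[[str, List[str]], str]


def run_tool_loop(
    question: str,
    propose_calls: ProposeCalls,
    tools: Dict[str, Callable[..., object]],
    max_rounds: int = 5,
) -> List[str]:
    """Repeat: request JSON tool calls, execute them, feed the results back."""
    evidence: List[str] = []
    for _ in range(max_rounds):
        raw = propose_calls(question, evidence)
        try:
            calls = json.loads(raw)
        except json.JSONDecodeError:
            evidence.append("error: tool calls were not valid JSON")
            continue
        if not isinstance(calls, list):
            evidence.append("error: expected a JSON list of tool calls")
            continue
        if not calls:
            break  # the model signals that no further retrieval is needed
        for call in calls:
            if not isinstance(call, dict):
                evidence.append("error: each tool call must be a JSON object")
                continue
            name = call.get("name")
            tool = tools.get(name)
            if tool is None:
                evidence.append(f"error: unknown tool {name!r}")
                continue
            try:
                evidence.append(str(tool(**call.get("arguments", {}))))
            except TypeError as exc:  # e.g. a misspelled parameter name
                evidence.append(f"error calling {name}: {exc}")
    return evidence
```

Errors are appended to the evidence list instead of being raised, so the model sees them in the next round and can correct the call.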
This architecture matters for understanding CURE-Bench because it makes tool interaction a visible part of benchmark behavior. In other words, the benchmark is not merely compatible with agentic systems; it is structured so that the quality of tool selection, tool invocation, and evidence reuse becomes central to measured therapeutic competence.
5. Retrieval as a benchmark bottleneck
A major empirical conclusion drawn from CURE-Bench participation is that function/tool-call retrieval is a frequent source of downstream reasoning errors. The reported recurring issues include repeated function calls due to incorrectly formatted parameter names, incorrect function selection even when better candidates were retrieved, and tool calls that fail to return the needed information. In this formulation, retrieval quality is not an implementation detail but a direct determinant of therapeutic reliability (Cofala et al., 12 Dec 2025).
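One way to surface the malformed-parameter failure before a call is executed is to check the call against a declared parameter schema. The sketch below is an illustrative guard, not something the source reports using; the schema layout and the tool name `get_label` in the usage note are hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical schema layout: tool name -> {"required": {...}, "optional": {...}}
Schemas = Dict[str, Dict[str, Set[str]]]


def validate_call(call: Dict, schemas: Schemas) -> List[str]:
    """Return a list of problems; an empty list means the call looks well-formed."""
    name = call.get("name")
    if name not in schemas:
        return [f"unknown tool: {name!r}"]
    given = set(call.get("arguments", {}))
    required = schemas[name]["required"]
    allowed = required | schemas[name]["optional"]
    problems = [f"missing parameter: {p}" for p in sorted(required - given)]
    problems += [f"unexpected parameter: {p}" for p in sorted(given - allowed)]
    return problems
```

For a schema `{"get_label": {"required": {"drug_name"}, "optional": {"limit"}}}`, a call that passes `drug` instead of `drug_name` yields both a missing-parameter and an unexpected-parameter message, which is enough to prompt a corrected call rather than a repeated one.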
TxAgent’s original tool-selection mechanism is described as a fine-tuned encoder-decoder pipeline in which Llama-3.1-8B rewrites the question and Qwen2-1.5B compares that rewrite to tool descriptions, returning the top-k tool names based on cosine similarity. The reported comparison across retrievers yields several benchmark-relevant observations: BM25 struggles because it relies on lexical overlap and the tool descriptions are short; dense retrievers perform similarly but with runtime variability; TxAgent’s specialized retriever performs best among the compared baselines because it was fine-tuned on therapeutic questions and tool-call labels; and integrating DailyMed improves performance further by providing richer and more up-to-date drug-label information (Cofala et al., 12 Dec 2025).
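Ranking tools by cosine similarity between a query embedding and tool-description embeddings can be sketched in a few lines. The sketch assumes the embeddings have already been computed by some encoder, which is not shown; the function names and the default `k` are illustrative.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; defined as 0.0 when either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_tools(
    query_vec: Sequence[float],
    tool_vecs: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k tool names whose description embeddings best match the query."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in tool_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]
```

Because the ranking depends only on the embeddings, the quality of the retrieved candidates is determined by how well the encoder maps therapeutic questions and short tool descriptions into the same space, which is consistent with the reported advantage of a retriever fine-tuned for that purpose.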
The DailyMed result has a specific safety interpretation. The reported analysis argues that openFDA tools are useful for granular metadata but often lack the full context needed for robust medical reasoning in a single call, whereas DailyMed provides authoritative Structured Product Labeling (SPL) with complete, version-controlled clinical narratives. Registering DailyMed in ToolUniverse with a semantic description allows TxAgent to discover and invoke it via the same tool-selection mechanism. This suggests that benchmark performance improvements can reflect not only better retrieval ranking, but also access to clinically richer evidence sources.
The paper also reports a fixed retrieval experiment in which retrieved context is held constant and models are evaluated using a tool-query (TQ) prompt consisting of retrieved information followed by the query, without TxAgent’s more specialized agentic prompting. In that setting, performance drops without retrieval, and permuting answer options in MC and OE-MC usually lowers accuracy further. Fine-tuning improves the ability to use context: TxAgent’s fine-tuned Llama-3.1-8B outperforms a non-fine-tuned Llama-3.1-8B with the same context. The authors also observe that some smaller models can still leverage retrieved context effectively, suggesting that context-aware therapeutic reasoning can be cost-efficient when retrieval is strong (Cofala et al., 12 Dec 2025).
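The two ingredients of this experiment, a TQ prompt and answer-option permutation, can be sketched as below. The exact prompt layout and shuffling procedure used by the authors are not given in the description, so the separator, the seeded shuffle, and the assumption that option texts are unique are all illustrative choices.

```python
import random
from typing import Dict, List, Tuple


def build_tq_prompt(retrieved: List[str], query: str) -> str:
    """TQ prompt: retrieved information first, followed by the query."""
    return "\n".join(retrieved) + "\n\n" + query


def permute_options(
    options: Dict[str, str], correct: str, seed: int
) -> Tuple[Dict[str, str], str]:
    """Shuffle option texts across the letters and return the new correct letter.

    Assumes option texts are unique, so the correct text can be located again.
    """
    letters = sorted(options)
    texts = [options[letter] for letter in letters]
    random.Random(seed).shuffle(texts)
    new_options = dict(zip(letters, texts))
    new_correct = letters[texts.index(options[correct])]
    return new_options, new_correct
```

Evaluating the same question under several seeds separates a model that identifies the correct content from one that favors a particular letter position.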
The TxAgent-centered CURE-Bench work received the Excellence Award in Open Science, and explicitly attributes performance gains to improved retrieval for function calls and to DailyMed integration. Within the benchmark’s own logic, this outcome reinforces the view that tool-retrieval quality is a primary variable in therapeutic agent performance (Cofala et al., 12 Dec 2025).
6. Competition evolution, failure modes, and later systems
A later benchmark-focused study presents CureAgent as a competition-winning, training-free Executor-Analyst framework for CURE-Bench. That work frames the benchmark as one for therapeutic decision-making at scale, where systems must access a large ToolUniverse, perform multi-step structured reasoning, ground final clinical answers in retrieved evidence, and handle dynamic biomedical information. The paper’s central diagnosis is that the main bottleneck is not evidence access alone, but a Context Utilization Failure: the agent successfully retrieves sufficient biomedical evidence, yet fails to ground its final clinical decision in that evidence (Xie et al., 5 Dec 2025).
A validation failure analysis over 413 multiple-choice questions, including 73 failed cases, identifies four main error categories. Three of them are reported with shares of the failed cases: Reasoning / Retrieval Failure is dominant at 65.8%, Output Parsing Errors account for 19.2%, and Instruction Adherence Failures account for 12.3%, which together cover 97.3% of failures. Context Utilization Failure is highlighted as the most important conceptual subtype: retrieval succeeds, but evidentiary grounding fails (Xie et al., 5 Dec 2025).
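The reported figures can be cross-checked with simple arithmetic. In the sketch below, the case counts per category are inferred by rounding the reported percentages against the 73 failed cases; they are not stated in the source and should be read as approximations.

```python
total_questions = 413
failed_cases = 73

# Shares of failed cases as reported, in percent.
reported_shares = {
    "Reasoning / Retrieval Failure": 65.8,
    "Output Parsing Errors": 19.2,
    "Instruction Adherence Failures": 12.3,
}

# Overall failure rate on the analyzed validation questions, in percent.
failure_rate_pct = round(100 * failed_cases / total_questions, 1)

# Inferred (not reported) case counts, obtained by rounding share * failed_cases.
implied_counts = {
    name: round(failed_cases * pct / 100) for name, pct in reported_shares.items()
}

# Share and inferred count of failed cases not covered by the three listed shares.
unlisted_share_pct = round(100 - sum(reported_shares.values()), 1)
unlisted_count = failed_cases - sum(implied_counts.values())

print(failure_rate_pct, implied_counts, unlisted_share_pct, unlisted_count)
```

Under these assumptions the failed cases are roughly 17.7% of the analyzed questions, and the three listed shares leave about 2.7% of failures, or roughly two cases, unaccounted for.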
CureAgent addresses this by separating responsibilities. The Executor is implemented using TxAgent, based on Llama-3.1-8B, and is responsible for tool retrieval and evidence gathering. The Analyst is a long-context reasoning model, Gemini 2.5 Flash in the main experiments, which receives the aggregated evidence and performs reasoning without needing to execute tools. A deterministic post-processing module handles regex-based format calibration, deduplication, and output-schema enforcement (Xie et al., 5 Dec 2025).
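A deterministic post-processing step of this kind might look like the following sketch. The regular expressions, the choice of evidence strings as the thing being deduplicated, and the `{"choice": ...}` output record are assumptions made for illustration; the source does not specify the actual patterns or schema.

```python
import re
from typing import Dict, List, Optional

# Matches phrasings such as "The answer is B." or "Answer: (C)". Only the
# keywords are case-insensitive; the option letter must be an uppercase A-D
# that is not the first letter of a longer word.
CHOICE_PATTERN = re.compile(
    r"(?i:answer|choice)\s*(?:(?i:is)|:)?\s*\(?([A-D])\)?(?![A-Za-z])"
)
BARE_PATTERN = re.compile(r"\(?([A-D])\)?\.?")


def extract_choice(text: str) -> Optional[str]:
    """Recover an option letter from free-form model output, if one is present."""
    match = CHOICE_PATTERN.search(text)
    if match:
        return match.group(1)
    bare = BARE_PATTERN.fullmatch(text.strip())
    return bare.group(1) if bare else None


def dedupe(items: List[str]) -> List[str]:
    """Drop repeats, ignoring case and whitespace, keeping first-seen order."""
    seen = set()
    unique = []
    for item in items:
        key = " ".join(item.split()).lower()
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


def enforce_schema(text: str) -> Dict[str, str]:
    """Return a fixed output record, or raise if no valid choice is recoverable."""
    choice = extract_choice(text)
    if choice is None:
        raise ValueError("no valid option letter found")
    return {"choice": choice}
```

Keeping this step deterministic means that parsing failures are reproducible and can be separated from reasoning failures when analyzing errors.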
The paper compares two collaboration topologies. Config A: Global Pooling (Early Fusion) pools all Executor outputs into a single context before analysis. Config B: Stratified Ensemble (Late Fusion) partitions the Executor budget into parallel subgroups, lets each subgroup aggregate evidence independently, sends each unique context to its own Analyst, and then fuses final answers by self-consistency. The reported argument is that global pooling creates an information bottleneck, whereas Stratified Ensemble preserves evidentiary diversity longer (Xie et al., 5 Dec 2025).
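The structural difference between the two topologies can be sketched as follows. The round-robin partitioning, the `Analyst` callable interface, and the use of majority voting to fuse Analyst answers in Config A are assumptions; the source states self-consistency fusion only for Config B and does not describe how subgroups are formed.

```python
from collections import Counter
from typing import Callable, List, Sequence

# Hypothetical Analyst interface: a list of evidence strings in, an answer out.
Analyst = Callable[[List[str]], str]


def majority_vote(answers: Sequence[str]) -> str:
    """Self-consistency fusion: the most common answer wins (ties: first seen)."""
    return Counter(answers).most_common(1)[0][0]


def global_pooling(executor_outputs: Sequence[str], analysts: Sequence[Analyst]) -> str:
    """Config A (early fusion): every Analyst reads the same pooled context."""
    pooled = list(executor_outputs)
    return majority_vote([analyst(pooled) for analyst in analysts])


def stratified_ensemble(
    executor_outputs: Sequence[str], analysts: Sequence[Analyst]
) -> str:
    """Config B (late fusion): each Analyst reads its own subgroup of outputs."""
    n = len(analysts)
    groups = [list(executor_outputs[i::n]) for i in range(n)]
    answers = [analyst(group) for analyst, group in zip(analysts, groups)]
    return majority_vote(answers)
```

In the pooled variant every Analyst sees identical evidence, so their answers tend to be correlated; in the stratified variant each Analyst reasons over a distinct context, and diversity is preserved until the final vote.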
The reported phase2 benchmark numbers are central to the benchmark’s competition history. TxAgent alone achieves 69.325 on phase2. Gemini 2.5 Flash without tools scores 63.104, and with search 69.627. Collaborative variants improve further: TxAgent + Gemini 2.5 Flash, Config A, 30 Executors / 3 Analysts reaches 80.510; Config B, 10 Executors / 3 Analysts reaches 81.367; and Config B, 10 Executors / 3 Analysts + search reaches 83.803, which is the best reported leaderboard configuration (Xie et al., 5 Dec 2025).
These results reinforce the benchmark’s underlying thesis. CURE-Bench is not well characterized as a test of fluent medical question answering, nor even as a retrieval benchmark in the narrow sense. Its reported behavior indicates a three-part difficulty: selecting the right tools, invoking them correctly, and performing evidence-grounded therapeutic synthesis. The broader lesson drawn across the benchmark literature is that tool-usage supervision and retrieval quality are central to safety, reliability, and clinical usefulness, not optional auxiliary concerns (Cofala et al., 12 Dec 2025).