QuarkMedSearch: Agentic Medical Deep Search

Updated 4 July 2026

QuarkMedSearch is a specialized agentic system for deep medical search in Chinese, characterized by iterative multi-hop evidence retrieval and dynamic tool invocation.
It combines a four-phase data synthesis pipeline with two-stage post-training (supervised fine-tuning followed by reinforcement learning) to address long-horizon search and to ensure that training tasks genuinely require retrieval.
On its own expert-verified benchmark it improves over its backbone from 40.71 to 55.71, with further gains on general deep-search benchmarks and fewer tool calls per correct answer on BrowseComp after reinforcement learning, underscoring the impact of vertical-domain specialization.

QuarkMedSearch is a domain-specialized long-horizon deep-search agent for the Chinese medical domain, built on top of Tongyi DeepResearch 30B-A3B and designed for medical deep search rather than ordinary medical question answering. In the formulation presented in “QuarkMedSearch: A Long-Horizon Deep Search Agent for Exploring Medical Intelligence” (Lin et al., 14 Apr 2026), the target task requires live-web and specialized-source retrieval, multi-step planning, tool invocation, evidence integration across hops, search revision, and factually grounded answer synthesis. The system is defined as a full-pipeline vertical-domain effort spanning medical multi-hop data construction, post-training, and expert-curated evaluation, with the central claim that vertical-domain agentic performance depends on the joint quality of all three components rather than on backbone scaling alone.

1. Domain definition and problem setting

QuarkMedSearch is explicitly situated in the setting of medical deep search, a sequential decision-making problem in an open environment. The paper distinguishes this from ordinary retrieval-augmented generation, where a model typically retrieves once and answers from a fixed document bundle. In QuarkMedSearch, the query is often constructed so that it cannot be solved by one retrieval call; instead, the agent must iteratively reason about missing information, choose a search direction, inspect returned evidence, revise the plan, and continue until the hidden constraints are resolved (Lin et al., 14 Apr 2026).

The system is motivated by three deficits in vertical-domain agentic modeling that the paper treats as simultaneous: the scarcity of high-quality long-horizon medical data, the absence of a domain-tailored post-training recipe, and the lack of a rigorous benchmark. The emphasis is specifically on Chinese medical search, where the paper argues that prior resources are weaker than those available for general English web-agent settings. A major concern is that strong pretrained LLMs already encode substantial medical knowledge parametrically, so they may answer many questions directly and thereby evade learning the search behavior itself. QuarkMedSearch therefore targets not only answer accuracy but also retrieval necessity, long-horizon planning, and controllable tool use.

Within this framing, “long-horizon” is operational rather than rhetorical. The synthesized tasks are often designed to require more than 10 reasoning hops, and a sampled subset is reported to induce roughly BrowseComp-level tool-use length, making the benchmarked behavior closer to deep web exploration than to conventional domain QA (Lin et al., 14 Apr 2026).

2. Agent formalization and search mechanics

The paper formalizes QuarkMedSearch with a ReAct-style interaction process. For a query $q$ , the model maintains a trajectory of reasoning states $\tau_t$ , actions $a_t$ , and observations $o_t$ . The history before step $t$ is

$\mathcal{H}_{t-1}=\langle q,(\tau_0,a_0,o_0),\ldots,(\tau_{t-1},a_{t-1},o_{t-1})\rangle,$

and each next reasoning-action pair is sampled as

$(\tau_t,a_t)\sim \pi(\cdot\mid \mathcal{H}_{t-1}),$

with the environment returning

$o_t=\mathcal{F}(a_t).$

The completed trajectory $\mathcal{H}_T$ is then used to produce the final answer $y=g(q,\mathcal{H}_T)$ (Lin et al., 14 Apr 2026).
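The interaction loop defined above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `History`, `run_episode`, the `"answer:"` action prefix, and the toy policy and environment are all hypothetical names introduced here, and the final answer is read directly from the terminal action rather than produced by a separate function $g$.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, str, str]  # (reasoning tau_t, action a_t, observation o_t)


@dataclass
class History:
    """H_t: the query plus every completed (tau, a, o) triple."""
    query: str
    steps: List[Step] = field(default_factory=list)


def run_episode(
    query: str,
    policy: Callable[[History], Tuple[str, str]],
    env: Callable[[str], str],
    max_steps: int = 50,
) -> Tuple[Optional[str], History]:
    history = History(query)
    for _ in range(max_steps):
        thought, action = policy(history)      # (tau_t, a_t) ~ pi(. | H_{t-1})
        if action.startswith("answer:"):       # terminal action (assumed encoding)
            return action[len("answer:"):].strip(), history
        observation = env(action)              # o_t = F(a_t)
        history.steps.append((thought, action, observation))
    return None, history                       # step budget exhausted, no answer


# Toy stand-ins for the policy pi and the environment F (illustration only).
def toy_policy(history: History) -> Tuple[str, str]:
    if len(history.steps) < 2:
        return "need more evidence", f"search: hop {len(history.steps)}"
    return "enough evidence", "answer: 42"


def toy_env(action: str) -> str:
    return f"results for [{action}]"
```

Calling `run_episode("toy question", toy_policy, toy_env)` performs two search steps and then terminates with an answer. A real agent would replace `toy_policy` with a language-model call and `toy_env` with search and visit tools.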

The action space consists of search, visit, and answer. In the data-construction environment, this core space is augmented by a specialized medical retrieval tool and an LLM-based retrieval-necessity check. The paper’s intended agent capabilities are therefore broader than answer generation alone: decomposition of a complex question into subgoals, multi-hop evidence chaining, tool selection, self-correction after failed search paths, efficient search under long-context constraints, and grounded final synthesis.

This formalization matters because QuarkMedSearch is trained on trajectories rather than only on input-output pairs. The supervision target is therefore the full search process: how the model plans, how it invokes tools, how it accumulates context, and when it terminates. A representative appendix case reportedly requires 29 tool calls and traverses phototherapy physics, DNA repair biology, Fanconi-anemia genetics, clinical phenotype, 2025 literature retrieval, and hospital honor identification, illustrating the paper’s definition of genuine long-horizon search (Lin et al., 14 Apr 2026).

3. Data synthesis and post-training pipeline

The system’s most distinctive component is its four-phase medical deep-search data-construction pipeline. Phase 1 builds seed questions from a large-scale medical knowledge graph integrating real-world internet medical resources and millions of professional medical book corpora. To suppress parametric-memory shortcuts, the sampling procedure emphasizes long-tail entities using graph frequency based on in-degree and out-degree, then extracts medically meaningful subgraphs along relations such as symptom findings, associated findings, complications, contraindications, and dietary recommendations. The selected path depth is 4–6 hops, with the terminal entity used as the answer and the path constraints converted into natural-language questions. Multiple strong frontier closed-source models are then used for consistency verification so that only graph-consistent seed items are retained (Lin et al., 14 Apr 2026).
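As a rough illustration of Phase 1, the sketch below samples a start entity with a bias toward low-degree (long-tail) nodes and then walks a fixed-depth relation path whose last entity would serve as the answer. The inverse-degree weighting, the triple format, and the function names are assumptions made for this example. The article states only that sampling uses in-degree and out-degree and that paths are 4 to 6 hops deep.

```python
import random
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def degree_counts(edges: List[Triple]) -> Dict[str, int]:
    """Total degree (in-degree + out-degree) per entity."""
    deg: Dict[str, int] = defaultdict(int)
    for head, _rel, tail in edges:
        deg[head] += 1  # contributes to out-degree of head
        deg[tail] += 1  # contributes to in-degree of tail
    return dict(deg)


def sample_long_tail(edges: List[Triple], rng: random.Random, k: int = 1) -> List[str]:
    """Sample entities with probability inversely proportional to degree (assumed scheme)."""
    deg = degree_counts(edges)
    nodes = sorted(deg)
    weights = [1.0 / deg[n] for n in nodes]
    return rng.choices(nodes, weights=weights, k=k)


def walk_path(
    edges: List[Triple], start: str, depth: int, rng: random.Random
) -> Optional[List[Triple]]:
    """Random simple path of exactly `depth` hops, or None if the walk dead-ends."""
    out = defaultdict(list)
    for head, rel, tail in edges:
        out[head].append((rel, tail))
    path, node, seen = [], start, {start}
    for _ in range(depth):
        options = [(r, t) for r, t in out[node] if t not in seen]
        if not options:
            return None
        rel, tail = rng.choice(options)
        path.append((node, rel, tail))
        seen.add(tail)
        node = tail
    return path
```

The tail of the last triple in the returned path plays the role of the answer entity. Turning the path constraints into a natural-language question and verifying consistency with other models are separate steps not shown here.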

Phase 2 performs online multi-hop factual expansion. Starting from a seed QA pair, the system uses four tools: general Search, Visit, Medical Professional Search, and LLM Check. The question is progressively deepened by adding real online facts until two conditions are met: the number of reasoning hops exceeds 10, and LLM Check indicates that the item can no longer be directly solved from parametric memory. This step is central to the paper’s claim that the dataset supervises retrieval necessity rather than only final correctness (Lin et al., 14 Apr 2026).
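The stopping rule of Phase 2 can be written as a small loop. In this sketch `add_fact` stands in for the Search, Visit, and Medical Professional Search tools, and `llm_check` returns `True` when a model can still answer the item from parametric memory. Both callables, the dictionary layout, and `max_rounds` are hypothetical placeholders.

```python
from typing import Callable, Dict, Optional

Item = Dict[str, object]  # expects at least the key "hops" (int)


def expand_question(
    seed: Item,
    add_fact: Callable[[Item], Item],
    llm_check: Callable[[Item], bool],
    min_hops: int = 10,
    max_rounds: int = 30,
) -> Optional[Item]:
    """Deepen a seed item until it has more than `min_hops` hops AND fails the memory check."""
    item = dict(seed)
    for _ in range(max_rounds):
        deep_enough = int(item["hops"]) > min_hops
        solvable_from_memory = llm_check(item)
        if deep_enough and not solvable_from_memory:
            return item
        item = add_fact(item)  # append one more online fact, i.e. one more hop
    return None  # give up: both conditions were never met within the budget


# Toy stand-ins: each expansion adds one hop; the "model" can answer anything under 12 hops.
def toy_add_fact(item: Item) -> Item:
    return {**item, "hops": int(item["hops"]) + 1}


def toy_llm_check(item: Item) -> bool:
    return int(item["hops"]) < 12
```

With these toy functions a 5-hop seed is expanded until it reaches 12 hops, the first depth at which both conditions hold.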

Phase 3 introduces entity obfuscation to push difficulty toward BrowseComp-like levels. Temporal, location, person, medical, and numerical entities are rewritten through descriptive substitutions rather than direct lexical mentions. The process is iterative across four axes—replacement naturalness, coverage completeness, difficulty effectiveness, and answer uniqueness—and the paper requires a minimum of five rounds of self-iteration before a sample can proceed (Lin et al., 14 Apr 2026).
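One way to read the Phase 3 acceptance rule is as an iterate-and-score loop that cannot accept before the fifth round. The sketch below assumes numeric scores in [0, 1] on the four axes and a single pass threshold. Neither detail is given in the article, and `rewrite` and `score` are hypothetical callables that would be backed by language models.

```python
from typing import Callable, Dict, Optional, Tuple

AXES = ("naturalness", "coverage", "difficulty", "uniqueness")


def obfuscate(
    question: str,
    rewrite: Callable[[str, int], str],
    score: Callable[[str], Dict[str, float]],
    min_rounds: int = 5,
    max_rounds: int = 12,
    threshold: float = 0.8,  # assumed pass mark, not from the paper
) -> Tuple[Optional[str], int]:
    """Rewrite repeatedly; accept only after `min_rounds` and when every axis passes."""
    current = question
    for round_idx in range(1, max_rounds + 1):
        current = rewrite(current, round_idx)
        scores = score(current)
        passed = all(scores[axis] >= threshold for axis in AXES)
        if round_idx >= min_rounds and passed:
            return current, round_idx
    return None, max_rounds  # never satisfied all four axes
```

Even a rewrite that passes every axis on the first round is iterated until round five under this reading, which matches the stated minimum of five self-iteration rounds.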

Phase 4 enforces uniqueness and correctness through single-model multi-rollout verification, cross-model validation, and data recovery for transient API or parsing failures. The reported difficulty proxy is tool-call count: on a sampled subset of 500 synthesized questions, the average is 27.2 tool calls, compared with 32.3 for BrowseComp under Tongyi DeepResearch evaluation, with similar distributional shape (Lin et al., 14 Apr 2026).
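The verification logic of Phase 4 can be approximated with a majority-vote filter. The sketch below is an assumption-laden simplification: it treats `None` as a transient API or parsing failure, requires each model's majority answer to match the reference, and uses an arbitrary agreement threshold. The paper's actual acceptance criteria are not specified in this article.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def majority_answer(answers: List[Optional[str]]) -> Tuple[Optional[str], float]:
    """Most common non-failed answer and its share of ALL rollouts (failures included)."""
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return None, 0.0
    answer, n = counts.most_common(1)[0]
    return answer, n / len(answers)


def needs_recovery(answers: List[Optional[str]]) -> bool:
    """True if any rollout failed transiently and should be retried before judging."""
    return any(a is None for a in answers)


def keep_sample(
    gold: str,
    rollouts_by_model: Dict[str, List[Optional[str]]],
    min_agreement: float = 0.5,  # assumed threshold
) -> bool:
    """Keep an item only if every model's majority answer equals the reference answer."""
    for answers in rollouts_by_model.values():
        answer, share = majority_answer(answers)
        if answer != gold or share < min_agreement:
            return False
    return True
```

In a fuller pipeline, items flagged by `needs_recovery` would be re-run rather than discarded, which is the role the article assigns to data recovery.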

Post-training follows the data pipeline with a two-stage regimen. First, supervised fine-tuning is split into a short-trajectory phase of at most 32K tokens and a long-trajectory phase of 32K to 128K tokens, with 20% short trajectories mixed into the second phase to preserve short-task stability. Observation spans are masked from gradient updates, and open-source general deep-search trajectories are mixed in to preserve general capability. Second, the model is optimized with RLVR using GRPO-style training and a deliberately strict reward designed to reduce reward hacking:

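$R(q,\mathcal{H}_T)=\begin{cases}1, & \text{if the final answer } y=g(q,\mathcal{H}_T) \text{ is correct},\\ 0, & \text{otherwise.}\end{cases}$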

This design implements the paper’s principle that no reward should be assigned unless the answer is correct. RL data are restricted to boundary samples of moderate difficulty, and training is supported by asynchronous rollout execution and asynchronous reward computation to reduce long-trajectory synchronization bottlenecks (Lin et al., 14 Apr 2026).
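Two pieces of the training recipe described above lend themselves to a short sketch: masking observation tokens out of the SFT loss, and turning a correctness-only reward into group-relative advantages in the style of GRPO. The segment format, the exact-match correctness test, and all function names are assumptions for illustration. The article does not say how correctness is judged, and the standard mean and standard deviation normalization shown here is the common GRPO formulation rather than a confirmed detail of this paper.

```python
import statistics
from typing import List, Optional, Tuple


def observation_loss_mask(segments: List[Tuple[str, int]]) -> List[int]:
    """Per-token loss mask: 0 for tool observations, 1 for model-generated tokens.

    `segments` is a list of (role, token_count) pairs in trajectory order.
    """
    mask: List[int] = []
    for role, n_tokens in segments:
        mask.extend([0 if role == "observation" else 1] * n_tokens)
    return mask


def strict_reward(predicted: Optional[str], gold: str) -> float:
    """Reward 1 only for a correct final answer (exact match is a stand-in judge)."""
    if predicted is None:
        return 0.0
    return 1.0 if predicted.strip() == gold.strip() else 0.0


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of rollouts for the same question."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

With a binary reward, a group in which every rollout fails (or every rollout succeeds) yields all-zero advantages and therefore no learning signal. This is one plausible reason to restrict RL data to questions of moderate difficulty, where rollouts within a group disagree.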

4. Benchmark design and empirical performance

Evaluation is centered on the QuarkMedSearch Benchmark, a manually verified benchmark for Chinese medical long-horizon deep search. Candidate questions are sourced both from medically relevant high-difficulty items in open-source benchmarks such as BrowseComp and HLE and from the paper’s own synthesis pipeline. Contamination control is performed at the entity level: no key benchmark entity appears among the core entities used in SFT or RL training data. Medical experts review each candidate for linguistic clarity, question difficulty, and ground-truth correctness and uniqueness (Lin et al., 14 Apr 2026).
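Entity-level contamination control amounts to a set-membership test between benchmark entities and training entities. The following sketch assumes that key entities have already been extracted for each item and that simple case and whitespace normalization is sufficient. Real entity matching, especially for Chinese medical terms with aliases, would need a more careful normalizer.

```python
from typing import Iterable, List, Set


def normalize(entity: str) -> str:
    """Lowercase and collapse whitespace (a deliberately simple normalizer)."""
    return " ".join(entity.lower().split())


def contaminated_entities(
    benchmark_entities: Iterable[str], training_entities: Iterable[str]
) -> List[str]:
    """Benchmark key entities that also occur among training core entities."""
    train: Set[str] = {normalize(e) for e in training_entities}
    return [e for e in benchmark_entities if normalize(e) in train]


def is_clean(benchmark_entities: Iterable[str], training_entities: Iterable[str]) -> bool:
    """A benchmark item passes only if none of its key entities was seen in training."""
    return not contaminated_entities(benchmark_entities, training_entities)
```

A candidate question would be dropped from the benchmark whenever `is_clean` returns `False` for its key entities against the combined SFT and RL entity set.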

The final benchmark contains 140 human-verified questions spanning six categories: Biomedical Fundamentals, Drugs and Medical Products, Medical Research and Knowledge, Diseases and Clinical Manifestations, Clinical Procedures, and Medical Institutions. The main metric is Avg@3, under which each question is sampled three times and the scores are averaged. Auxiliary analyses use termination rate and average tool calls on correctly answered questions (Lin et al., 14 Apr 2026).
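Avg@3 as described here is a mean over questions of the mean score across three samples. The sketch below assumes each sampled run receives a binary score (1 for correct, 0 for incorrect) and reports the result as a percentage. The function name is introduced for this example.

```python
from typing import List, Sequence


def avg_at_k(scores_per_question: Sequence[Sequence[float]], k: int = 3) -> float:
    """Average the k per-sample scores of each question, then average over questions (in %)."""
    per_question: List[float] = []
    for scores in scores_per_question:
        if len(scores) != k:
            raise ValueError(f"expected {k} samples per question, got {len(scores)}")
        per_question.append(sum(scores) / k)
    return 100.0 * sum(per_question) / len(per_question)
```

Under the binary-score assumption, a benchmark of 140 questions yields 420 scored runs, so a single run changes the score by 100/420, or about 0.24 points. For instance, 234 correct runs out of 420 would give 55.71.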

The principal quantitative results are summarized below.

Benchmark	Tongyi DeepResearch 30B-A3B (backbone)	QuarkMedSearch
QuarkMedSearch Benchmark	40.71	55.71
BrowseComp-EN	42.67	47.03
BrowseComp-ZH	45.40	57.55
Xbench DeepSearch	74.0	81.0

On its own benchmark, QuarkMedSearch improves from 40.71 to 55.71 over Tongyi DeepResearch 30B-A3B, a gain of 15.00 points. The paper characterizes this as state-of-the-art among open-source models of comparable scale and notes parity with Seed1.8 at 55.71 while approaching Kimi-K2.5 at 56.74, DeepSeek-V3.2 at 57.14, and GLM-5 at 58.57 (Lin et al., 14 Apr 2026).

Ablation results attribute part of the gain to long-context supervision. The two-phase SFT path raises the medical benchmark from 40.71 to 49.99 after the 32K stage and to 52.21 after the 128K stage, while also improving BrowseComp-EN and BrowseComp-ZH. RL then adds further gains, including a rise in BrowseComp-ZH termination rate from 0.676 to 0.725 and a drop in average tool calls on correct examples from 28.09 to 24.01; on BrowseComp-EN the corresponding changes are 0.580 to 0.635 and 40.03 to 37.12. The paper interprets this as evidence that RL improves both success and search efficiency (Lin et al., 14 Apr 2026).
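The two auxiliary metrics in this ablation can be computed from per-run logs as follows. The `Run` record and the reading of "termination rate" as the share of runs that end with an explicit answer within budget are assumptions. The article does not define the term precisely.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Run:
    terminated: bool  # the agent emitted a final answer within its budget
    correct: bool     # the final answer was judged correct
    tool_calls: int   # number of tool invocations in the trajectory


def termination_rate(runs: Sequence[Run]) -> float:
    """Fraction of runs that ended with an explicit answer."""
    return sum(r.terminated for r in runs) / len(runs)


def avg_tool_calls_on_correct(runs: Sequence[Run]) -> float:
    """Mean tool-call count over correctly answered runs only (NaN if there are none)."""
    calls = [r.tool_calls for r in runs if r.correct]
    return sum(calls) / len(calls) if calls else math.nan
```

Restricting the tool-call average to correct runs keeps the efficiency metric from being distorted by failed trajectories that exhaust the budget.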

One especially notable result concerns context management. When context overflow is handled by a “Discard-all” strategy that clears the interaction history except for the original question and minimal task description, QuarkMedSearch rises from 47.03 to 57.61 on BrowseComp-EN and from 57.55 to 67.13 on BrowseComp-ZH. This suggests that many failures under long-horizon search are attributable to poor retrieval trajectories rather than to lack of domain knowledge, and that restart-under-budget can rescue such failures (Lin et al., 14 Apr 2026).
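A minimal sketch of the "Discard-all" policy follows: when the assembled context would exceed the token budget, everything except the task description and the original question is dropped and the agent effectively restarts. The function name, the message representation as plain strings, and the whitespace token counter used in the example are illustrative assumptions.

```python
from typing import Callable, List, Tuple


def discard_all_on_overflow(
    task_description: str,
    question: str,
    history: List[str],
    count_tokens: Callable[[str], int],
    budget: int,
) -> Tuple[List[str], List[str]]:
    """Return (context to send to the model, history to keep).

    If the full context fits the budget it is returned unchanged. Otherwise the
    whole interaction history is cleared and only the two base messages remain.
    """
    base = [task_description, question]
    full = base + history
    if sum(count_tokens(m) for m in full) > budget:
        return base, []          # overflow: restart from the original question
    return full, list(history)   # no overflow: keep accumulating


def whitespace_tokens(text: str) -> int:
    """Crude token counter used only for the example."""
    return len(text.split())
```

Because the restart discards an unproductive trajectory along with its tokens, the agent gets a fresh attempt within the same overall budget, which is consistent with the interpretation that many failures stem from poor search paths.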

5. Position within medical and Quark search research

Within medical search research, QuarkMedSearch represents a different design lineage from systems such as CupQ. CupQ is a clinically oriented literature search engine built on PubMed/MEDLINE data, with lexical retrieval followed by title-focused semantic reranking based on Word2Vec cosine similarity, journal impact factor, and publication date, and with result presentation organized into reviews, guidelines, and studies (Wang et al., 2019). QuarkMedSearch, by contrast, is an agentic search system defined over live web and specialized medical tools, trained on long trajectories, and evaluated on tasks designed to require retrieval, planning, and reflection rather than on title-centric document ranking alone (Lin et al., 14 Apr 2026). This suggests a shift from curated corpus ranking toward sequential evidence-seeking in open environments.

Within the broader Quark ecosystem, a distinct but relevant line is “From Item-Only to Query-Item: Query-Conditioned Generative Search with QGS in Quark” (Song et al., 25 May 2026). QGS is not presented as a medical-domain system; it addresses the ranking module of Quark Search and introduces query-conditioned next-item prediction, a Linear HSTU encoder that, as its name indicates, reduces per-layer complexity from quadratic to linear in sequence length, and HFG-Attention for heterogeneous engineered features. It is deployed in Quark Search and reports statistically significant online gains of +0.62% CTR, +0.38% Click-Search Ratio, and +3.55% PV Duration (Song et al., 25 May 2026). A plausible implication is that QuarkMedSearch and QGS occupy complementary layers of a search stack: QuarkMedSearch targets long-horizon domain exploration and evidence synthesis, whereas QGS targets low-latency industrial ranking under query-conditioned behavior.

This distinction is methodologically important. QuarkMedSearch treats search as a trajectory problem with explicit planning and reflection. CupQ treats search primarily as clinically tuned reranking over a curated biomedical corpus. QGS treats search primarily as query-conditioned ranking over user interaction sequences. Together, these systems delineate three different technical responses to medical or search-specific relevance: curated retrieval and reranking, trajectory-level agentic exploration, and industrial query-conditioned ranking.

6. Limitations, risks, and significance

The paper states or implies several limitations. QuarkMedSearch is focused on the Chinese medical domain, and transferability to other languages or healthcare ecosystems is not established. The exact composition and scale of the synthesized dataset are not fully disclosed: the main text states only “tens of thousands” of tasks and omits exact per-phase counts, source composition, and some low-level heuristics. The benchmark contains only 140 questions; it is carefully curated but small relative to the breadth of medical search. Tool-call count is used as a difficulty proxy, but it is not a direct measure of reasoning depth (Lin et al., 14 Apr 2026).

The safety profile is also explicitly bounded. Because the system operates in medicine, the paper highlights risks of hallucinated medical facts, over-trust in non-authoritative web sources, failure to detect ambiguity, brittle reasoning under live-web shifts, and unsafe deployment if the model is treated as an autonomous clinical decision-maker. The work emphasizes source authority, factual rigor, answer uniqueness, and expert verification, but it does not claim clinical-grade deployment safety; it is positioned as a research and search-assistance system rather than as an autonomous medical decision-maker (Lin et al., 14 Apr 2026).

Its significance therefore lies less in proposing a new backbone than in establishing a vertical-domain recipe. The paper’s core claim is that medical agentic capability depends on specialized data construction, long-horizon trajectory supervision, carefully designed RL rewards, and domain-specific evaluation. The reported outcome—that a 30B-A3B model can improve from 40.71 to 55.71 on an expert-built medical benchmark while also improving on general deep-search benchmarks—supports a broader thesis: domain specialization in agentic systems can be achieved effectively by specializing the entire pipeline, not only the model weights (Lin et al., 14 Apr 2026).