DiscoVerse: AI-Driven Pharma & Robotics

Updated 30 November 2025

DiscoVerse is a dual-faceted initiative combining an AI co-scientist for pharmaceutical reverse translation with robotic manipulation benchmarks.
It employs structured multi-agent workflows and advanced semantic retrieval to achieve traceable reasoning, source-linked synthesis, and auditability.
DiscoVerse-L and the DISCOVERSE simulator support high-fidelity, long-horizon robotic tasks with robust metrics for sim-to-real transfer evaluation.

DiscoVerse is a term denoting two distinct but conceptually aligned research initiatives at the intersection of large-scale knowledge synthesis, robotics simulation, and AI-driven scientific discovery. The term arises in the context of (1) a domain-specialized, multi-agent artificial co-scientist for pharmaceutical R&D leveraging heterogeneous archival data, and (2) a family of long-horizon robotic manipulation benchmarks and simulators designed for evaluating vision-language-action (VLA) models. DiscoVerse, DiscoVerse-L, and the DISCOVERSE simulator constitute a body of research systems with a focus on traceable reasoning, persistent memory, complex state tracking, and high-fidelity simulation.

1. Domain-Specialized Multi-Agent Co-Scientist for Pharmaceutical R&D

DiscoVerse is a multi-agent, LLM-driven “co-scientist” developed to facilitate decision support and reverse translation in industrial pharmaceutical research. It specifically addresses the challenge of leveraging highly heterogeneous, longitudinal archives spanning over four decades, which include clinical and preclinical reports, meeting minutes, and molecule-referenced artifacts (Zheng et al., 23 Nov 2025).

The primary objectives are twofold: (1) to operationalize reverse translation—tracing clinical outcomes back to animal and in vitro findings, particularly from discontinued drug programs, and (2) to generate traceable, source-linked decision syntheses that preserve institutional memory and meet regulatory requirements. The system achieves this through semantic retrieval, cross-document evidence linking, and schema-driven structured output, all orchestrated by a modular agent architecture.

2. Multi-Agent System Architecture and Workflow

DiscoVerse implements a collection of specialized agents, each handling a class of operations in the evidence search, filtering, synthesis, and audit chain (Zheng et al., 23 Nov 2025). The key roles are as follows:

Classification & Decomposition Agent: Parses incoming user queries, performs domain (preclinical, clinical, strategic) and question-type classification, and generates sub-queries following schema-based rules.
Search Agent: Conducts hybrid retrieval over a chunked document database, combining semantic (e5-large-instruct), late-interaction (BGE-M3), and exact (BM25) search components.
Review Agent: Applies LLM-based reranking to prune non-relevant chunks, operating with a hard threshold (e.g., score ≥ 0.7).
Research Agents (Domain-Specialized): Extract evidence using LLM pipelines for animal (preclinical), human (clinical), or strategic/portfolio dimensions, maintaining provenance metadata.
Supervisor Agent: Coordinates aggregation and merging of findings, ensuring consistent treatment of missing or ambiguous branches.
Taxonomy Agent: Maps output into structured schemas built from question-type templates co-designed with scientists, enforcing output regularity and full citation for each answer component.

Agents coordinate via structured messaging. The Supervisor commits each output field to a single source of truth while preserving the complete evidence lineage behind it.

The pipeline proceeds as: User Query → Classification & Decomposition → Search → Review → Research → Supervisor → Taxonomy → User, with human-in-the-loop intervention points defined at schema design, midstream evidence review, and final output adjudication.
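The agent chain above can be pictured as a sequence of functions passing along a shared state. The following is a minimal sketch under stated assumptions, not the authors' implementation: every class and function name is hypothetical, the agents are simple stubs standing in for LLM calls and hybrid retrieval, and only the 0.7 review threshold is taken from the text. Human-in-the-loop checkpoints are not modeled.

```python
from dataclasses import dataclass, field

REVIEW_THRESHOLD = 0.7  # example hard threshold quoted for the Review Agent


@dataclass
class Chunk:
    text: str
    source: str   # provenance: originating document
    score: float  # relevance score assumed to come from an LLM reranker


@dataclass
class State:
    query: str
    domain: str = ""
    sub_queries: list = field(default_factory=list)
    chunks: list = field(default_factory=list)
    findings: list = field(default_factory=list)
    answer: dict = field(default_factory=dict)


def classify_and_decompose(state):
    # Stub: a real agent would classify with an LLM and schema-based rules.
    state.domain = "preclinical" if "animal" in state.query.lower() else "clinical"
    state.sub_queries = [state.query]
    return state


def search(state, corpus):
    # Stub for hybrid retrieval: keep chunks sharing any word with a sub-query.
    words = {w for q in state.sub_queries for w in q.lower().split()}
    state.chunks = [c for c in corpus if words & set(c.text.lower().split())]
    return state


def review(state):
    # Prune chunks that fall below the hard relevance threshold.
    state.chunks = [c for c in state.chunks if c.score >= REVIEW_THRESHOLD]
    return state


def research(state):
    # Every finding keeps a pointer back to its source document.
    state.findings = [{"claim": c.text, "source": c.source} for c in state.chunks]
    return state


def supervise(state):
    # Merge duplicate claims so each one is committed exactly once.
    seen, merged = set(), []
    for finding in state.findings:
        if finding["claim"] not in seen:
            seen.add(finding["claim"])
            merged.append(finding)
    state.findings = merged
    return state


def to_taxonomy(state):
    # Emit a fixed schema; an empty branch is reported explicitly, not dropped.
    state.answer = {
        "domain": state.domain,
        "evidence": state.findings or "no evidence found",
    }
    return state


def run_pipeline(query, corpus):
    state = classify_and_decompose(State(query=query))
    state = search(state, corpus)
    for step in (review, research, supervise, to_taxonomy):
        state = step(state)
    return state.answer
```

Calling `run_pipeline("animal liver toxicity", corpus)` on a small list of `Chunk` objects returns a dictionary whose evidence entries each carry their `source`, which is the property the Supervisor and Taxonomy stages are described as enforcing.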

3. Core Technologies: Semantic Retrieval, Provenance, and Synthesis

DiscoVerse leverages advanced semantic retrieval to overcome the limitations of keyword matching over noisy, multimodal, and fragmented archives (Zheng et al., 23 Nov 2025). Key technical components include:

olmOCR: Vision-LLM-based OCR preserving table and formula structures.
Section-aware chunking: 512-word chunks with a 64-word overlap, which retains context across chunk boundaries.
Hybrid embedding store: Dense (e5) and late-interaction (BGE-M3) embeddings are indexed for chunk retrieval and merged with BM25 candidates under a dual-threshold regime (e.g., semantic score ≥ 0.7, ColBERT-style late-interaction score ≥ 0.5).
Provenance tracking: Each evidence chunk carries source document, molecule ID, and experimental context, enabling both table-level synthesis (e.g., cross-species toxicity matrices) and field-level citation in structured outputs.
Auditable synthesis: All summary or schema entries include explicit source back-pointers, satisfying auditability and regulatory standards.

These mechanisms support not only high retrieval recall (0.9864 or higher on the scored benchmark queries) but also traceability and scientific defensibility of synthesized answers.
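The chunking and candidate-merging steps can be sketched in a few lines. This is an illustrative sketch, not the system's code: the function names and record fields are hypothetical, section boundaries are assumed to be supplied by an upstream parser, and the rule that a chunk is kept when it passes either score threshold or is a BM25 hit is an assumption, since the source does not state how the two thresholds combine.

```python
def chunk_words(text, size=512, overlap=64):
    """Split text into windows of `size` words that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


def chunk_document(doc_id, sections, size=512, overlap=64):
    """Chunk each (title, text) section separately so no chunk spans two sections."""
    records = []
    for title, text in sections:
        for index, chunk in enumerate(chunk_words(text, size, overlap)):
            # Provenance fields travel with every chunk.
            records.append(
                {"doc_id": doc_id, "section": title, "index": index, "text": chunk}
            )
    return records


def merge_candidates(dense_scores, late_scores, bm25_hits,
                     dense_min=0.7, late_min=0.5):
    """Union of chunk ids passing either score threshold, plus BM25 hits.

    Assumption: thresholds are combined with OR; the source does not specify.
    """
    keep = {cid for cid, s in dense_scores.items() if s >= dense_min}
    keep |= {cid for cid, s in late_scores.items() if s >= late_min}
    keep |= set(bm25_hits)
    return sorted(keep)
```

With the default settings, a 1000-word section yields three chunks starting at words 0, 448, and 896, and consecutive chunks share exactly 64 words.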

4. Evaluation, Benchmarking, and Results

DiscoVerse was evaluated on a large confidential Roche repository (15,762 PDFs, 872 million BPE tokens, 180 molecules, spanning >40 years) using nine co-designed benchmark queries (Q1–Q9) (Zheng et al., 23 Nov 2025). Evaluation protocols included:

Quantitative metrics: Recall, precision, specificity, accuracy, and F1-score were computed for the seven scored queries. Recall ranged from 0.9864 to 1.0000 and precision from 0.7142 to 0.9078.
Blinded expert adjudication: Domain experts classified each source-linked output as a true positive, false positive, false negative, or true negative.
Case analysis: Discontinuation rationale and cross-phase toxicity syntheses were validated on questions requiring aggregation and cross-referencing of both preclinical and clinical findings.

Results indicated near-perfect recall, moderate precision (attributable to contextual ambiguities rather than hallucinations), and faithful, source-linked synthesis for complex, multi-factorial scientific questions.
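The five reported metrics all follow from the four adjudicated counts. The sketch below uses the standard definitions; the function name is hypothetical, and any counts passed to it in an example are invented for illustration and are not taken from the Roche evaluation.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard confusion-matrix metrics; undefined ratios return NaN."""

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    return {
        "recall": ratio(tp, tp + fn),            # share of relevant items found
        "precision": ratio(tp, tp + fp),         # share of returned items that are relevant
        "specificity": ratio(tn, tn + fp),       # share of irrelevant items rejected
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),   # harmonic mean of precision and recall
    }
```

For example, `classification_metrics(tp=90, fp=10, fn=0, tn=40)` gives a recall of 1.0 and a precision of 0.9. False positives lower precision without touching recall, which is the pattern the evaluation describes.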

5. The DiscoVerse-L Manipulation Benchmark and DISCOVERSE Simulator

DiscoVerse-L is a robotic manipulation benchmark composed of three long-horizon, multi-stage assembly tasks—Block Bridge (74 stages), Stack (18 stages), and Jujube-Cup (19 stages)—implemented in the high-fidelity DISCOVERSE simulator (Liu et al., 20 Nov 2025). These tasks are specified via natural language and automatically decomposed into sequential stages using video-driven LLM prompting. Each stage is annotated with a triplet: positive completion, negative failure, and hard-negative “near-miss” predicates.

The DISCOVERSE simulator (Jia et al., 29 Jul 2025) provides a unified 3D Gaussian Splatting (3DGS) and MuJoCo-based platform for photorealistic rendering and accurate physics, supporting multimodal sensing (RGB, depth, LiDAR, tactile, proprioceptive) and native ROS2 integration. Data pipelines enable hyper-realistic scene construction from RGB and laser scans, while simulation runs at high throughput (650 FPS, 5 cameras) with seamless interleaving of Python/PyTorch control.

DiscoVerse-L is designed to stress-test memory, stage-alignment, and persistent state tracking in VLA policies, notably supporting evaluation of stage hallucination, success rate, sample efficiency, and sim-to-real transfer performance.

6. Metrics, Evaluation, and Impact in Robotic Learning

Evaluation on DiscoVerse-L employs several metrics (Liu et al., 20 Nov 2025):

Stage-Aligned Reward (SAR): Dense progress signal based on CLIP-based image-text alignment, smoothed and filtered to advance stages only on sustained progress.
Hallucination Rate (HR): Among time steps whose alignment score $u_k(t)$ for stage $k$ exceeds the threshold $\theta$, the fraction at which the ground-truth completion flag $c_k(t)$ is 0; formalized as

$\mathrm{HR} = \frac{\mathbb{E}\left[1(u_k(t)>\theta \wedge c_k(t)=0)\right]}{\mathbb{E}\left[1(u_k(t)>\theta)\right]}, \quad \theta=0.7$

Success Rate (SR): Percentage of episodes completing all $K$ stages within 400 steps.
Sample Efficiency: Minimum environment steps to reach 50% SR.
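The metrics above reduce to short computations over logged rollouts. The following is a sketch under stated assumptions rather than the benchmark's evaluation code: all function names and input layouts are hypothetical, the expectation in HR is approximated by counting over a flat sequence of logged steps, and the exponential smoothing with a `patience` counter is only one plausible reading of "sustained progress" (the `alpha` and `patience` values are invented).

```python
def hallucination_rate(scores, completed, theta=0.7):
    """HR: among steps with score > theta, the share lacking ground-truth completion."""
    flagged = [c for u, c in zip(scores, completed) if u > theta]
    if not flagged:
        return 0.0  # assumption: no high-alignment steps means nothing to hallucinate
    return sum(1 for c in flagged if c == 0) / len(flagged)


def success_rate(episodes, num_stages, max_steps=400):
    """SR: percentage of episodes finishing all stages within the step budget.

    Each episode is a (stages_completed, steps_used) pair.
    """
    if not episodes:
        raise ValueError("at least one episode is required")
    wins = sum(
        1 for done, steps in episodes if done >= num_stages and steps <= max_steps
    )
    return 100.0 * wins / len(episodes)


def sample_efficiency(curve, target=50.0):
    """Fewest environment steps at which SR first reaches `target`, else None.

    `curve` is a list of (env_steps, sr_percent) checkpoints.
    """
    for env_steps, sr in sorted(curve):
        if sr >= target:
            return env_steps
    return None


def sustained_advance(scores, theta=0.7, alpha=0.5, patience=3):
    """First step index at which the smoothed score has exceeded theta
    for `patience` consecutive steps, else None."""
    smoothed, streak = None, 0
    for t, u in enumerate(scores):
        # Exponential moving average damps single-frame spikes.
        smoothed = u if smoothed is None else alpha * u + (1 - alpha) * smoothed
        streak = streak + 1 if smoothed > theta else 0
        if streak >= patience:
            return t
    return None
```

In this sketch a single high-scoring frame does not advance the stage, because the streak counter resets whenever the smoothed score drops back to or below `theta`.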

EvoVLA (Liu et al., 20 Nov 2025), trained and evaluated on DiscoVerse-L, achieves an average SR of 69.2%, which is 10.2 percentage points above the strongest baseline, and reduces HR from 38.5% to 14.8%. Real-world transfer to physical robots yields 54.6% average SR, surpassing baselines by 11–16.9 points, without real-world fine-tuning.

The combination of densely annotated tasks, ground-truth instrumentation, and high-fidelity simulation enables rigorous evaluation and advances the state of the art in long-horizon, memory-dependent robotic manipulation.

7. Broader Implications and Future Directions

DiscoVerse (co-scientist system) and DiscoVerse-L/DISCOVERSE (robotics benchmark and simulator) exemplify designs that prioritize transparent reasoning, evidence traceability, and system extensibility in AI-driven scientific discovery and embodied AI (Zheng et al., 23 Nov 2025, Liu et al., 20 Nov 2025, Jia et al., 29 Jul 2025). In pharmaceuticals, DiscoVerse provides a blueprint for regulated, auditable AI workflows embedding human-in-the-loop feedback and domain-aligned schemas. In robotics, DiscoVerse-L plus DISCOVERSE address the sim-to-real gap via photorealism, persistent memory, and dense annotation, supporting both RL and imitation learning pipelines with modular, open-source infrastructure.

A plausible implication is that these principles—provenance-centric synthesis, modular multi-agent architectures, and strongly benchmarked evaluation—will become foundational in future AI systems deployed in evidence-rich, high-stakes domains such as clinical research, materials discovery, and real-world robotics.