RhinoInsight: LLM Deep Research Engine

Updated 1 December 2025
  • RhinoInsight is a framework powered by large language models that integrates Verifiable Checklist and Evidence Audit modules to ensure robust, traceable research outputs.
  • It employs a five-component ReAct-style loop to coordinate multi-step reasoning, tool-mediated search, and structured report generation without modifying model parameters.
  • Empirical results demonstrate its state-of-the-art performance on deep research and search benchmarks, outperforming existing agents in diverse domains and languages.

RhinoInsight is a large-language-model-driven framework for deep research tasks requiring multi-step reasoning, tool-mediated search, and long-form report generation. It advances two orthogonal control mechanisms, a Verifiable Checklist and an Evidence Audit, that mitigate error propagation and context degradation, keeping agent outputs robust and traceable. RhinoInsight operates atop a five-component ReAct-style loop, controlling both the model's plan execution and the structuring of external evidence, without requiring parameter updates or architectural modification. Experiments show that RhinoInsight sets a new state of the art on benchmarks for deep reasoning and search, with competitive performance across diverse domains and languages (Lei et al., 24 Nov 2025).
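
For orientation, the sketch below shows a generic ReAct-style loop of the kind RhinoInsight builds on, with hook points for the two control modules. It is a minimal sketch under assumed interfaces: the objects `llm`, `tools`, `checklist`, and `audit` and their methods are illustrative stand-ins, not the framework's API.

```python
# Generic ReAct-style loop with hook points for the two control modules.
# All object and method names here are illustrative assumptions.
def deep_research(llm, tools, checklist, audit, query, max_steps=20):
    outline = checklist.plan(query)               # Verifiable Checklist constrains planning
    evidence = []
    for _ in range(max_steps):
        thought, action, args = llm.reason(query, outline, evidence)  # reason
        if action == "finish":
            break
        observation = tools[action](**args)       # act: tool-mediated search
        evidence += audit.structure(observation)  # Evidence Audit structures tool output
        outline = checklist.refine(outline, evidence)  # keep the plan verifiable
    return llm.write_report(outline, evidence)    # long-form report generation
```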

1. Formal Structure of the Verifiable Checklist Module

At initialization (t = 0), RhinoInsight receives a user query q and establishes an editable outline O_0 as well as a checklist C_0 of natural-language acceptance criteria. The checklist constrains planning, prevents non-executable goals, and enables traceable verification at every downstream step.

Checklist construction proceeds as follows:

  • Checklist Generation: C_0 = \{c_i^0\}_i \leftarrow \text{ChecklistGenerator}(q), where each c_i^0 is an acceptance test (e.g., "Define the scope of X").
  • Intent Refinement: Ambiguities in C_0 invoke Z_0 = \text{PlanIntents}(q, C_0, s_0), with s_0 being the empty initial state.
  • Critic Update: A critic (human or LLM) refines (C_0, Z_0) via C_1 = \text{Critic}(C_0, Z_0), possibly splitting, merging, or clarifying items and enforcing logical exclusivity and dependencies.
  • Hierarchical Outline Construction: The final checklist C_1 is organized into a tree-structured outline O_1 = \text{PlanOutline}(C_1); each outline node is anchored to a specific checklist item, supporting multiple levels of subgoals.

The process is formally captured as:

q \xrightarrow{\text{ChecklistGenerator}} C_0 \xrightarrow{\text{PlanIntents}} Z_0 \xrightarrow{\text{Critic}} C_1 \xrightarrow{\text{PlanOutline}} O_1

This structure prevents unconstrained planning and enforces that every unit of execution is linked to a verifiable acceptance test.
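
The four-operator chain can be read as composed LLM calls over a shared state. The sketch below is illustrative only: the `call_llm` helper and the prompt strings are assumptions, not the paper's implementation.

```python
# Checklist construction chain: q -> C_0 -> Z_0 -> C_1 -> O_1.
# `call_llm` and the prompt strings are illustrative assumptions.
def build_plan(q, call_llm):
    # ChecklistGenerator: acceptance tests c_i^0 for the query
    C0 = call_llm(f"List verifiable acceptance criteria for the research query: {q}")
    # PlanIntents: resolve ambiguities in C_0 (s_0 is the empty initial state)
    Z0 = call_llm(f"Given query {q!r} and criteria {C0}, state the refined intents.")
    # Critic: split, merge, or clarify items; enforce exclusivity and dependencies
    C1 = call_llm(f"Critique and revise criteria {C0} in light of intents {Z0}.")
    # PlanOutline: tree outline with each node anchored to one checklist item
    O1 = call_llm(f"Organize {C1} into a hierarchical outline, one criterion per node.")
    return C1, O1
```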

2. Algorithmic Description of the Evidence Audit Module

The Evidence Audit module converts unstructured tool outputs into a structured, lean, and verifiable memory E, ensuring that report claims are empirically anchored and hallucinations are minimized. Its pipeline consists of two stages.

Stage 1: Search → Memory → Outline Updates

  • For t = 1 \dots T:

    1. Generate search tasks Z_t from the outline O_t.
    2. Perform web searches: R_t = \text{Search}(Z_t).
    3. Normalize, structure, and summarize R_t into E_t (see the sketch after this list):

      E_{t} = E_{t-1} \cup \mathcal{P}\Bigl(\mathcal{S}\bigl(\mathcal{N}(R_{t})\bigr)\Bigr)

      where

      • \mathcal{N}: URL normalization and boilerplate removal,
      • \mathcal{S}: grouping by source, timestamp, confidence, and outline-node alignment,
      • \mathcal{P}: summarization and persistence into E_t.
    4. The outline O_t is refined based on supporting evidence, guaranteeing that every node n has at least k \geq 1 supporting evidence clusters.
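
A minimal sketch of one Stage 1 iteration follows. The operator bodies are assumptions intended only to make the \mathcal{N} → \mathcal{S} → \mathcal{P} composition concrete; in particular, `strip_boilerplate`, `align_to_outline`, and `summarize` are hypothetical helpers.

```python
# One Stage 1 iteration: E_t = E_{t-1} ∪ P(S(N(R_t))).
# strip_boilerplate, align_to_outline, and summarize are assumed helpers.
from urllib.parse import urldefrag

def N(results):
    """URL normalization and boilerplate removal, with de-duplication."""
    seen, clean = set(), []
    for r in results:
        url = urldefrag(r["url"])[0].rstrip("/")      # canonicalize the URL
        if url not in seen:
            seen.add(url)
            clean.append({"url": url, "timestamp": r["timestamp"],
                          "text": strip_boilerplate(r["html"])})
    return clean

def S(docs, outline):
    """Group documents by the outline node they support."""
    clusters = {}
    for d in docs:
        node = align_to_outline(d, outline)           # e.g., embedding similarity
        clusters.setdefault(node, []).append(d)
    return clusters

def P(clusters, E):
    """Summarize each cluster and persist it into the evidence memory E."""
    for node, docs in clusters.items():
        E.setdefault(node, []).append(summarize(docs))
    return E

# usage at step t:  E = P(S(N(R_t), outline), E)
```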

Stage 2: Drafting, Evidence Extraction, Critiquing, and Report Construction

  • Upon stopping (\sigma_t = 1 or t = T, at stopping step t^\star):
    • Draft node-aligned content: (\mathcal{T}, \mathcal{V}, \mathcal{C}) = \text{Draft}(O_t, L_t, M_t).
    • Extract citations relating outline labels L_t to evidence E_t.
    • For each node n \in O_{t^\star}:
      • Retrieve candidate evidence: C_n \leftarrow \text{Retrieve}(E, n).
      • Rank and prune with the critic: c^\star_n \leftarrow \text{RankCritic}(C_n).
    • Compose the final report R:

      \text{Compose}(O_{t^\star}, E) = \{\text{content}_n \leftarrow \text{RankCritic}(\text{Retrieve}(E, n))\}_{n \in O_{t^\star}}

    • Scoring function for evidence ranking (see the sketch after this list):

      s(e_{i}) = \alpha \cdot \text{rel}(n, e_{i}) + \beta \cdot \text{quality}(e_{i}) + \gamma \cdot \text{recency}(e_{i}) + \delta \cdot \text{consistency}(e_{i})

    • The most relevant, high-quality, consistent, and recent evidence is retained, with remaining items systematically pruned.
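
The weighted scoring rule and the Compose step translate directly into a short ranking routine. In the sketch below, the weights and the feature functions `rel`, `quality`, `recency`, `consistency`, and `retrieve` are assumptions (unit-interval scores); the paper does not fix their values here.

```python
# Evidence ranking: s(e) = α·rel(n, e) + β·quality(e) + γ·recency(e) + δ·consistency(e).
# Weights and feature functions are illustrative assumptions.
def score(e, node, weights=(0.4, 0.3, 0.15, 0.15)):
    alpha, beta, gamma, delta = weights
    return (alpha * rel(node, e) + beta * quality(e)
            + gamma * recency(e) + delta * consistency(e))

def rank_critic(candidates, node, k=3):
    """RankCritic: keep the top-k evidence items for a node, prune the rest."""
    return sorted(candidates, key=lambda e: score(e, node), reverse=True)[:k]

def compose(outline_nodes, E):
    """Compose: map every outline node to its critiqued evidence."""
    return {n: rank_critic(retrieve(E, n), n) for n in outline_nodes}
```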

Pseudocode Summary (rendered here as Python-style pseudocode; the named operators are those defined above):

```python
# RhinoInsight end-to-end loop. Plan, Search, RefineOutline, stop, Draft,
# Retrieve, RankCritic, and Write are the operators defined above;
# N, S, P are the Stage 1 audit operators.
E, O = set(), O_1                    # Initialize: empty evidence memory, initial outline
for t in range(1, T + 1):
    Z_t = Plan(O)                    # derive search tasks from the current outline
    R_t = Search(Z_t)                # tool-mediated web search
    E |= P(S(N(R_t)))                # normalize, structure, summarize, persist evidence
    O = RefineOutline(O, E)          # re-anchor outline nodes to the new evidence
    if stop(O, E) == 1:              # stopping criterion sigma_t
        break

T_draft, V, C = Draft(O, E)          # Draft: node-aligned content and citations
R = set()                            # ExtractCritic: per-node critiqued evidence
for n in O:
    C_n = Retrieve(E, n)             # candidate evidence for node n
    c_star_n = RankCritic(C_n)       # rank and prune with the critic
    R.add(c_star_n)
R = Write(T_draft, V, R, C)          # Write: compose the final report
```

3. Empirical Guarantees and Theoretical Considerations

RhinoInsight does not provide formal convergence proofs or worst-case complexity bounds for either the Verifiable Checklist or the Evidence Audit module. Its correctness and effectiveness are validated empirically through robustness and verifiability metrics rather than formal theorems; its guarantees are therefore statistical, grounded in observed improvements over existing baselines rather than hard deterministic constraints (Lei et al., 24 Nov 2025).

4. Benchmarking and Quantitative Performance

RhinoInsight has been evaluated on high-complexity deep research and search benchmarks:

  • Deep Research (T+V+C): DeepConsult (business/consulting objectives), DeepResearch Bench (RACE, 100 tasks across 22 domains).
  • Deep Search (T): HLE (Humanity’s Last Exam — text-only), BrowseComp-ZH (Chinese multi-hop browsing), GAIA (multi-step reasoning).
  • Model Backbone: Gemini-2.5-Pro; baselines include competitive LLM+search tools and proprietary deep research agents.

Results for DeepConsult and RACE demonstrate clear superiority, as shown below:

| Benchmark | Win (%) | Tie (%) | Lose (%) | Avg Score | Standing |
|---|---|---|---|---|---|
| DeepConsult | 68.51 | 11.02 | 20.47 | 6.82 | Top (all metrics) |
| RACE: Overall | | | | 50.92 | Top |
| RACE: Comprehensiveness | | | | 50.51 | Top |
| RACE: Insight | | | | 51.45 | Top |
| RACE: Instruction-Following | | | | 51.72 | Top |
| RACE: Readability | | | | 50.00 | Tied |

On deep search tasks:

  • HLE: Accuracy = 27.1
  • BrowseComp-ZH: Accuracy = 50.9
  • GAIA: Accuracy = 68.9

Pareto analysis of Depth vs. Verifiability places RhinoInsight at the frontier relative to all tested agent systems.

5. Component Ablations and Complementarity

Ablation studies validate the necessity and synergy of the control mechanisms:

| Configuration | DeepConsult Win (%) | Avg Score | GAIA Accuracy |
|---|---|---|---|
| No VCM/EAM | 0–18 | ≈3.65 | |
| VCM only | ≈30 | ≈5.3 | |
| EAM only | ≈31 | ≈5.45 | |
| VCM + EAM | 68.5 | 6.82 | 68.9 |

This demonstrates that the Verifiable Checklist Module (VCM) and the Evidence Audit Module (EAM) are mutually reinforcing: either module alone recovers only part of the gain, while their combination yields the full improvement.

6. Limitations and Prospective Developments

Several open areas are identified:

  • No adaptive policy (learned or heuristic) exists for trading off checklist enforcement strength against audit intensity; this suggests a plausible direction for future work on policy learning.
  • The evidence audit currently supports only text-based sources; integrating multimodal (image, table, code) evidence and rigorous provenance tracking remains a future research path.
  • Human-in-the-loop options exist for the critic but are not yet integrated into a tight expert/LLM hybrid; future work may explore mixed review strategies, especially for high-stakes domains.

In summary, RhinoInsight demonstrates that explicit, structured control of both model behavior (via the Verifiable Checklist) and information context (via the Evidence Audit) prevents error accumulation and context drift more efficiently than capacity scaling, achieving state-of-the-art deep research performance across multiple evaluation axes (Lei et al., 24 Nov 2025).
