RhinoInsight: LLM Deep Research Engine
- RhinoInsight is a framework powered by large language models that integrates Verifiable Checklist and Evidence Audit modules to ensure robust, traceable research outputs.
- It employs a five-component ReAct-style loop to coordinate multi-step reasoning, tool-mediated search, and structured report generation without modifying model parameters.
- Empirical results demonstrate its state-of-the-art performance on deep research and search benchmarks, outperforming existing agents in diverse domains and languages.
RhinoInsight is a large-language-model-driven framework for deep research tasks that require multi-step reasoning, tool-mediated search, and long-form report generation. Its core contributions are two orthogonal control mechanisms, a Verifiable Checklist and an Evidence Audit, which mitigate error propagation and context degradation and make the outputs of generative agents robust and traceable. RhinoInsight operates atop a five-component ReAct-style loop, controlling both the model's plan execution and the structuring of external evidence, without requiring parameter updates or architectural modification. Experiments show that RhinoInsight sets a new state of the art on deep research and deep search benchmarks across diverse domains and languages (Lei et al., 24 Nov 2025).
1. Formal Structure of the Verifiable Checklist Module
At initialization, RhinoInsight receives a user query $q$ and establishes an editable outline together with a checklist $\mathcal{C}$ of natural-language acceptance criteria. The checklist constrains planning, prevents non-executable goals, and enables traceable verification at every downstream step.
Checklist construction proceeds as follows:
- Checklist Generation: a checklist $\mathcal{C} = \{c_1, \dots, c_K\}$ is generated from $q$, where each $c_i$ is an acceptance test (e.g., "Define the scope of X").
- Intent Refinement: ambiguities in $q$ trigger an intent-refinement step, starting from an empty clarification state, so that underspecified requirements are resolved before checklist items are fixed.
- Critic Update: a critic (human or LLM) refines the checklist, $\mathcal{C} \leftarrow \mathrm{Critic}(\mathcal{C})$, possibly splitting, merging, or clarifying items and enforcing logical exclusivity and dependencies.
- Hierarchical Outline Construction: the final checklist is organized into a tree-structured outline $O_1$: each outline node is anchored to a specific checklist item, supporting multiple levels of subgoals.
The process can be summarized as the pipeline $q \rightarrow \mathcal{C} \rightarrow O_1$. This structure prevents unconstrained planning and enforces that every unit of execution is linked to a verifiable acceptance test.
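A minimal Python sketch of this construction is given below, assuming a generic `llm` callable that maps a prompt string to text; the prompts, data classes, and the flat one-level outline are illustrative assumptions, not details from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class ChecklistItem:
    text: str                       # natural-language acceptance test, e.g. "Define the scope of X"
    verified: bool = False          # flipped once the item's acceptance test passes

@dataclass
class OutlineNode:
    title: str
    checklist_item: ChecklistItem   # every outline node is anchored to one acceptance test
    children: list = field(default_factory=list)

def build_checklist(query: str, llm) -> list[ChecklistItem]:
    """Checklist generation: derive acceptance tests c_1..c_K from the user query q."""
    raw = llm(f"List acceptance criteria, one per line, for this research query:\n{query}")
    return [ChecklistItem(line.strip()) for line in raw.splitlines() if line.strip()]

def critic_update(items: list[ChecklistItem], llm) -> list[ChecklistItem]:
    """Critic pass: split, merge, or clarify items (modeled here as one rewrite call)."""
    raw = llm("Refine these acceptance criteria; remove overlap, make dependencies explicit:\n"
              + "\n".join(i.text for i in items))
    return [ChecklistItem(line.strip()) for line in raw.splitlines() if line.strip()]

def build_outline(query: str, items: list[ChecklistItem]) -> OutlineNode:
    """Hierarchical outline construction: anchor one (flat) node per checklist item."""
    root = OutlineNode(query, ChecklistItem("answer the user query"))
    root.children = [OutlineNode(item.text, item) for item in items]
    return root
```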
2. Algorithmic Description of the Evidence Audit Module
The Evidence Audit module converts unstructured tool outputs into a structured, lean, and verifiable memory $E$, ensuring that report claims are empirically anchored and that hallucinations are minimized. Its pipeline consists of two stages.
Stage 1: Search → Memory → Outline Updates
- For each step $t = 1, \dots, T$:
  - Generate search tasks $Z_t \leftarrow \mathrm{Plan}(O)$ from the current outline $O$.
  - Perform web searches: $R_t \leftarrow \mathrm{Search}(Z_t)$.
  - Normalize, structure, and summarize the results into the evidence memory $E$, i.e. $E \leftarrow E \cup \mathcal{P}(\mathcal{S}(\mathcal{N}(R_t)))$, where
    - $\mathcal{N}$: URL normalization and boilerplate removal,
    - $\mathcal{S}$: grouping by source, timestamp, confidence, and outline-node alignment,
    - $\mathcal{P}$: summarization and persistence into $E$.
  - The outline is refined against the accumulated evidence, $O \leftarrow \mathrm{RefineOutline}(O, E)$, guaranteeing that every node $n \in O$ has at least one supporting evidence cluster (a sketch of this pipeline follows the list).
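The following sketch illustrates one way the normalize/structure/persist operators ($\mathcal{N}$, $\mathcal{S}$, $\mathcal{P}$) could be realized; the dictionary fields, the domain-level grouping key, and the truncation-based default summarizer are assumptions chosen for illustration, not the paper's implementation.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit

def normalize(results: list[dict]) -> list[dict]:
    """N: canonicalize URLs (drop query/fragment) and discard results with no usable text."""
    cleaned = []
    for r in results:
        parts = urlsplit(r["url"])
        canonical = urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, "", ""))
        if r.get("text", "").strip():           # crude boilerplate filter
            cleaned.append({**r, "url": canonical})
    return cleaned

def structure(results: list[dict]) -> dict[str, list[dict]]:
    """S: group evidence by source domain; a fuller version would also key on
    timestamp, confidence, and the outline node each result supports."""
    groups = defaultdict(list)
    for r in results:
        groups[urlsplit(r["url"]).netloc].append(r)
    return groups

def persist(groups: dict[str, list[dict]], memory: list[dict], summarize) -> list[dict]:
    """P: summarize each cluster and append it to the evidence memory E."""
    for source, items in groups.items():
        memory.append({
            "source": source,
            "summary": summarize(" ".join(i.get("text", "") for i in items)),
            "items": items,
        })
    return memory

def audit(results: list[dict], memory: list[dict], summarize=lambda s: s[:500]) -> list[dict]:
    """One Stage-1 update: E <- E ∪ P(S(N(R_t)))."""
    return persist(structure(normalize(results)), memory, summarize)
```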
Stage 2: Drafting, Evidence Extraction, Critiquing, and Report Construction
- Upon stopping ($\mathrm{stop}(O, E) = 1$ or $t = T$):
  - Draft node-aligned content $(\mathcal{T}, \mathcal{V}, \mathcal{C}) \leftarrow \mathrm{Draft}(O, E)$.
  - Extract citations $\mathcal{C}$ relating outline labels to the evidence in $E$.
  - For each node $n \in O$:
    - Retrieve candidate evidence: $C_n \leftarrow \mathrm{Retrieve}(E, n)$.
    - Rank and prune with the critic: $c^{\star}_n \leftarrow \mathrm{RankCritic}(C_n)$.
  - Compose the final report: $R \leftarrow \mathrm{Write}(\mathcal{T}, \mathcal{V}, \{c^{\star}_n\}_{n \in O}, \mathcal{C})$.
Evidence ranking scores each candidate on relevance, quality, consistency, and recency: the highest-scoring evidence is retained and the remaining items are systematically pruned.
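The scoring function is described here only qualitatively, so the weighted sum below, including the weights and the exponential recency decay, is purely an assumed form that combines the four stated criteria.

```python
import math
import time

# Assumed weights; the exact scoring form is not published in this summary.
W_REL, W_QUAL, W_CONS, W_REC = 0.4, 0.2, 0.2, 0.2

def score(candidate: dict, now: float | None = None) -> float:
    """Combine relevance, quality, consistency, and recency into one ranking score."""
    now = now or time.time()
    age_days = max((now - candidate.get("timestamp", now)) / 86400.0, 0.0)
    recency = math.exp(-age_days / 365.0)       # decay over roughly a year
    return (W_REL * candidate.get("relevance", 0.0)
            + W_QUAL * candidate.get("quality", 0.0)
            + W_CONS * candidate.get("consistency", 0.0)
            + W_REC * recency)

def rank_and_prune(candidates: list[dict], keep: int = 1) -> list[dict]:
    """RankCritic stand-in: keep the top-scoring evidence, prune the rest."""
    return sorted(candidates, key=score, reverse=True)[:keep]
```

With this scoring in place, `rank_and_prune(candidates, keep=1)` plays the role of $\mathrm{RankCritic}$, keeping one top evidence item per outline node.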
Pseudocode Summary:
```
Initialize:  E ← ∅,  O ← O₁
for t = 1 to T do
    Z_t ← Plan(O)
    R_t ← Search(Z_t)
    E ← E ∪ P(S(N(R_t)))
    O ← RefineOutline(O, E)
    if stop(O, E) = 1 then break
end for

Draft:         (T, V, C) ← Draft(O, E)
ExtractCritic: R ← ∅
for each node n ∈ O do
    C_n ← Retrieve(E, n)
    c*_n ← RankCritic(C_n)
    R ← R ∪ {c*_n}
end for
Write:         R ← Write(T, V, R, C)
```
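For completeness, the same control flow can be written as a runnable, dependency-injected Python skeleton; every callable is a caller-supplied stand-in and the outline is assumed to be iterable over its nodes, so this mirrors only the loop structure, not RhinoInsight's released components.

```python
def rhinoinsight_loop(outline, plan, search, audit, refine_outline, stop,
                      draft, retrieve, rank_critic, write, max_steps=8):
    """Skeleton of the search -> memory -> outline loop followed by drafting,
    evidence extraction, critique, and report writing."""
    evidence = []                                        # E <- ∅
    for _ in range(max_steps):                           # for t = 1..T
        tasks = plan(outline)                            # Z_t <- Plan(O)
        results = search(tasks)                          # R_t <- Search(Z_t)
        evidence = audit(results, evidence)              # E <- E ∪ P(S(N(R_t)))
        outline = refine_outline(outline, evidence)      # O <- RefineOutline(O, E)
        if stop(outline, evidence):                      # early stopping criterion
            break
    t_part, v_part, c_part = draft(outline, evidence)    # (T, V, C) <- Draft(O, E)
    selected = []                                        # R <- ∅
    for node in outline:                                 # for each node n ∈ O
        candidates = retrieve(evidence, node)            # C_n <- Retrieve(E, n)
        selected.append(rank_critic(candidates))         # c*_n <- RankCritic(C_n)
    return write(t_part, v_part, selected, c_part)       # R <- Write(T, V, R, C)
```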
3. Empirical Guarantees and Theoretical Considerations
RhinoInsight does not provide formal convergence proofs or worst-case complexity bounds for either the Verifiable Checklist or the Evidence Audit module. Its correctness and effectiveness are validated empirically, through robustness and verifiability metrics, rather than through formal theorems: its guarantees are statistical, resting on observed improvements over existing baselines rather than hard deterministic constraints (Lei et al., 24 Nov 2025).
4. Benchmarking and Quantitative Performance
RhinoInsight has been evaluated on high-complexity deep research and search benchmarks:
- Deep Research (T+V+C): DeepConsult (business/consulting objectives), DeepResearch Bench (RACE, 100 tasks across 22 domains).
- Deep Search (T): HLE (Humanity’s Last Exam — text-only), BrowseComp-ZH (Chinese multi-hop browsing), GAIA (multi-step reasoning).
- Model Backbone: Gemini-2.5-Pro; baselines include competitive LLM+search tools and proprietary deep research agents.
Results for DeepConsult and RACE demonstrate clear superiority, as shown below:
| Benchmark | Win (%) | Tie (%) | Lose (%) | Avg Score | Standing |
|---|---|---|---|---|---|
| DeepConsult | 68.51 | 11.02 | 20.47 | 6.82 | Top (all metrics) |
| RACE: Overall | 50.92 | | | | Top |
| RACE: Comprehensiveness | 50.51 | | | | Top |
| RACE: Insight | 51.45 | | | | Top |
| RACE: Instruction Following | 51.72 | | | | Top |
| RACE: Readability | 50.00 | | | | Tied |
On deep search tasks:
- HLE: accuracy = 27.1%
- BrowseComp-ZH: accuracy = 50.9%
- GAIA: accuracy = 68.9%
Pareto analysis of Depth vs. Verifiability places RhinoInsight at the frontier relative to all tested agent systems.
5. Component Ablations and Complementarity
Ablation studies validate the necessity and synergy of the control mechanisms:
| Configuration | DeepConsult Win (%) | DeepConsult Avg Score | GAIA Accuracy (%) |
|---|---|---|---|
| No VCM/EAM | 0–18 | ≈3.65 | – |
| VCM only | ≈30 | ≈5.3 | – |
| EAM only | ≈31 | ≈5.45 | – |
| VCM + EAM | 68.5 | 6.82 | 68.9 |
These results indicate that the Verifiable Checklist Module (VCM) and the Evidence Audit Module (EAM) are complementary and mutually reinforcing: each alone recovers only part of the full system's gains.
6. Limitations and Prospective Developments
Several open areas are identified:
- No adaptive policy (learnable or heuristic) exists for trading off checklist enforcement strength against audit intensity; this suggests a plausible direction for future policy-learning work.
- The evidence audit currently supports only text-based sources; integrating multimodal (image, table, code) evidence and rigorous provenance tracking remains a future research path.
- Human-in-the-loop options for the critic are available but not yet integrated into an expert/LLM hybrid workflow; future work may focus on mixed review strategies, especially for high-stakes domains.
In summary, RhinoInsight demonstrates that explicit, structured control of both model behavior (via the Verifiable Checklist) and information context (via the Evidence Audit) prevents error accumulation and context drift more efficiently than capacity scaling, achieving state-of-the-art deep research performance across multiple evaluation axes (Lei et al., 24 Nov 2025).