LLM-based Search Engines
- LLM-based search engines are systems that combine traditional retrieval, ranking, and dialogue management with large language models to produce generative, conversational responses.
- Their architecture, exemplified by retrieval-augmented generation, streamlines information seeking: in comparative studies, it reduced search time by about 35% and increased query diversity.
- They enhance user experience with interactive dialogue and inline citations, though they may reduce manual sense-making essential for deep learning.
An LLM-based Search Engine (LLM-SE) integrates classical information retrieval (IR) infrastructure (indexing, term-based retrieval, ranking) with an LLM to produce generative, context-aware responses and conversational search experiences. Unlike traditional search engines, which return ranked document lists, LLM-SEs can synthesize information, interpret multi-turn user intents, and engage in interactive dialogue. This paradigm shift is reshaping information seeking, learning, and the architecture of search systems (Guan et al., 29 Nov 2025).
1. Definitions, Taxonomy, and Theoretical Foundations
An LLM-SE is a retrieval system that tightly couples core IR modules (retrieval, re-ranking, dialogue management) with an LLM for generative and conversational output. The architectural taxonomy is:
- Retrieval-Only System: Pure IR, e.g., BM25, learning-to-rank.
- Retrieval-Augmented Generation (RAG): Retrieve top-k documents, then prompt the LLM with both query and retrieved content.
- End-to-End Generative SE: LLM generates responses from its parametric knowledge without explicit retrieval.
- Hybrid Conversational SE: Blends RAG, retrieval-only, and dialogue components to handle clarifications, multi-turn context, and inline citation (Guan et al., 29 Nov 2025).
Formally, in RAG systems, answer generation is grounded on the query and the retrieved documents:

$$a \sim P\big(a \mid q,\, c,\, D_k\big)$$

where $q$ is the user query, $c$ the conversation context, and $D_k$ the retrieved document set.
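A minimal sketch of this grounding step, assuming generic `retrieve` and `llm_generate` callables (hypothetical stand-ins, not APIs from the cited work):

```python
from typing import Callable

def rag_answer(
    query: str,
    context: str,
    retrieve: Callable[[str, str, int], list[str]],  # hypothetical retriever: (query, context, k) -> top-k passages
    llm_generate: Callable[[str], str],              # hypothetical LLM client: prompt -> answer text
    k: int = 5,
) -> str:
    """Generate an answer a ~ P(a | q, c, D_k): prompt the LLM with query, context, and retrieved evidence."""
    docs = retrieve(query, context, k)
    evidence = "\n\n".join(f"[{i + 1}] {d}" for i, d in enumerate(docs))
    prompt = (
        f"Conversation so far:\n{context}\n\n"
        f"Retrieved passages:\n{evidence}\n\n"
        f"Question: {query}\n"
        "Answer using only the passages above, citing them as [n]."
    )
    return llm_generate(prompt)
```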
Ranking and retrieval can be hybridized:

$$s(d) = \lambda\, s_{\text{ret}}(d) + (1 - \lambda)\, s_{\text{gen}}(d)$$

where $s_{\text{ret}}(d)$ is the retriever score (e.g., BM25), $s_{\text{gen}}(d)$ measures generative relevance, and $\lambda \in [0, 1]$.
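The interpolation translates directly to code; the sketch below assumes both scores have already been normalized to comparable scales:

```python
def hybrid_score(s_ret: float, s_gen: float, lam: float = 0.5) -> float:
    """Combined relevance s(d) = lam * s_ret(d) + (1 - lam) * s_gen(d), with lam in [0, 1].
    Assumes retriever and generative scores are on comparable (e.g., min-max normalized) scales."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * s_ret + (1.0 - lam) * s_gen

# Example: rank candidates by the blended score (values are illustrative).
candidates = [("d1", 0.72, 0.81), ("d2", 0.55, 0.95)]  # (doc_id, normalized retriever score, generative relevance)
ranked = sorted(candidates, key=lambda t: hybrid_score(t[1], t[2], lam=0.3), reverse=True)
```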
2. System Architectures and Key Components
Canonical LLM-SE Architecture
```mermaid
flowchart LR
    Q[User Query] --> R[Retrieval Module]
    R --> C[Context Construction]
    C --> LLM[LLM Module]
    LLM --> UI[Conversational UI + References]
```
Components:
- Retrieval Module (R): Returns candidate documents for a query.
- LLM Module: Consumes query, context, and retrieved docs to generate answers.
- Dialogue Manager: Maintains state; manages clarifications and context.
- User Interface: Presents both AI-generated summaries and referenced document snippets.
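A skeletal wiring of these four components, reusing the hypothetical retriever/LLM callables from above; this is an illustrative composition, not the architecture of any particular production system:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class DialogueManager:
    """Maintains multi-turn state; here, simply a running transcript."""
    turns: list[str] = field(default_factory=list)

    def context(self) -> str:
        return "\n".join(self.turns)

    def record(self, role: str, text: str) -> None:
        self.turns.append(f"{role}: {text}")

@dataclass
class LLMSearchEngine:
    retrieve: Callable[[str, str, int], list[str]]  # hypothetical retrieval module
    llm_generate: Callable[[str], str]              # hypothetical LLM module
    dialogue: DialogueManager = field(default_factory=DialogueManager)

    def search(self, query: str, k: int = 5) -> dict:
        """Query -> Retrieval -> Context Construction -> LLM -> UI payload."""
        ctx = self.dialogue.context()
        docs = self.retrieve(query, ctx, k)
        evidence = "\n\n".join(f"[{i + 1}] {d}" for i, d in enumerate(docs))
        prompt = f"Context:\n{ctx}\n\nPassages:\n{evidence}\n\nQuestion: {query}\nCite passages as [n]."
        answer = self.llm_generate(prompt)
        self.dialogue.record("user", query)
        self.dialogue.record("assistant", answer)
        return {"answer": answer, "references": docs}  # UI pairs the summary with source snippets
```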
RAG systems operationalize this retrieval/generation fusion: retrieval first yields $D_k = \text{Retrieve}(q, c)$, after which the LLM generates $a \sim P(a \mid q, c, D_k)$, as formalized in Section 1.
3. Search and Learning Behavior: Experimental Insights
A within-subjects study compared Bing Copilot (an LLM-SE) with traditional Bing search on research/learning tasks (Guan et al., 29 Nov 2025):
- Search Efficiency: Copilot reduced total search time by ~35% (12 min vs. 18 min), with fewer but longer queries (mean length 8.5 vs. 3.2 tokens); average docs viewed per task dropped from 6.8 to 4.2.
- Exploration and Formulation: Copilot users exhibited 40% higher query-term entropy (more diverse prompts; see the entropy sketch after this list) and more focused sub-queries via iterative refinement of AI output.
- Information Collection: Copilot users highlighted and filtered AI summaries; Bing users engaged in deeper manual grouping (≈2× more copy/paste events).
- Affective Outcomes: Emotional satisfaction scores were higher with the LLM-SE (median 8 vs. 6).
- Learning: Perceived learning scores improved with Copilot (median 8 vs. 6); however, deeper learning (as per constructionist theory) still required manual synthesis activities, more common in traditional search (Guan et al., 29 Nov 2025).
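Query-term entropy admits a natural reading as Shannon entropy over the term distribution of a session's queries. The sketch below is one plausible formulation of the metric, not necessarily the paper's exact definition:

```python
import math
from collections import Counter

def query_term_entropy(queries: list[str]) -> float:
    """Shannon entropy (bits) of the term distribution across a session's queries.
    Higher values indicate more diverse vocabulary; one plausible reading of
    'query-term entropy', not necessarily the cited study's exact definition."""
    terms = [t for q in queries for t in q.lower().split()]
    total = len(terms)
    if total == 0:
        return 0.0
    counts = Counter(terms)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Example: a varied session yields higher entropy than a repetitive one.
print(query_term_entropy(["llm search engines", "rag citation quality"]))  # ~2.58 bits
print(query_term_entropy(["llm search", "llm search engines", "llm"]))     # ~1.46 bits
```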
4. Implications for User Experience and Cognitive Process
LLM-SEs alter the cognitive and operational dynamics of search:
- Reduced Cognitive Load: Generative suggestions streamline exploration and query formulation for novices, supporting higher-level or multi-step reasoning.
- Trade-off: Absent active sense-making (as in traditional search), users may bypass deep information "collection" and synthesis steps, risking weaker memory consolidation and learning retention.
- User Strategy Diversification: LLM-SEs facilitate conversational, natural language queries, but users may struggle to translate non-standard needs (e.g., precise geolocation) into effective prompts (Wazzan et al., 18 Jan 2024). Collection and adaptation strategies vary, with LLM-SE users often rephrasing rather than extending queries.
5. Design Recommendations and Best Practices
Empirical findings motivate several recommendations for effective LLM-SE deployment (Guan et al., 29 Nov 2025):
- Hybrid Frameworks: Expose both AI summaries and the underlying top-k documents in the UI; allow users to adjust $\lambda$ in the combined relevance function $s(d)$.
- Progressive Disclosure & Citation: Start with high-level synthesized responses, enable drill-down to sources and snippets, ensure inline citations to support trust.
- Prompt Engineering/Scaffolding: Provide template or system-suggested prompt expansions (a minimal sketch follows this list); encourage user annotation or critique to foster chain-of-thought reasoning.
- Learning & Sense-making Tools: Integrate note-taking, mind mapping, and explicit “construct modes” for manual grouping and reflection on generated content.
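One lightweight realization of such scaffolding is template-based expansion of a raw query into candidate prompts the user can pick or edit. The templates below are illustrative, not drawn from the cited study:

```python
# Hypothetical scaffold templates; the wording is illustrative.
SCAFFOLDS = {
    "compare":  "Compare {query} across at least two alternatives, citing sources.",
    "explain":  "Explain {query} step by step for a newcomer, with one concrete example.",
    "critique": "List the strongest objections to {query} and how each might be answered.",
}

def suggest_prompts(query: str) -> list[str]:
    """Expand a raw query into system-suggested prompt variants."""
    return [template.format(query=query) for template in SCAFFOLDS.values()]

for p in suggest_prompts("retrieval-augmented generation"):
    print("-", p)
```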
6. Evaluation Metrics and Comparisons
Behavioral Metrics:
- Query length, reformulation rate, browsing depth, document-view count, time per page (see the log-derived sketch after this list).
- Comprehension score.
- Critical Thinking rubric.
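A sketch of how the behavioral metrics might be derived from an interaction log; the log schema and the reformulation-rate definition are assumptions, not the study's instrumentation:

```python
from dataclasses import dataclass

@dataclass
class Event:
    """Hypothetical session-log entry; field names are illustrative."""
    kind: str          # "query" | "view"
    text: str = ""     # query text, if kind == "query"
    seconds: float = 0.0  # dwell time, if kind == "view"

def behavioral_metrics(log: list[Event]) -> dict:
    """Compute simple behavioral metrics from an interaction log."""
    queries = [e for e in log if e.kind == "query"]
    views = [e for e in log if e.kind == "view"]
    return {
        "mean_query_length": sum(len(q.text.split()) for q in queries) / max(len(queries), 1),
        # One common definition: share of queries that reformulate a prior one.
        "reformulation_rate": max(len(queries) - 1, 0) / max(len(queries), 1),
        "document_views": len(views),
        "mean_time_per_page": sum(v.seconds for v in views) / max(len(views), 1),
    }
```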
Subjective Metrics:
- 11-point Likert satisfaction, perceived learning, interest.
Statistical Analysis:
- Mann-Whitney U tests for between-condition effects.
- Cohen’s d effect sizes (Guan et al., 29 Nov 2025); a computation sketch follows.
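These tests map directly onto SciPy; the sketch below runs on illustrative data, not the study's:

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Illustrative per-participant satisfaction scores, NOT the study's data.
bing = np.array([5, 6, 6, 7, 5, 6])      # traditional search condition
copilot = np.array([7, 8, 8, 9, 7, 8])   # LLM-SE condition

u_stat, p_value = mannwhitneyu(copilot, bing, alternative="two-sided")

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Cohen's d using a pooled standard deviation."""
    pooled = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                     / (len(a) + len(b) - 2))
    return (a.mean() - b.mean()) / pooled

print(f"U = {u_stat:.1f}, p = {p_value:.4f}, d = {cohens_d(copilot, bing):.2f}")
```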
Findings:
| Metric | Bing (Traditional) | Copilot (LLM-SE) |
|---|---|---|
| Query length (mean) | 3.2 tokens | 8.5 tokens |
| Docs viewed per task | 6.8 | 4.2 |
| Satisfaction (median) | 6 | 8 |
| Perceived learning (median) | 6 | 8 |
| Total search time (mean) | 18 min | 12 min |
7. Future Directions and Open Challenges
- Hybrid Transparency: Blend retrieval transparency (doc lists, citations) with generative fluency.
- Active Sense-making: Incorporate scaffolds and interfaces that encourage knowledge construction, not just passive absorption.
- Adaptive Controls: Enable user control over the diversity and specificity of LLM-SE outputs through interface tunables.
- Iterative Evaluation: Deploy robust, multi-modal evaluation frameworks to link behavioral metrics (e.g., query formulation/diversity) with downstream cognitive/learning outcomes.
8. Conclusion
LLM-based Search Engines represent a pivotal evolution in information retrieval, efficiently combining traditional retrieval pipelines with conversational, generative outputs. While greatly improving initial exploration and perceived learning efficiency, they may reduce manual sense-making activities crucial for deep learning. The optimal design of LLM-SEs will blend generative strength with transparency, modularity, and user-driven scaffolding to support both rapid knowledge acquisition and durable understanding (Guan et al., 29 Nov 2025).