A Survey of Model Architectures in Information Retrieval (2502.14822v1)

Published 20 Feb 2025 in cs.IR

Abstract: This survey examines the evolution of model architectures in information retrieval (IR), focusing on two key aspects: backbone models for feature extraction and end-to-end system architectures for relevance estimation. The review intentionally separates architectural considerations from training methodologies to provide a focused analysis of structural innovations in IR systems. We trace the development from traditional term-based methods to modern neural approaches, particularly highlighting the impact of transformer-based models and subsequent LLMs. We conclude by discussing emerging challenges and future directions, including architectural optimizations for performance and scalability, handling of multimodal, multilingual data, and adaptation to novel application domains beyond traditional search paradigms.

Summary

  • The paper surveys information retrieval model architectures, tracing their evolution from traditional methods to modern neural approaches, focusing on architectural design rather than training methodologies.
  • It details various model types, including Boolean, vector space, probabilistic, and statistical language models, LTR, neural ranking models (DSSM, DRMM, MatchPyramid), transformer-based models such as BERT and ColBERT, and applications of LLMs.
  • The survey identifies future directions and challenges in IR model architectures, including performance optimization, scalability, handling multimodal and multilingual data, and adaptation for autonomous search agents.

This paper presents a survey of model architectures in Information Retrieval (IR), emphasizing backbone models for feature extraction and end-to-end system architectures for relevance estimation. The survey focuses on architectural considerations, intentionally separating them from training methodologies.

The paper traces the evolution of IR systems from traditional term-based methods, such as Boolean and vector space models, to neural approaches, focusing on transformer-based models and LLMs. It concludes with a discussion of challenges and future directions, including architectural optimizations for performance and scalability, handling of multimodal and multilingual data, and adaptation to new application domains.

The paper begins by defining the ad hoc retrieval task: given a query $\mathcal{Q}$, the objective is to find a ranked list of $k$ documents, denoted $\{\mathcal{D}_1, \mathcal{D}_2, \ldots, \mathcal{D}_k\}$, that exhibit the highest relevance to $\mathcal{Q}$. Performance is measured using standard IR metrics such as Mean Reciprocal Rank (MRR), Recall, and normalized Discounted Cumulative Gain (nDCG).
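
To make the evaluation setting concrete, below is a minimal sketch of two of these metrics in Python (binary labels assumed for MRR, graded labels for nDCG; the function names are illustrative):

```python
import math

def mean_reciprocal_rank(ranked_labels_per_query):
    """MRR: average over queries of 1/rank of the first relevant document.
    Input: list of per-query binary relevance lists, ordered by rank."""
    total = 0.0
    for labels in ranked_labels_per_query:
        for rank, rel in enumerate(labels, start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(ranked_labels_per_query)

def ndcg_at_k(graded_labels, k=10):
    """nDCG@k for one query: DCG of the system's ranking divided by the
    DCG of the ideal (label-sorted) ranking."""
    def dcg(labels):
        return sum((2 ** rel - 1) / math.log2(rank + 1)
                   for rank, rel in enumerate(labels, start=1))
    ideal = dcg(sorted(graded_labels, reverse=True)[:k])
    return dcg(graded_labels[:k]) / ideal if ideal > 0 else 0.0
```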

Traditional IR models are examined, including:

  • Boolean Model: Documents $\mathcal{D}$ are represented as a set of terms $\{t_1, t_2, \dots, t_n\}$, and relevance is determined by a logical implication $\mathcal{D} \rightarrow \mathcal{Q}$.
  • Vector Space Model: Queries and documents are represented as vectors, e.g., $\mathcal{Q} = \langle q_1, q_2, \dots, q_n \rangle$ and $\mathcal{D} = \langle d_1, d_2, \dots, d_n \rangle$. The relevance score is estimated by a similarity function between the query $\mathcal{Q}$ and the document $\mathcal{D}$.
  • Probabilistic Model: The relevance score depends on a set of events $\{x_i\}_{1}^{n}$ representing the occurrence of term $t_i$ in the document. The simplest instance is the binary independence retrieval model, where $$\text{Score}(\mathcal{Q},\mathcal{D}) \propto \sum_{(x_i=1)\in\mathcal{D}} \log \frac{r_i(T-n_i-R+r_i)}{(R-r_i)(n_i-r_i)},$$ where $T$ is the total number of sampled judged documents, $R$ the number of relevant samples, $n_i$ the number of samples containing $t_i$, and $r_i$ the number of relevant samples containing $t_i$.
  • Statistical Language Model: The relevance score is estimated via $\mathcal{P}(\mathcal{D}|\mathcal{Q})$, which by Bayes' rule is directly proportional to $\mathcal{P}(\mathcal{Q}|\mathcal{D})\mathcal{P}(\mathcal{D})$. The main focus is on modeling $\mathcal{P}(\mathcal{Q}|\mathcal{D})$ as a ranking function by treating the query as a set of independent terms, $\mathcal{Q}=\{t_i\}_{i=1}^n$, so that $\mathcal{P}(\mathcal{Q}|\mathcal{D})=\prod_{t_i \in \mathcal{Q}}\mathcal{P}(t_i|\mathcal{D})$. The probability $\mathcal{P}(t_i|\mathcal{D})$ is determined using a statistical language model $\theta_{D}$ representing the document, and relevance is estimated by the log-likelihood $$\text{Score}(\mathcal{Q},\mathcal{D}) = \log\mathcal{P}(\mathcal{Q}|\theta_{D}) = \sum_{t_i \in \mathcal{Q}}\log\mathcal{P}(t_i|\theta_{D}).$$ A minimal sketch of this query-likelihood scoring appears after this list.
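
As a concrete instance of the statistical language modeling approach, here is a minimal sketch of query-likelihood scoring; the Jelinek-Mercer smoothing and its parameter value are illustrative assumptions, not choices prescribed by the survey:

```python
import math
from collections import Counter

def query_likelihood_score(query_terms, doc_terms, collection_counts,
                           collection_len, lam=0.5):
    """Score a document by log P(Q | theta_D) under a unigram model theta_D.
    Jelinek-Mercer smoothing mixes the document distribution with the
    collection distribution so unseen query terms do not zero out the score."""
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for t in query_terms:
        p_doc = doc_counts[t] / doc_len if doc_len else 0.0
        p_coll = collection_counts.get(t, 0) / collection_len
        p = lam * p_doc + (1 - lam) * p_coll
        score += math.log(p) if p > 0 else math.log(1e-12)  # guard for OOV terms
    return score
```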

Learning-to-Rank (LTR) models apply supervised machine learning to numerical features. For each $(\mathcal{Q}_i, \mathcal{D}_i)$ pair, a $k$-dimensional feature vector $\mathbf{x}_i \in \mathbb{R}^{k}$ and a relevance label $\mathbf{y}_i$ are provided to the ranking model $f$. The model is trained to minimize the empirical loss on a labeled training set $\Psi$: $$\mathcal{L} = \frac{1}{|\Psi|} \sum_{(\mathbf{x}_i, \mathbf{y}_i) \in \Psi} l(f_{\theta}(\mathbf{x}_i), \mathbf{y}_i).$$ LTR models include ML-based models such as RankSVM and LambdaMART (based on Gradient Boosted Decision Trees (GBDT)), as well as neural LTR models like RankNet and LambdaRank.
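
To illustrate the pairwise idea behind RankNet, a minimal PyTorch sketch (the network shape and feature dimensionality are illustrative assumptions):

```python
import torch
import torch.nn as nn

class PairwiseRanker(nn.Module):
    """Tiny RankNet-style scorer: maps a feature vector x to a scalar score f(x)."""
    def __init__(self, num_features=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_features, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

def ranknet_loss(score_pos, score_neg):
    """Pairwise logistic loss log(1 + exp(s_neg - s_pos)): pushes the model
    to score the more-relevant document of each pair higher."""
    return nn.functional.softplus(score_neg - score_pos).mean()

# Usage with random stand-ins for LTR feature vectors:
model = PairwiseRanker()
x_pos, x_neg = torch.randn(16, 10), torch.randn(16, 10)
loss = ranknet_loss(model(x_pos), model(x_neg))
loss.backward()
```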

Neural ranking models use deep neural networks to learn feature representations directly from raw text. Depending on how queries interact with documents, these models are divided into representation-based models and interaction-based models. Representation-based models, such as the Deep Structured Semantic Model (DSSM), independently encode queries and documents into a latent vector space. Interaction-based models process queries and documents jointly through neural networks. MatchPyramid employs CNNs over the interaction matrix between query and document terms, while the Deep Relevance Matching Model (DRMM) constructs matching histograms for each query term.
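
A minimal sketch of the two interaction patterns, assuming some encoder has already produced the vectors (single vectors for the representation-based case, per-term vectors for the interaction-based case):

```python
import numpy as np

def representation_score(q_vec, d_vec):
    """Representation-based (e.g., DSSM): query and document are encoded
    independently; relevance is a similarity such as cosine."""
    return float(q_vec @ d_vec /
                 (np.linalg.norm(q_vec) * np.linalg.norm(d_vec)))

def interaction_matrix(q_term_vecs, d_term_vecs):
    """Interaction-based (e.g., MatchPyramid): build a term-by-term similarity
    matrix first; a downstream network (a CNN in MatchPyramid) consumes it."""
    return q_term_vecs @ d_term_vecs.T  # shape (|Q|, |D|)
```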

The paper discusses IR architectures based on pre-trained transformers, with a focus on BERT-type encoder models. BERT's success is attributed to multi-head attention and large-scale pre-training. The paper covers text reranking, learned dense retrieval, learned sparse retrieval (LSR), and multi-vector representations. For text reranking, models like monoBERT concatenate $(\mathcal{Q}, \mathcal{D})$ as input and output a relevance score. Learned dense retrieval uses bi-encoders to encode queries and documents separately, computing relevance with similarity functions. Learned sparse retrieval also uses a bi-encoder architecture but transforms documents into sparse vectors for faster retrieval. Multi-vector representations, exemplified by ColBERT, represent each token in the query and document as a contextualized vector, enabling finer-grained interaction.
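
A minimal sketch of ColBERT-style late interaction (MaxSim) over token vectors; random arrays stand in for the contextualized embeddings an encoder would produce:

```python
import numpy as np

def maxsim_score(q_tok_vecs, d_tok_vecs):
    """For each query token, take its maximum similarity over all document
    tokens, then sum across query tokens. ColBERT L2-normalizes token vectors
    so the dot product below acts as cosine similarity."""
    sims = q_tok_vecs @ d_tok_vecs.T        # (|Q|, |D|) token-level similarities
    return float(sims.max(axis=1).sum())    # MaxSim per query token, then sum

# Usage with random stand-ins for contextualized token embeddings:
q = np.random.randn(8, 128)    # 8 query tokens, 128-dim vectors
d = np.random.randn(120, 128)  # 120 document tokens
print(maxsim_score(q, d))
```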

The paper also discusses the use of LLMs for IR tasks. LLMs have exhibited proficiency in language understanding and generation and can be used for feature extraction and relevance estimation. Adopting an LLM as the backbone for a bi-encoder retrieval model has improved performance compared to smaller models like BERT. LLMs can also be fine-tuned as cross-encoder rerankers or used as unsupervised rerankers through prompting techniques. Generative retrieval, which bypasses the indexing step by using autoregressive LLMs to directly generate document identifiers (DocIDs), is also discussed.
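
As an illustration of prompting-based unsupervised reranking, a minimal pointwise sketch; `call_llm` is a hypothetical stand-in for whatever LLM client is available, and the prompt wording is only illustrative:

```python
def call_llm(prompt: str) -> str:
    """Hypothetical LLM client; replace with a real API or local model call."""
    raise NotImplementedError

def pointwise_rerank(query, documents):
    """Ask the LLM for a graded relevance judgment per document, then sort.
    Listwise variants instead pass several candidates in one prompt and ask
    the model to output an ordering."""
    scored = []
    for doc in documents:
        prompt = (f"Query: {query}\nDocument: {doc}\n"
                  "On a scale of 0 to 3, how relevant is the document "
                  "to the query? Answer with a single number.")
        reply = call_llm(prompt).strip()
        score = int(reply[0]) if reply[:1].isdigit() else 0
        scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]
```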

Finally, the survey identifies emerging directions and challenges in IR, including the need for better feature-extraction models and more flexible relevance estimators, along with open questions about the end "user" of retrieval and autonomous search agents. Key areas for model improvement include parallelizable and low-precision training, inference optimization, data efficiency, multimodality and multilinguality, and transformer alternatives such as linear RNNs and state space models.
