
Learning to Match Using Local and Distributed Representations of Text for Web Search (1610.08136v1)

Published 26 Oct 2016 in cs.IR

Abstract: Models such as latent semantic analysis and those based on neural embeddings learn distributed representations of text, and match the query against the document in the latent semantic space. In traditional information retrieval models, on the other hand, terms have discrete or local representations, and the relevance of a document is determined by the exact matches of query terms in the body text. We hypothesize that matching with distributed representations complements matching with traditional local representations, and that a combination of the two is favorable. We propose a novel document ranking model composed of two separate deep neural networks, one that matches the query and the document using a local representation, and another that matches the query and the document using learned distributed representations. The two networks are jointly trained as part of a single neural network. We show that this combination or `duet' performs significantly better than either neural network individually on a Web page ranking task, and also significantly outperforms traditional baselines and other recently proposed models based on neural networks.

Overview of the 'Duet' Document Ranking Model

The paper presents a novel document ranking model, termed the "duet" architecture. This model integrates two distinct deep neural networks (DNNs) that harness both local and distributed text representations to enhance retrieval accuracy in web search tasks. The central hypothesis is that the combination of these representations complements each other, offering a robust mechanism for improving document retrieval performance over individual models.

Model Architecture

The duet architecture consists of:

  1. Local Model: This sub-network operates on exact term matches, in the spirit of traditional IR models such as BM25 and query likelihood (QL). It takes as input an interaction matrix of exact query-term occurrences in the document, preserving the positional information needed to recognize term proximity.
  2. Distributed Model: This sub-network uses learned neural embeddings to capture semantic similarity, projecting queries and documents into a shared latent space. By representing text with character n-grams, the distributed model addresses vocabulary mismatch, detecting synonyms and related terms beyond exact matches.

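The local model's input can be illustrated with a minimal sketch. This is not the paper's exact pipeline (tokenization and matrix dimensions here are illustrative); it only shows the idea of a binary exact-match interaction matrix whose rows and columns preserve term order:

```python
import numpy as np

def interaction_matrix(query_terms, doc_terms):
    """Binary exact-match matrix: entry (i, j) is 1 if query term i
    equals document term j. Rows follow query order and columns follow
    document order, so positional/proximity information is preserved."""
    m = np.zeros((len(query_terms), len(doc_terms)), dtype=np.float32)
    for i, q in enumerate(query_terms):
        for j, d in enumerate(doc_terms):
            if q == d:
                m[i, j] = 1.0
    return m

q = "deep learning ranking".split()
d = "learning to rank with deep neural networks".split()
X = interaction_matrix(q, d)
print(X.shape)  # (3, 7)
```

Note that "ranking" and "rank" produce no match in this matrix; capturing such near-misses is exactly what the distributed model is for.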
These networks are jointly optimized within a unified framework, allowing them to learn complementary aspects of relevance. The duet architecture aims to balance fine-grained term-specific signals with broader semantic relationships.
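The joint composition can be sketched as follows. The real sub-networks are deep (convolutional) models; here each is reduced to a single linear layer purely to show how the duet score is the sum of the two sub-scores, so that one loss trains both networks jointly. All names and shapes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters; the paper's sub-networks are deep, not single layers.
W_local = rng.normal(size=(3 * 7,))  # flattened interaction matrix -> score
W_dist = rng.normal(size=(8,))       # latent match features -> score

def local_score(interaction):
    """Exact-match sub-network (sketch): score from the interaction matrix."""
    return float(interaction.reshape(-1) @ W_local)

def distributed_score(q_vec, d_vec):
    """Semantic sub-network (sketch): element-wise match in latent space."""
    return float((q_vec * d_vec) @ W_dist)

def duet_score(interaction, q_vec, d_vec):
    # The duet's relevance score is the sum of the two sub-scores, so
    # gradients from a single ranking loss flow into both networks.
    return local_score(interaction) + distributed_score(q_vec, d_vec)

example_interaction = np.ones((3, 7), dtype=np.float32)
example_q = rng.normal(size=(8,))
example_d = rng.normal(size=(8,))
score = duet_score(example_interaction, example_q, example_d)
```

The additive combination is the key design choice: neither sub-network needs to model relevance alone, and each can specialize in the signal it captures best.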

Empirical Evaluation

The paper reports substantial improvements in document ranking tasks when using the duet model. Key findings include:

  • The duet model significantly outperformed both the local and distributed models individually across various testing conditions.
  • It demonstrated considerable improvement over traditional baselines (e.g., BM25, LSA) and contemporary neural models (e.g., DSSM, CDSSM, DRMM).

The performance gain was particularly notable on more frequent queries, where semantic matching contributes most. The analysis also showed that training with human-judged negative examples is more effective than sampling negatives at random, an important consideration when preparing training data for IR tasks.
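The training objective behind this finding can be sketched as a listwise softmax cross-entropy loss over one relevant document and several non-relevant ones, where the non-relevant candidates come from judged data rather than random sampling. This sketch assumes the relevant document sits at index 0; the function name is illustrative:

```python
import numpy as np

def softmax_cross_entropy_loss(scores):
    """Listwise ranking loss: maximize the softmax probability of the
    relevant document (index 0) against the non-relevant candidates."""
    z = scores - scores.max()  # subtract max for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[0])

# One relevant document scored against judged non-relevant documents.
loss_judged = softmax_cross_entropy_loss(np.array([2.0, 1.5, 1.2, 0.9]))
```

Judged negatives tend to be harder (topically close but non-relevant), which keeps this loss informative; randomly sampled negatives are often trivially distinguishable and yield weaker gradients.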

Implications and Future Directions

The proposed architecture represents an advancement in combining exact and inexact matching for document retrieval. The results underscore the importance of joint learning to leverage both matching types effectively. The discussion opens avenues for:

  • Training deep ranking models on even larger datasets, since the reported results suggest that more data could further boost performance.
  • Investigating more efficient runtime strategies so that the model can be deployed at scale in production search engines, where computational cost is a central constraint.
  • Examining how such models handle tail queries, given that distributed representations may underperform on very rare terms, making exact matching especially important there.

Conclusion

The duet model is a compelling approach that integrates precise term matching with learned semantic representations, allowing the relative contribution of each signal to adapt to the query. By outperforming established methods, it marks a step forward in neural approaches to information retrieval, with potential for further gains as computational resources and training datasets grow.

Authors (3)
  1. Bhaskar Mitra (78 papers)
  2. Fernando Diaz (52 papers)
  3. Nick Craswell (51 papers)
Citations (469)