- The paper highlights the robust evaluation framework that uses large labeled datasets to assess ad hoc ranking methods.
- It demonstrates the superior performance of pre-trained neural models, like BERT, compared to traditional IR approaches.
- The study emphasizes the benefits of incorporating ORCAS click data to enhance training efficiency and retrieval metrics.
An Analysis of the TREC 2020 Deep Learning Track Outcomes
The 2020 Text REtrieval Conference (TREC) Deep Learning Track provides a structured evaluation framework for assessing ad hoc ranking methods in a large-training-data regime. The track, in its second iteration, again focused on two primary evaluation tasks, document retrieval and passage retrieval, using extensive human-labeled queries to explore the effectiveness of various ranking methodologies. This paper examines the methodologies applied, the findings uncovered, and the metric analyses produced by the track.
Key Aspects of TREC 2020 Deep Learning Track
The evaluation consisted of two major tasks, document retrieval and passage retrieval, each with a rigorous underlying methodology. The process was distinctive in its use of comprehensive relevance labeling and a blind submission process intended to reduce biases associated with overfitting. The training data was further augmented with the ORCAS click dataset, giving a multi-faceted perspective on retrieval performance and yielding reusable test collections that serve broader research needs.
Document and Passage Retrieval Task Performance
A significant insight from the track was the superior performance of runs employing pre-trained neural language models, such as BERT, over traditional information retrieval (IR) methods across both retrieval tasks. Runs built on these models achieved measurably higher effectiveness than classical term-matching baselines, especially in passage retrieval, where vocabulary mismatch between queries and short texts is handled more effectively by deep learning methods.
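To make the reranking setup concrete, the sketch below scores query-passage pairs with a BERT-style cross-encoder and keeps the highest-scoring passages. It is a minimal illustration rather than any team's actual submission; the Hugging Face `transformers` API and the `cross-encoder/ms-marco-MiniLM-L-6-v2` checkpoint are assumptions about the reader's environment, and any BERT-style sequence-classification reranker would work the same way.

```python
# Minimal BERT cross-encoder reranking sketch (illustrative, not a track submission).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "cross-encoder/ms-marco-MiniLM-L-6-v2"  # assumed publicly available checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.eval()

def rerank(query: str, passages: list[str], k: int = 10) -> list[tuple[str, float]]:
    """Score each (query, passage) pair jointly and return the top-k passages."""
    inputs = tokenizer([query] * len(passages), passages,
                       padding=True, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        scores = model(**inputs).logits.squeeze(-1)  # one relevance logit per pair
    ranked = sorted(zip(passages, scores.tolist()), key=lambda x: x[1], reverse=True)
    return ranked[:k]

# Typical usage: rerank the top candidates returned by a first-stage retriever such as BM25.
print(rerank("what causes vocabulary mismatch in search",
             ["Synonymy and paraphrase cause term mismatch between queries and documents.",
              "The stock market closed higher on Tuesday."]))
```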
When comparing end-to-end retrieval against reranking approaches, the data suggested that while end-to-end retrieval can recall more diverse and potentially relevant results, this did not translate into pronounced advantages in overall effectiveness as measured by metrics such as NDCG@10.
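For concreteness, NDCG@10 accumulates the gain of each relevant result over the top ten ranks, discounts that gain by rank position, and normalizes by the best achievable ordering. The self-contained sketch below uses the exponential gain formulation (2^rel - 1); trec_eval's ndcg_cut variant uses a linear gain, but the normalization idea is the same, and the graded labels in the example are hypothetical.

```python
import math

def dcg(gains: list[int], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k graded relevance labels."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_10(run_gains: list[int], all_gains: list[int]) -> float:
    """NDCG@10: DCG of the system ranking divided by DCG of the ideal ranking."""
    ideal = dcg(sorted(all_gains, reverse=True))
    return dcg(run_gains) / ideal if ideal > 0 else 0.0

# Hypothetical graded labels (0-3) for the documents a run returned, in rank order,
# plus the full judged pool used to form the ideal ranking.
print(round(ndcg_at_10([3, 0, 2, 1, 0, 0, 0, 0, 0, 0], [3, 3, 2, 2, 1, 1, 0, 0]), 4))
```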
Utilization of ORCAS Data
The integration of the ORCAS click dataset had marked implications for training efficiency and performance. Although ORCAS data was not essential for achieving top results, several runs showed improved retrieval metrics when it was used, highlighting the benefit of larger, realistic datasets closely aligned with actual user behavior.
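As an illustration of one way click data such as ORCAS can be folded into training, the sketch below turns a click log into (query, clicked document, sampled negative) triples of the kind consumed by pairwise ranking losses. The tab-separated layout, file name, and helper names are assumptions made for illustration and do not reflect the exact ORCAS schema or any participant's pipeline.

```python
# Sketch: turning a click log into pairwise training triples (assumed 2-column TSV layout).
import csv
import random

def build_triples(click_log_path: str, all_docids: list[str], n_negatives: int = 1):
    """Yield (query, positive_docid, negative_docid) triples for a pairwise loss."""
    with open(click_log_path, newline="", encoding="utf-8") as f:
        for query, clicked_docid in csv.reader(f, delimiter="\t"):
            for _ in range(n_negatives):
                negative = random.choice(all_docids)
                if negative != clicked_docid:  # avoid sampling the clicked doc itself
                    yield query, clicked_docid, negative

# Usage sketch: feed the triples to a pairwise ranking loss during fine-tuning, e.g.
# for q, pos, neg in build_triples("orcas_clicks.tsv", corpus_docids):
#     loss = max(0.0, 1.0 - score(q, pos) + score(q, neg))  # hinge loss, schematic
```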
Comparative Analysis Between NIST and MS MARCO Labels
A comparative analysis between the NIST evaluations (comprehensive, graded labels) and the MS MARCO labels (sparse labels) showed respectable agreement, notably on the passage retrieval task. Document retrieval, however, exhibited a weaker correlation, likely because many methods are tailored to the sparse, single-relevant-result training labels of MS MARCO.
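One way to quantify this kind of agreement is to order the submitted runs by their mean score under each label set and compute a rank correlation such as Kendall's tau; the sketch below does so with SciPy, using run names and scores that are invented purely for illustration.

```python
# Sketch: agreement between system orderings under two label sets (e.g. NIST vs MS MARCO).
# The run names and scores below are invented for illustration only.
from scipy.stats import kendalltau

runs = ["runA", "runB", "runC", "runD", "runE"]
ndcg_nist  = [0.71, 0.69, 0.64, 0.58, 0.52]   # e.g. NDCG@10 under NIST graded labels
rr_msmarco = [0.40, 0.42, 0.33, 0.31, 0.30]   # e.g. MRR under sparse MS MARCO labels

tau, p_value = kendalltau(ndcg_nist, rr_msmarco)
print(f"Kendall tau between the two system orderings: {tau:.3f} (p={p_value:.3f})")
```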
Implications and Future Prospects
The outcomes of the TREC 2020 Deep Learning Track underscore the pivotal role of neural models in advancing retrieval tasks, with implications extending to fields requiring efficient data-driven inference from large datasets. The deployment of deep learning in end-to-end systems remains an exciting domain for future work, as it may facilitate substantial improvements through more integrated multi-stage retrieval architectures.
Further, refining evaluation paradigms will be vital to ensure fair and balanced progress in information retrieval, addressing any anomalies such as those observed between different label sets and the influence of ORCAS data. Overall, these results are poised to inform the next wave of advancements in retrieval systems, prompting new research questions and technological innovations aligned with the needs of the evolving information retrieval landscape.