Overview of the TREC 2019 deep learning track (2003.07820v2)

Published 17 Mar 2020 in cs.IR, cs.CL, and cs.LG

Abstract: The Deep Learning Track is a new track for TREC 2019, with the goal of studying ad hoc ranking in a large data regime. It is the first track with large human-labeled training sets, introducing two sets corresponding to two tasks, each with rigorous TREC-style blind evaluation and reusable test sets. The document retrieval task has a corpus of 3.2 million documents with 367 thousand training queries, for which we generate a reusable test set of 43 queries. The passage retrieval task has a corpus of 8.8 million passages with 503 thousand training queries, for which we generate a reusable test set of 43 queries. This year 15 groups submitted a total of 75 runs, using various combinations of deep learning, transfer learning and traditional IR ranking methods. Deep learning runs significantly outperformed traditional IR runs. Possible explanations for this result are that we introduced large training data and we included deep models trained on such data in our judging pools, whereas some past studies did not have such training data or pooling.

Authors (5)
  1. Nick Craswell (51 papers)
  2. Bhaskar Mitra (78 papers)
  3. Emine Yilmaz (66 papers)
  4. Daniel Campos (62 papers)
  5. Ellen M. Voorhees (5 papers)
Citations (395)

Summary

An Analytical Overview of the TREC 2019 Deep Learning Track

The paper "Overview of the TREC 2019 Deep Learning Track" provides an exhaustive account of the newly established Deep Learning Track at TREC 2019. This track was introduced to scrutinize ad hoc ranking within large data sets, featuring both a document retrieval and a passage retrieval task. It is significant for giving heightened attention to the deployment of deep learning (DL) methods compared to traditional information retrieval (IR) techniques within such large-scale environments. Crucially, the research presented here underscores the efficacy of DL models when significant human-labeled data for training is accessible.

Key Contributions

  1. Introduction of Large Human-Labeled Datasets: For the first time, TREC released large human-labeled training sets for ad hoc retrieval. The document retrieval task used a corpus of 3.2 million documents with over 367,000 training queries, while the passage retrieval task used a corpus of approximately 8.8 million passages with over 503,000 training queries.
  2. Comparison of Retrieval Models: DL models, particularly those leveraging pretrained neural language models such as BERT, delivered significant performance improvements over traditional IR methods. The paper attributes this performance leap to the availability of large training data and to the inclusion of DL runs in the judging pools.
  3. Reranking vs. Full Retrieval Evaluation: The track delineated two subtasks: a reranking subtask, in which participants rerank the output of a fixed first-stage retrieval, and a full retrieval subtask, in which teams implement the entire indexing and retrieval pipeline. This setup tested DL methods in both settings, and reranking approaches performed comparably to full retrieval; a minimal sketch of such a two-stage pipeline follows this list.
  4. Test Collection Robustness: Validating the reliability of the new test collections was a focal point, with dynamic relevance feedback methodologies such as HiCAL used to build test collections that are robust and reusable while providing graded relevance judgments; a schematic of this judging loop also appears below.
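To make the reranking subtask concrete, here is a minimal sketch of a two-stage retrieve-then-rerank pipeline of the kind many groups submitted: a traditional term-matching retriever generates candidates, and a BERT-style cross-encoder rescores them. The libraries (rank_bm25, sentence-transformers) and the model name are illustrative assumptions, not the track's prescribed setup.

```python
# Minimal two-stage retrieve-then-rerank sketch (illustrative, not the
# official track pipeline). Assumes: pip install rank_bm25 sentence-transformers
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

# Toy corpus standing in for the track's 8.8M-passage collection.
passages = [
    "TREC runs ad hoc retrieval evaluations with pooled judgments.",
    "BERT is a pretrained transformer language model.",
    "BM25 is a classic probabilistic term-weighting ranking function.",
]

# Stage 1: BM25 candidate generation over whitespace-tokenized text.
bm25 = BM25Okapi([p.lower().split() for p in passages])
query = "pretrained language models for ranking"
scores = bm25.get_scores(query.lower().split())
top_k = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)[:100]

# Stage 2: a cross-encoder scores each (query, passage) pair jointly,
# as the BERT-based rerankers in the track did. Model name is an assumption.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pair_scores = reranker.predict([(query, passages[i]) for i in top_k])
reranked = [i for _, i in sorted(zip(pair_scores, top_k), reverse=True)]
print("Final ranking:", reranked)
```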
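The dynamic relevance feedback in item 4 can be viewed as a continuous active learning loop: a classifier trained on the judgments collected so far selects the next document for a human to judge. Below is a schematic of that general idea, assuming scikit-learn and a hypothetical judge() function standing in for an assessor; it is not HiCAL's actual implementation.

```python
# Schematic continuous active learning loop for collecting judgments;
# a simplified illustration of the idea behind tools like HiCAL.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def judge(doc):
    # Hypothetical stand-in for a human assessor (1 = relevant, 0 = not).
    return int("retrieval" in doc)

docs = [
    "deep learning for document retrieval",
    "neural ranking with transformers",
    "cooking recipes for pasta",
    "passage retrieval benchmarks",
]
X = TfidfVectorizer().fit_transform(docs)

# Seed with one relevant and one non-relevant judgment (the classifier
# needs both classes), then repeatedly judge the unjudged document the
# current model scores as most likely relevant.
judged = {0: judge(docs[0]), 2: judge(docs[2])}
while len(judged) < len(docs):
    clf = LogisticRegression().fit(X[list(judged)], list(judged.values()))
    unjudged = [i for i in range(len(docs)) if i not in judged]
    probs = clf.predict_proba(X[unjudged])[:, 1]
    nxt = unjudged[max(range(len(unjudged)), key=lambda j: probs[j])]
    judged[nxt] = judge(docs[nxt])

print(judged)  # judgments collected in relevance-feedback order
```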

Numerical Results

Deep learning runs notably outperformed traditional IR runs, as measured by metrics such as NDCG@10. For instance, the best NDCG@10 for document retrieval by a BERT-based DL run was approximately 0.726, versus roughly 0.548 for the top traditional method. Similarly, in passage retrieval, DL models excelled, with NDCG@10 reaching as high as 0.764.
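For reference, NDCG@k divides the discounted cumulative gain of a ranking by that of the ideal reordering, with DCG@k = sum over ranks i = 1..k of (2^rel_i - 1) / log2(i + 1). The sketch below uses this common exponential-gain formulation; exact gain and discount conventions vary across tools (the track itself used trec_eval), so treat it as illustrative.

```python
import math

def dcg_at_k(rels, k=10):
    """Discounted cumulative gain with exponential gain 2^rel - 1."""
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg_at_k(rels, k=10):
    """DCG of the ranking divided by the DCG of the ideal reordering."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

# Graded relevance (0-3, as in the track's judgments) of the top-ranked docs.
print(round(ndcg_at_k([3, 2, 3, 0, 1, 2, 0, 0, 1, 0]), 3))
```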

Implications for AI Research

This research has broad implications for the future of AI and retrieval technologies. The pivotal role of pretrained models in achieving high ranking effectiveness suggests a likely trajectory for IR systems. Moreover, the work points toward a trend in which DL models may overtake traditional models as large-scale labeled datasets become the norm across tasks.

Additionally, by showcasing the effectiveness of re-ranking strategies, the paper paves the way for more efficient retrieval systems, which can balance computational resource usage with retrieval effectiveness. Furthermore, the TREC approach of evaluating systems with a robust blind test is critical for providing realistic benchmarks beyond synthetic or proprietary datasets.

Future Directions

Future development, as suggested by this track, should include continued evaluation of diverse model architectures and exploration of the relationship between training data volume and ranking performance. Refining methodologies for creating reliable, reusable test collections while integrating novel DL techniques in IR remains an exciting direction. Future iterations would also benefit from a greater diversity of strong non-neural runs, to more robustly establish where DL stands relative to traditional methods.

In conclusion, the TREC 2019 Deep Learning Track harnesses the power of DL in the context of large data regimes and opens numerous avenues for subsequent research and cross-comparison of IR strategies. This establishes a stepping stone for further innovations in both the practical deployment of retrieval systems and theoretical advancement in the AI domain.