An Overview of T2Ranking: A Large-Scale Chinese Benchmark for Passage Ranking
The field of Information Retrieval (IR) continually seeks advancements that improve passage ranking—a task critical to enhancing user satisfaction across various applications such as question answering and reading comprehension. While substantial progress has been documented in English-centric research landscapes, analogous advancements for Chinese textual data remain comparatively constrained due to limited large-scale, finely annotated benchmarks. The paper "T2Ranking: A large-scale Chinese Benchmark for Passage Ranking" addresses this discrepancy by introducing a comprehensive dataset tailored to Chinese passage ranking tasks.
Dataset Construction and Features
T2Ranking comprises over 300,000 real-world queries and more than 2 million unique passages. A pivotal feature of the benchmark is its rigorous, fine-grained annotation scheme, which assigns one of four relevance levels to each judged query-passage pair, surpassing the coarse, binary annotation paradigms of preceding datasets. This graded annotation enables a more nuanced evaluation of passage ranking models.
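Graded labels change how rankings are scored: metrics such as nDCG credit a system for placing highly relevant passages earlier, a distinction binary labels cannot express. The following is a minimal sketch of nDCG over hypothetical 4-level labels (0–3), using the linear-gain formulation; it is illustrative and not the paper's exact evaluation code.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked relevance labels."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k):
    """nDCG@k: DCG of the ranking divided by DCG of the ideal ordering."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical 4-level labels (0-3) for passages in the order a model ranked them.
ranking = [3, 0, 2, 1, 0]
print(ndcg_at_k(ranking, 5))
```

With binary labels the second and third passages above would be indistinguishable; the graded version penalizes placing the level-0 passage ahead of the level-2 one.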
The dataset derives its queries from real-world Sogou search logs, with stringent preprocessing and normalization applied to ensure query quality. The passages comprise content scraped from multiple search engines, ensuring breadth and diversity of coverage. To address the latent false negatives that affect existing datasets, T2Ranking annotates all query-passage pairs in its test set. This strategy improves the precision of evaluations and better reflects realistic search application dynamics.
T2Ranking also employs model-based passage segmentation and clustering-based de-duplication. The former preserves semantic integrity within each passage, while the latter removes redundancy and improves annotation efficiency.
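As a rough illustration of de-duplication's role, the sketch below greedily drops near-duplicate passages. It does not reproduce the paper's clustering method; it substitutes a simple character n-gram Jaccard similarity with a threshold, and all data is hypothetical.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams, a cheap proxy for passage content."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity between two n-gram sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def deduplicate(passages, threshold=0.8):
    """Keep a passage only if it is not too similar to any kept representative."""
    kept, kept_ngrams = [], []
    for passage in passages:
        grams = char_ngrams(passage)
        if all(jaccard(grams, g) < threshold for g in kept_ngrams):
            kept.append(passage)
            kept_ngrams.append(grams)
    return kept

docs = ["the quick brown fox jumps",
        "the quick brown fox jumped",   # near-duplicate of the first
        "completely different text"]
print(deduplicate(docs))
```

Removing near-duplicates before annotation means human judgments are not spent on passages that convey the same content, which is the efficiency gain the paper attributes to this step.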
Methodological Innovations
The segmentation model, trained on Wikipedia and other well-written Chinese web articles, exemplifies the benchmark's attention to semantic detail. Furthermore, the benchmark uses active learning strategies to prioritize the most informative samples for annotation, supporting the two-stage (retrieval and re-ranking) passage ranking paradigm.
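The two-stage retrieve-then-re-rank paradigm can be sketched as below. The two scoring functions here are deliberate placeholders (plain term overlap, and overlap weighted by brevity), standing in for a real first-stage retriever such as BM25 or a dual-encoder and a second-stage cross-encoder; the example corpus is hypothetical.

```python
def retrieve(query, corpus, k=100):
    """First stage: a cheap score (term overlap) applied to the whole corpus."""
    q_terms = set(query.split())
    scored = [(len(q_terms & set(p.split())), p) for p in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:k]]

def rerank(query, candidates, k=10):
    """Second stage: a more expensive scorer applied only to the shortlist.
    Here, overlap normalized by passage length stands in for a cross-encoder."""
    q_terms = set(query.split())
    def score(passage):
        terms = passage.split()
        return len(q_terms & set(terms)) / (1 + len(terms))
    return sorted(candidates, key=score, reverse=True)[:k]

corpus = ["beijing is the capital of china",
          "passage ranking benchmarks for chinese retrieval",
          "the capital city beijing"]
top = rerank("capital of china", retrieve("capital of china", corpus))
print(top[0])
```

The design point is cost asymmetry: the first stage must be fast enough to scan the full collection, while the second stage can afford a heavier model because it only sees the shortlist.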
Experimental Validation and Results
The dataset's authors conducted extensive baseline experiments, covering sparse retrieval (e.g., BM25), dense retrieval (e.g., a dual-encoder trained with a BM25 negative-sampling strategy), and re-ranking with cross-encoders. Results showed solid performance with clear room for improvement, with dense retrieval models outperforming traditional sparse methods. The fine-grained relevance labels also pose a challenge: recall metrics are lower than on comparable English datasets, leaving ample opportunity for model refinement and innovation.
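For reference, the sparse BM25 baseline mentioned above can be implemented in a few lines. This is the standard Okapi BM25 formula with common default parameters (k1=1.2, b=0.75), not necessarily the paper's exact configuration, and the toy documents are hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency of each query term across the collection.
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[t] * (k1 + 1) / norm
        scores.append(s)
    return scores

docs = [["passage", "ranking", "with", "bm25"],
        ["dense", "retrieval", "models"],
        ["bm25", "is", "a", "sparse", "ranking", "baseline"]]
print(bm25_scores(["bm25", "ranking"], docs))
```

Because BM25 relies purely on lexical overlap, it misses semantically relevant passages that use different wording, which is the gap the dense dual-encoder baselines are designed to close.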
Theoretical and Practical Implications
The introduction of T2Ranking holds substantial implications for both theoretical and practical advancements in the domain of IR. The meticulous annotation scheme encourages the development of models with heightened sensitivity to nuanced contextual signals, potentially steering future research towards more sophisticated semantic understanding algorithms.
Practically, T2Ranking stands to significantly elevate the efficiency and efficacy of search engines and other retrieval-based services in Chinese linguistic domains. By laying a foundation for improved passage ranking model training, the benchmark could lead to more nuanced and contextually aware tools—benefits that are particularly pivotal in the expanding landscape of AI applications within non-English languages.
In conclusion, T2Ranking represents a step forward in diversifying and enriching the suite of resources available to the IR community, laying the groundwork for future explorations that could bridge the performance gap between English and Chinese passage ranking tasks. As researchers further explore and build upon such benchmarks, the results could inform a broad array of machine learning applications, enhancing IR capability across various global contexts.