Data Augmentation for Sample Efficient and Robust Document Ranking (2311.15426v1)

Published 26 Nov 2023 in cs.IR

Abstract: Contextual ranking models have delivered impressive performance improvements over classical models in the document ranking task. However, these highly over-parameterized models tend to be data-hungry and require large amounts of data even for fine-tuning. In this paper, we propose data-augmentation methods for effective and robust ranking performance. One of the key benefits of data augmentation is sample efficiency, i.e., learning effectively when only a small amount of training data is available. We propose supervised and unsupervised data augmentation schemes that create training data from parts of the relevant documents in query-document pairs. We then adapt a family of contrastive losses to the document ranking task so that they can exploit the augmented data to learn an effective ranking model. Our extensive experiments on subsets of the MS MARCO and TREC-DL test sets show that data augmentation, along with the ranking-adapted contrastive losses, yields performance improvements across most dataset sizes. Beyond sample efficiency, we conclusively show that data augmentation results in robust models when transferred to out-of-domain benchmarks. Our performance improvements on in-domain and, more prominently, out-of-domain benchmarks show that augmentation regularizes the ranking model and improves its robustness and generalization capability.
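
The sketch below illustrates the general idea summarized in the abstract: treating passages of a relevant document as extra positives for a query and training with a ranking-adapted, InfoNCE-style supervised contrastive loss. It is a minimal illustration, not the authors' exact pipeline; the helper names, the sliding-window augmentation, and the hyperparameters (window, stride, temperature) are assumptions made for this example.

```python
# Minimal sketch (not the paper's exact method): augment each relevant
# query-document pair with passage-level positives, then train with a
# supervised InfoNCE-style contrastive loss over in-batch negatives.
import torch
import torch.nn.functional as F


def passage_augment(document: str, window: int = 64, stride: int = 32) -> list:
    """Create pseudo-positive passages by sliding a word window over the document."""
    words = document.split()
    passages = [
        " ".join(words[i:i + window])
        for i in range(0, max(len(words) - window + 1, 1), stride)
    ]
    return passages or [document]


def contrastive_ranking_loss(query_emb: torch.Tensor,
                             doc_embs: torch.Tensor,
                             positive_mask: torch.Tensor,
                             temperature: float = 0.1) -> torch.Tensor:
    """Pull all (augmented) positives toward the query; push other docs away."""
    # Cosine similarity between the query and every candidate document/passage.
    sims = F.cosine_similarity(query_emb.unsqueeze(0), doc_embs) / temperature  # (N,)
    log_probs = F.log_softmax(sims, dim=0)
    # Average the log-likelihood over all positives (augmented passages included).
    return -(log_probs * positive_mask).sum() / positive_mask.sum()


if __name__ == "__main__":
    torch.manual_seed(0)
    dim = 128
    # Toy embeddings standing in for encoder outputs (e.g. a BERT-style ranker).
    query = torch.randn(dim)
    docs = torch.randn(8, dim)  # 3 augmented positives + 5 in-batch negatives
    positives = torch.tensor([1., 1., 1., 0., 0., 0., 0., 0.])
    loss = contrastive_ranking_loss(query, docs, positives)
    print(f"contrastive ranking loss: {loss.item():.4f}")
```

In the paper's setting the embeddings would come from a contextual encoder and the negatives from the batch or from hard-negative mining; random tensors stand in here so the snippet runs end to end.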

Authors (5)
  1. Abhijit Anand (10 papers)
  2. Jurek Leonhardt (11 papers)
  3. Jaspreet Singh (41 papers)
  4. Koustav Rudra (14 papers)
  5. Avishek Anand (81 papers)
Citations (3)
