Learning Effective Representations for Retrieval Using Self-Distillation with Adaptive Relevance Margins

Published 31 Jul 2024 in cs.IR | (2407.21515v2)

Abstract: Representation-based retrieval models, so-called bi-encoders, estimate the relevance of a document to a query by calculating the similarity of their respective embeddings. Current state-of-the-art bi-encoders are trained using an expensive training regime involving knowledge distillation from a teacher model and batch sampling. Instead of relying on a teacher model, we contribute a novel parameter-free loss function for self-supervision that exploits the pre-trained language modeling capabilities of the encoder model as a training signal, eliminating the need for batch sampling by performing implicit hard negative mining. We investigate the capabilities of our proposed approach through extensive experiments, demonstrating that self-distillation can match the effectiveness of teacher distillation using only 13.5% of the data, while offering a speedup in training time between 3x and 15x compared to parametrized losses. All code and data are made openly available.
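
The bi-encoder scoring step described in the abstract can be illustrated with a minimal sketch. It assumes cosine similarity as the similarity function (the abstract does not say which one the paper uses), and the embeddings are hand-written toy vectors rather than encoder output. The names `cosine_similarity` and `rank_documents` are hypothetical helpers introduced for illustration only. The sketch does not cover the proposed loss function.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for this sketch: a zero vector matches nothing
    return dot / (norm_u * norm_v)


def rank_documents(
    query_emb: Sequence[float],
    doc_embs: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Score every document against the query and sort by descending score."""
    scored = [
        (doc_id, cosine_similarity(query_emb, emb))
        for doc_id, emb in doc_embs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings (assumed values, not produced by any model).
query = [1.0, 0.0, 1.0]
documents = {
    "doc_a": [0.9, 0.1, 0.8],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.5, 0.5, 0.5],
}

ranking = rank_documents(query, documents)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.4f}")
```

Because query and document are embedded independently, document embeddings can be computed once and reused for every query; only the similarity computation happens at query time.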
