Learning Effective Representations for Retrieval Using Self-Distillation with Adaptive Relevance Margins
Abstract: Representation-based retrieval models, so-called bi-encoders, estimate the relevance of a document to a query by calculating the similarity of their respective embeddings. Current state-of-the-art bi-encoders are trained using an expensive training regime involving knowledge distillation from a teacher model and batch sampling. Instead of relying on a teacher model, we contribute a novel parameter-free loss function for self-supervision that exploits the pre-trained language modeling capabilities of the encoder model as a training signal, eliminating the need for batch sampling by performing implicit hard negative mining. We investigate the capabilities of our proposed approach through extensive experiments, demonstrating that self-distillation can match the effectiveness of teacher distillation using only 13.5% of the data, while offering a speedup in training time between 3x and 15x compared to parametrized losses. All code and data are made openly available.
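The scoring step the abstract describes for bi-encoders, relevance as the similarity between a query embedding and a document embedding, can be illustrated with a minimal sketch. It covers only the similarity computation, not the proposed loss. The toy vectors, the function names, and the choice of cosine similarity are assumptions made for illustration; in practice the embeddings would come from an encoder model, and the paper may use a different similarity function.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_documents(query_embedding, doc_embeddings):
    """Return (doc_id, score) pairs sorted by descending similarity."""
    scores = {
        doc_id: cosine_similarity(query_embedding, embedding)
        for doc_id, embedding in doc_embeddings.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy embeddings standing in for encoder outputs.
query = [1.0, 0.0, 1.0]
docs = {
    "d1": [1.0, 0.0, 1.0],  # same direction as the query
    "d2": [0.0, 1.0, 0.0],  # orthogonal to the query
    "d3": [1.0, 1.0, 0.0],  # partially aligned with the query
}

ranking = rank_documents(query, docs)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

Because query and document are embedded independently, document embeddings can be computed once and reused for every query; only the similarity computation happens at query time.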