GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning (2402.16829v1)
Abstract: Embedding models are integral to AI applications like semantic search, personalized recommendations, and retrieval-augmented generation for LLMs, necessitating high-quality training data. However, the limited scalability of manual data curation prompts the need for automated methods to ensure data integrity. Traditional unsupervised triplet mining automates training data generation, crucial for embedding model training, yet inadvertently injects biases and noise, thereby degrading model performance. Addressing this, we introduce GISTEmbed, a novel strategy that enhances in-batch negative selection during contrastive training through a guide model. This approach departs from reliance on random sampling and the assumption that all in-batch negatives are equally useful, significantly reducing noise from data quality issues and improving model fine-tuning. Benchmarked against the Massive Text Embedding Benchmark (MTEB), GISTEmbed shows consistent performance improvements across various model sizes and achieves state-of-the-art results in select categories. The framework enables significant enhancements for smaller models by leveraging the capabilities of powerful yet resource-intensive large models. GISTEmbed can potentially revolutionize the creation of highly efficient, smaller models, democratizing access to advanced AI technologies. By making these technologies more accessible and cost-effective, especially for resource-constrained applications, it expands the impact and reach of state-of-the-art AI solutions across diverse sectors.
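In the terms the abstract uses, the guide model scores each query against the other in-batch passages, and candidates the guide rates at least as relevant as the annotated positive are withheld from the contrastive loss as likely false negatives. Below is a minimal PyTorch sketch of that guided masking idea; the function name, arguments, and the specific thresholding rule (mask whenever the guide similarity meets or exceeds the guide's query-positive score) are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def guided_in_batch_contrastive_loss(query_emb, pos_emb,
                                      guide_query_emb, guide_pos_emb,
                                      temperature=0.05):
    """Sketch of guided in-batch negative selection (names are illustrative).

    query_emb / pos_emb:             (B, d) embeddings from the model being fine-tuned
    guide_query_emb / guide_pos_emb: (B, d') embeddings of the same texts from a
                                     frozen, stronger guide model
    """
    # Cosine similarities from the model being trained: (B, B)
    sim = F.normalize(query_emb, dim=-1) @ F.normalize(pos_emb, dim=-1).T

    # Guide-model similarities for the same query/passage grid: (B, B)
    guide_sim = F.normalize(guide_query_emb, dim=-1) @ F.normalize(guide_pos_emb, dim=-1).T

    # Guide's score for each query's annotated positive (diagonal), shape (B, 1)
    guide_pos_score = guide_sim.diagonal().unsqueeze(1)

    # Mask in-batch "negatives" the guide scores at least as high as the positive:
    # these are likely false negatives or near-duplicates, not useful negatives.
    false_neg_mask = guide_sim >= guide_pos_score
    false_neg_mask.fill_diagonal_(False)          # never mask the positive itself
    sim = sim.masked_fill(false_neg_mask, float("-inf"))

    # Standard in-batch (InfoNCE-style) cross-entropy over the remaining candidates
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim / temperature, labels)
```

In such a setup the guide would be a larger, frozen embedding model used only to produce the mask, while gradients flow only through the smaller model's `query_emb` and `pos_emb`.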
- Exploring the impact of negative samples of contrastive learning: A case study of sentence embedding. arXiv preprint arXiv:2202.13093, 2022. URL https://arxiv.org/abs/2202.13093.
- ChatEval: Towards better LLM-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201, 2023. URL https://arxiv.org/abs/2308.07201.
- Improving contrastive learning of sentence embeddings from AI feedback. arXiv preprint arXiv:2305.01918, 2023. URL https://arxiv.org/abs/2305.01918.
- SimCSE: Simple contrastive learning of sentence embeddings. In Moens, M.-F., Huang, X., Specia, L., and Yih, S. W.-t. (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 6894–6910, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.552. URL https://aclanthology.org/2021.emnlp-main.552.
- Efficient natural language response suggestion for smart reply. arXiv preprint arXiv:1705.00652, 2017. URL https://arxiv.org/abs/1705.00652.
- Learning distributed representations of sentences from unlabelled data. In Knight, K., Nenkova, A., and Rambow, O. (eds.), Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1367–1377, San Diego, California, June 2016. Association for Computational Linguistics. doi: 10.18653/v1/N16-1162. URL https://aclanthology.org/N16-1162.
- WhiteningBERT: An easy unsupervised sentence embedding approach. arXiv preprint arXiv:2104.01767, 2021. URL https://arxiv.org/abs/2104.01767.
- Large language models can self-improve. In Bouamor, H., Pino, J., and Bali, K. (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 1051–1068, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.67. URL https://aclanthology.org/2023.emnlp-main.67.
- Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906, 2020. URL https://arxiv.org/abs/2004.04906.
- Skip-thought vectors. In Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. URL https://proceedings.neurips.cc/paper_files/paper/2015/file/f442d33fa06832082290ad8544a8da27-Paper.pdf.
- Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
- AnglE-optimized text embeddings. arXiv preprint arXiv:2309.12871, 2023. URL https://arxiv.org/abs/2309.12871.
- 2D Matryoshka sentence embeddings. arXiv preprint arXiv:2402.14776, 2024. URL https://arxiv.org/abs/2402.14776.
- Towards general text embeddings with multi-stage contrastive learning. arXiv preprint arXiv:2308.03281, 2023. URL https://arxiv.org/abs/2308.03281.
- Universal text representation from BERT: An empirical study. arXiv preprint arXiv:1910.07973, 2019. URL https://arxiv.org/abs/1910.07973.
- Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013. URL https://arxiv.org/abs/1301.3781.
- Muennighoff, N. SGPT: GPT sentence embeddings for semantic search. arXiv preprint arXiv:2202.08904, 2022. URL https://arxiv.org/abs/2202.08904.
- MTEB: Massive text embedding benchmark. arXiv preprint arXiv:2210.07316, 2022. doi: 10.48550/ARXIV.2210.07316. URL https://arxiv.org/abs/2210.07316.
- Large dual encoders are generalizable retrievers. arXiv preprint arXiv:2112.07899, 2021. URL https://arxiv.org/abs/2112.07899.
- Embedding-based news recommendation for millions of users. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1933–1942, 2017.
- Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018. URL https://arxiv.org/abs/1807.03748.
- Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084, 2019. URL https://arxiv.org/abs/1908.10084.
- V-measure: A conditional entropy-based external cluster evaluation measure. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 410–420, 2007.
- One embedder, any task: Instruction-finetuned text embeddings. arXiv preprint arXiv:2212.09741, 2022. URL https://arxiv.org/abs/2212.09741.
- ESimCSE: Enhanced sample building method for contrastive learning of unsupervised sentence embedding. In Calzolari, N., Huang, C.-R., Kim, H., Pustejovsky, J., Wanner, L., Choi, K.-S., Ryu, P.-M., Chen, H.-H., Donatelli, L., Ji, H., Kurohashi, S., Paggio, P., Xue, N., Kim, S., Hahm, Y., He, Z., Lee, T. K., Santus, E., Bond, F., and Na, S.-H. (eds.), Proceedings of the 29th International Conference on Computational Linguistics, pp. 3898–3907, Gyeongju, Republic of Korea, October 2022. International Committee on Computational Linguistics. URL https://aclanthology.org/2022.coling-1.342.
- C-pack: Packaged resources to advance general Chinese embedding. arXiv preprint arXiv:2309.07597, 2023. URL https://arxiv.org/abs/2309.07597.
- Contrastive learning models for sentence representations. ACM Transactions on Intelligent Systems and Technology, 14(4):1–34, 2023.
- ConSERT: A contrastive framework for self-supervised sentence representation transfer. In Zong, C., Xia, F., Li, W., and Navigli, R. (eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 5065–5075, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.393. URL https://aclanthology.org/2021.acl-long.393.
- Retrieve anything to augment large language models. arXiv preprint arXiv:2310.07554, 2023. URL https://arxiv.org/abs/2310.07554.
- Debiased contrastive learning of unsupervised sentence representations. In Muresan, S., Nakov, P., and Villavicencio, A. (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 6120–6130, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.423. URL https://aclanthology.org/2022.acl-long.423.
- Aivin V. Solatorio