Data-CUBE: Data Curriculum for Instruction-based Sentence Representation Learning (2401.03563v1)
Abstract: Recently, multi-task instruction tuning has been applied to sentence representation learning, endowing models with the ability to generate task-specific representations under the guidance of task instructions and exhibiting strong generalization to new tasks. However, these methods mostly neglect potential interference across different tasks and instances, which may affect the training and convergence of the model. To address this, we propose a data curriculum method, Data-CUBE, that arranges the order of all multi-task training data to minimize interference risks from both views. At the task level, we aim to find the optimal task order that minimizes the total cross-task interference risk, which is exactly the traveling salesman problem; hence we use a simulated annealing algorithm to find a solution. At the instance level, we measure the difficulty of all instances in each task and then divide them into easy-to-difficult mini-batches for training. Experiments on MTEB sentence representation evaluation tasks show that our approach can boost the performance of state-of-the-art methods. Our code and data are publicly available at https://github.com/RUCAIBox/Data-CUBE.
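The two curriculum levels described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The pairwise `risk` matrix, the swap neighbourhood, the geometric cooling schedule, the open-path cost (no return to the first task), and the `difficulty` callable are all assumptions made for illustration, since the abstract does not specify how interference or difficulty is measured.

```python
import math
import random


def order_cost(order, risk):
    """Total interference risk between consecutive tasks in the order."""
    return sum(risk[a][b] for a, b in zip(order, order[1:]))


def anneal_task_order(risk, steps=5000, t_start=1.0, t_end=1e-3, seed=0):
    """Simulated annealing over task permutations (hypothetical settings)."""
    rng = random.Random(seed)
    n = len(risk)
    order = list(range(n))
    if n < 2:
        return order, 0
    rng.shuffle(order)
    cost = order_cost(order, risk)
    best, best_cost = order[:], cost
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temp = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        # Propose a neighbour by swapping two positions.
        i, j = rng.sample(range(n), 2)
        order[i], order[j] = order[j], order[i]
        new_cost = order_cost(order, risk)
        delta = new_cost - cost
        # Always accept improvements; accept worse moves with prob. exp(-delta/T).
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = order[:], cost
        else:
            order[i], order[j] = order[j], order[i]  # revert the swap
    return best, best_cost


def easy_to_difficult_batches(instances, difficulty, batch_size):
    """Sort one task's instances by difficulty, then cut into mini-batches."""
    ranked = sorted(instances, key=difficulty)
    return [ranked[k:k + batch_size] for k in range(0, len(ranked), batch_size)]


if __name__ == "__main__":
    # Toy symmetric interference matrix for four tasks (made-up numbers).
    risk = [[0, 2, 9, 4], [2, 0, 3, 8], [9, 3, 0, 1], [4, 8, 1, 0]]
    print(anneal_task_order(risk))
    # String length stands in for an instance difficulty score.
    print(easy_to_difficult_batches(["ccc", "a", "bb", "dddd", "ee"], len, 2))
```

Simulated annealing is a heuristic, so the returned order is not guaranteed to be optimal. A full training loop would visit the tasks in the returned order and, within each task, feed the batches from easiest to most difficult.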