Piccolo2: General Text Embedding with Multi-task Hybrid Loss Training (2405.06932v1)
Abstract: In this report, we introduce Piccolo2, an embedding model that surpasses other models in a comprehensive evaluation over six tasks on the CMTEB benchmark, setting a new state of the art. Piccolo2 primarily leverages an efficient multi-task hybrid-loss training approach, effectively harnessing textual data and labels from diverse downstream tasks. In addition, Piccolo2 scales up the embedding dimension and uses Matryoshka Representation Learning (MRL) training to support more flexible vector dimensions. The latest information on Piccolo models can be accessed at: https://huggingface.co/sensenova/
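As a rough illustration of the multi-task hybrid-loss idea, the sketch below routes each training batch to an objective suited to its task family: an InfoNCE loss with in-batch negatives for retrieval-style data, and a CoSENT-style ranking loss for scored sentence pairs. This is a minimal PyTorch sketch, not the authors' released code; the task routing, temperature, and scale values are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q, p, temperature=0.05):
    """In-batch-negative contrastive loss (InfoNCE).

    q: (B, D) query embeddings, p: (B, D) positive embeddings.
    Each query's positive is the same-index row of p; every other
    row in the batch serves as a negative.
    """
    q = F.normalize(q, dim=-1)
    p = F.normalize(p, dim=-1)
    logits = q @ p.T / temperature                     # (B, B) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)  # diagonal = positives
    return F.cross_entropy(logits, labels)

def cosent_loss(a, b, scores, scale=20.0):
    """CoSENT-style ranking loss for scored sentence pairs.

    Penalizes every ordered pair (i, j) where pair i is labeled more
    similar than pair j but receives a lower cosine similarity.
    """
    cos = F.cosine_similarity(a, b, dim=-1) * scale  # (B,)
    # diff[i, j] = cos[j] - cos[i]; keep only entries where pair i
    # should outrank pair j according to the gold scores.
    diff = cos.unsqueeze(0) - cos.unsqueeze(1)
    mask = scores.unsqueeze(1) > scores.unsqueeze(0)
    diff = diff[mask]
    # logsumexp over [0, diff...] equals log(1 + sum(exp(diff))).
    return torch.logsumexp(
        torch.cat([torch.zeros(1, device=cos.device), diff]), dim=0
    )

def hybrid_loss(task_type, batch):
    """Dispatch a batch to the loss matching its task family
    (the task taxonomy here is an assumption for illustration)."""
    if task_type in ("retrieval", "reranking"):
        return info_nce_loss(batch["query"], batch["positive"])
    if task_type in ("sts", "pair_classification"):
        return cosent_loss(batch["emb_a"], batch["emb_b"], batch["score"])
    raise ValueError(f"unknown task type: {task_type}")
```

MRL training can then be layered on top of such an objective: the same loss is applied to several nested prefixes of the embedding so that truncated vectors remain usable at inference time. The prefix dimensions below are likewise assumptions, not the paper's configuration.

```python
def mrl_loss(q, p, dims=(256, 512, 1024, 1792), temperature=0.05):
    """Matryoshka Representation Learning: average the contrastive
    loss over nested prefixes of the embedding, so a truncated
    vector (e.g. the first 256 dims) is still a useful embedding."""
    losses = [info_nce_loss(q[:, :d], p[:, :d], temperature) for d in dims]
    return torch.stack(losses).mean()

# Usage with random stand-in embeddings (1792 is an assumed full dimension):
q, p = torch.randn(8, 1792), torch.randn(8, 1792)
print(mrl_loss(q, p))
```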