Piccolo2: General Text Embedding with Multi-task Hybrid Loss Training (2405.06932v1)

Published 11 May 2024 in cs.CL and cs.AI

Abstract: In this report, we introduce Piccolo2, an embedding model that surpasses other models in the comprehensive evaluation over 6 tasks on CMTEB benchmark, setting a new state-of-the-art. Piccolo2 primarily leverages an efficient multi-task hybrid loss training approach, effectively harnessing textual data and labels from diverse downstream tasks. In addition, Piccolo2 scales up the embedding dimension and uses MRL training to support more flexible vector dimensions. The latest information of piccolo models can be accessed via: https://huggingface.co/sensenova/

Summary

  • The paper introduces a multi-task hybrid loss training method that effectively manages retrieval, classification, and semantic similarity tasks.
  • It scales embedding dimensions from 768 to 1792 and employs Matryoshka Representation Learning for flexible, robust performance.
  • The model achieves superior results on the CMTEB benchmark, demonstrating its strong capability in clustering and nuanced text similarity.

Understanding Piccolo2: Advancements in Chinese Text Embeddings

Introduction to Text Embeddings

Text embeddings are a cornerstone of NLP. They map text into dense numerical vectors that machines can operate on, preserving semantic meaning in a lower-dimensional space. Because they underpin applications such as sentiment analysis, semantic search, and document retrieval, high-quality embeddings are crucial for handling and processing language data effectively.
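
As a concrete illustration, the minimal sketch below uses the open-source sentence-transformers library to encode sentences and compare them by cosine similarity. The checkpoint name is the earlier piccolo-large-zh model and serves only as a placeholder here; any sentence-embedding checkpoint would be used the same way.

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder checkpoint: any sentence-embedding model id works here.
model = SentenceTransformer("sensenova/piccolo-large-zh")

sentences = [
    "How do I reset my password?",
    "Steps to recover a forgotten password",
    "Best noodle restaurants in Shanghai",
]
embeddings = model.encode(sentences)  # one dense vector per sentence

print(util.cos_sim(embeddings[0], embeddings[1]))  # high: related meanings
print(util.cos_sim(embeddings[0], embeddings[2]))  # low: unrelated topics
```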

Piccolo2 advances this area by focusing on multi-task training and on optimizing performance across heterogeneous tasks. Its training methodology is explicitly multi-task, in contrast to models that apply a single objective regardless of the downstream task.

Training Enhancements in Piccolo2

Multi-Task Hybrid Loss Training

Piccolo2 replaces the standard single-objective training process with a multi-task hybrid loss. Retrieval, classification, and sentence-similarity tasks come with different label formats and optimization demands, and the hybrid loss assigns each task family an objective suited to it.

  • Retrieval Tasks: Following standard practice, Piccolo2 uses the InfoNCE loss with in-batch negatives, so that each query learns to rank its paired document above the other documents in the batch.
  • Semantic Textual Similarity (STS) and Pair Classification: These tasks come with fine-grained labels in which the relative ordering of pairs matters. Piccolo2 employs the CoSENT loss, a ranking-style objective designed for such graded labels, which yields a clear improvement on STS tasks.
  • Clustering and Classification: Label information is converted into a contrastive format, pairing each text with its correct category as a positive and other categories as negatives, so the same contrastive machinery can be reused for labeled data.

This hybrid loss approach enables Piccolo2 to perform robustly across a broad spectrum of tasks, as demonstrated by its leading results on the CMTEB benchmark.
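
To make the two core objectives concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the temperature, the CoSENT scale factor, and the function names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def infonce_loss(q, d_pos, temperature=0.05):
    """In-batch InfoNCE for retrieval-style data: row i of d_pos is the
    positive for query i; every other row in the batch acts as a negative."""
    q = F.normalize(q, dim=-1)
    d_pos = F.normalize(d_pos, dim=-1)
    logits = q @ d_pos.T / temperature            # (B, B) cosine similarities
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

def cosent_loss(u, v, gold, scale=20.0):
    """CoSENT-style ranking loss for graded STS labels: any pair with a higher
    gold score should also receive a higher cosine similarity."""
    cos = F.cosine_similarity(u, v, dim=-1) * scale   # (N,) scaled cosines
    # violation[i, j] = cos[j] - cos[i], kept only where gold[i] > gold[j]
    diff = cos.unsqueeze(0) - cos.unsqueeze(1)
    mask = gold.unsqueeze(1) > gold.unsqueeze(0)
    terms = torch.cat([torch.zeros(1, device=u.device), diff[mask]])
    return torch.logsumexp(terms, dim=0)              # log(1 + sum exp(violations))

# A hybrid training step would then pick the objective per batch, e.g.
# loss = infonce_loss(q, d) for retrieval batches and
# loss = cosent_loss(u, v, gold) for STS batches.
```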

Dimension Scaling and MRL Training

Piccolo2 also scales up the embedding dimension from 768 to 1792, expanding the capacity of each vector to carry information.

On top of this, Matryoshka Representation Learning (MRL) is applied so that truncated prefixes of the embedding remain useful. This supports vectors of variable length and offers flexibility in deployments where computational resources or latency constraints vary; the model retains strong performance even at reduced dimensionality.
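
A minimal sketch of MRL-style training in PyTorch is shown below; the nested dimension list and the temperature are illustrative assumptions, and the loss reuses the in-batch InfoNCE idea from the previous sketch.

```python
import torch
import torch.nn.functional as F

def matryoshka_infonce(q, d_pos, dims=(256, 512, 768, 1024, 1792), temperature=0.05):
    """Apply the same in-batch contrastive loss to nested prefixes of the
    embedding, so truncated vectors stay usable at inference time."""
    labels = torch.arange(q.size(0), device=q.device)
    losses = []
    for k in dims:
        qk = F.normalize(q[:, :k], dim=-1)    # truncate, then re-normalize
        dk = F.normalize(d_pos[:, :k], dim=-1)
        logits = qk @ dk.T / temperature
        losses.append(F.cross_entropy(logits, labels))
    return torch.stack(losses).mean()

# At inference, a shorter vector is obtained the same way:
# emb_512 = F.normalize(full_emb[:, :512], dim=-1)
```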

Data Strategy and Benchmark Performance

Data Synthesis and Hard Negative Mining

Piccolo2's training pipeline leverages both synthetic data generation and hard negative mining. Exposing the model to synthesized examples and to strategically sampled hard negatives broadens the range of training scenarios it sees and sharpens its ability to discern subtle differences in similarity and relevance, making its performance more robust.
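
The report does not spell out the exact mining procedure, so the sketch below is a hypothetical NumPy example of one common recipe: rank the corpus with an existing embedding model, skip the gold positive and the very top ranks (likely false negatives), and sample negatives from a mid-rank window. All parameter values are illustrative.

```python
import numpy as np

def mine_hard_negatives(query_emb, corpus_emb, gold_ids,
                        top_k=100, window=(10, 50), n_neg=7, seed=0):
    """Sample hard negatives for each query from a mid-rank window of an
    existing retriever's ranking, excluding the gold positive document."""
    rng = np.random.default_rng(seed)
    # Rows are assumed L2-normalized, so the dot product is cosine similarity.
    scores = query_emb @ corpus_emb.T
    negatives = []
    for qi, gold in enumerate(gold_ids):
        ranked = np.argsort(-scores[qi])[:top_k]
        candidates = [int(c) for c in ranked[window[0]:window[1]] if c != gold]
        if candidates:
            picked = rng.choice(candidates, size=min(n_neg, len(candidates)),
                                replace=False).tolist()
        else:
            picked = []
        negatives.append(picked)
    return negatives
```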

Benchmarking Against CMTEB

When assessed on the CMTEB benchmark, which evaluates models across six task types, Piccolo2 achieved the best overall results, with particularly strong gains on classification and clustering tasks. These results support the effectiveness of the multi-task hybrid loss and the utility of hard negative mining.

Future Directions and Conclusions

Piccolo2's success on the CMTEB benchmark is just the starting point. With its flexible, high-capacity embedding and multi-task oriented training approach, it sets a new standard for text embedding models, especially in handling Chinese language data.

Potential future work could integrate unsupervised or self-supervised objectives to further refine embedding quality without heavy reliance on labeled data. Extending these methodologies to other languages could also broaden Piccolo2's applicability beyond Chinese.

In conclusion, Piccolo2 represents a significant step forward in text embedding technology, providing a robust, scalable solution tailored for the intricate demands of multiple NLP tasks. Its development not only enhances performance metrics but also broadens the potential for real-world applications of Chinese NLP technologies.
