- The paper presents an innovative distillation method that compresses large text embedding models while preserving retrieval accuracy.
- It employs three tailored loss functions (cosine, similarity, and triplet losses) to train effectively on unannotated corpora.
- The models achieve competitive MTEB scores and extend to multimodal encoding, demonstrating practical applications in image-text integration.
Overview of "Jasper and Stella: Distillation of SOTA Embedding Models"
The paper contributes to dense retrieval by optimizing text embedding models through distillation. It introduces two models, Jasper and Stella, whose refined distillation approach achieves an efficient trade-off between model size and performance, and it details four primary techniques that underpin their effectiveness.
Reducing Model Dimensionality through Distillation
The researchers address a common obstacle to deploying state-of-the-art (SOTA) text embedding models in industry, namely their large size and high vector dimensionality, through a distillation process inspired by Matryoshka Representation Learning, in which a more compact student model learns from larger teacher models without substantial loss in performance. The distillation relies on three purpose-designed loss functions (cosine loss, similarity loss, and triplet loss) and trains the student to produce usable vectors at several reduced dimensions. Notably, the approach requires no supervised data, so it can be applied to vast, unannotated text corpora.
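To make the role of these losses concrete, here is a minimal PyTorch-style sketch of how they might be combined during distillation. The function names, the margin value, the prefix dimensions, and the choice to apply only the dimension-agnostic losses to truncated vectors are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def cosine_loss(student, teacher):
    # Pull each student embedding toward the direction of the matching
    # teacher embedding (assumes the student's full output dimension
    # matches the teacher's, e.g. via a projection head).
    return (1 - F.cosine_similarity(student, teacher, dim=-1)).mean()

def similarity_loss(student, teacher):
    # Match the in-batch similarity structure: the student's pairwise
    # cosine-similarity matrix should mimic the teacher's.
    s = F.normalize(student, dim=-1)
    t = F.normalize(teacher, dim=-1)
    return F.mse_loss(s @ s.T, t @ t.T)

def triplet_loss(student, teacher, margin=0.015):
    # Preserve the teacher's in-batch ranking: for each anchor, the
    # teacher's nearest neighbour should stay closer (by a margin) than
    # the next-nearest one in the student space. Illustrative only.
    with torch.no_grad():
        t = F.normalize(teacher, dim=-1)
        t_sim = t @ t.T
        t_sim.fill_diagonal_(float("-inf"))
        pos_idx = t_sim.argmax(dim=-1)              # teacher's closest item
        t_sim[torch.arange(t_sim.size(0)), pos_idx] = float("-inf")
        neg_idx = t_sim.argmax(dim=-1)              # second closest item
    s = F.normalize(student, dim=-1)
    s_sim = s @ s.T
    rows = torch.arange(s_sim.size(0))
    return F.relu(s_sim[rows, neg_idx] - s_sim[rows, pos_idx] + margin).mean()

def distillation_loss(student, teacher, prefix_dims=(256, 1024)):
    # Full-dimension losses, plus Matryoshka-style training of truncated
    # prefixes so that shorter embeddings remain usable on their own.
    # Only the dimension-agnostic losses are applied to the prefixes here.
    loss = (cosine_loss(student, teacher)
            + similarity_loss(student, teacher)
            + triplet_loss(student, teacher))
    for d in prefix_dims:
        prefix = student[:, :d]
        loss = loss + similarity_loss(prefix, teacher) + triplet_loss(prefix, teacher)
    return loss
```

Because all three objectives use only the teacher's embeddings of plain text as supervision, the same training loop can run over unannotated corpora.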
Multimodal Encoding Capabilities
Another key contribution is the extension of Jasper to multimodal inputs, specifically the alignment of image and text embeddings. The multimodal encoder is trained on image-caption datasets, enabling the model to process visual as well as textual data and broadening its utility in applications that integrate the two.
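As a rough illustration of this kind of alignment (not the paper's actual training code), the text branch can supply the target vector for each image in an image-caption batch; the function name, the frozen text branch, and the single cosine objective below are assumptions made for the sketch.

```python
import torch
import torch.nn.functional as F

def image_text_alignment_loss(image_vecs: torch.Tensor,
                              caption_vecs: torch.Tensor) -> torch.Tensor:
    # `image_vecs`: outputs of a vision encoder projected into the text
    # embedding space; `caption_vecs`: embeddings of the matching captions
    # from the (frozen) text branch. Cosine alignment pulls each image
    # vector toward its caption vector.
    image_vecs = F.normalize(image_vecs, dim=-1)
    caption_vecs = F.normalize(caption_vecs, dim=-1)
    return (1 - (image_vecs * caption_vecs).sum(dim=-1)).mean()

# Example: a batch of 8 image/caption embedding pairs in a 1024-d space.
images = torch.randn(8, 1024)
captions = torch.randn(8, 1024)
print(image_text_alignment_loss(images, captions))
```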
Benchmark Performance on MTEB
The paper reports that Jasper scores highly on the Massive Text Embedding Benchmark (MTEB) leaderboard, excelling in particular among models with fewer than 2 billion parameters. The reported average score of 72.02 across the benchmark's tasks is slightly below NV-Embed-v2 yet still places Jasper among the top contenders, indicating that accuracy is largely preserved despite the reduced dimensionality and model size.
Practical and Theoretical Implications
The distillation techniques presented have implications for both machine learning theory and industrial practice: they enable efficient deployment of retrieval models by cutting computational overhead while maintaining strong performance. This points toward further research on scalable, efficient model distillation, with potential influence on the design and deployment of future embedding models and retrieval systems.
Future Developments in AI
As embedding models continue to evolve, the research suggests several directions for future work. Enhancements to self-distillation could further improve embedding efficiency without supervised data, and refining the multimodal training regime (the visual encoding stage is noted as a candidate for improvement) promises more integrated AI systems that handle both linguistic and visual data seamlessly.
In conclusion, "Jasper and Stella: Distillation of SOTA Embedding Models" effectively addresses critical challenges in deploying high-performance embedding models by utilizing innovative distillation and training techniques, paving the way for more accessible and efficient applications of deep learning models in real-world scenarios.