- The paper introduces Contrastive Sparse Representation (CSR), a novel method applying sparse coding to pre-trained embeddings for adaptive representation learning.
- CSR achieves higher accuracy and faster retrieval than Matryoshka Representation Learning (MRL) across modalities, preserving performance with sparse vectors that activate only a few neurons.
- The CSR method combines lightweight sparse autoencoding with contrastive objectives, enabling efficient training without requiring full network retraining, unlike MRL.
The paper introduces Contrastive Sparse Representation (CSR), a novel method for adaptive representation learning that leverages sparse coding to enhance the efficiency and fidelity of deep embeddings. CSR addresses the limitations of Matryoshka Representation Learning (MRL), which requires full model retraining and suffers from performance degradation at shorter embedding lengths.
CSR aims to sparsify pre-trained embeddings into a high-dimensional, selectively activated feature space, preserving semantic quality while enabling cost-effective inference at various sparsity levels. The method combines lightweight autoencoding with task-aware contrastive objectives, achieving superior accuracy and retrieval speed compared to MRL, while significantly reducing training time.
Here's a breakdown of the key aspects and claims:
- Adaptive Representation via Sparse Coding: The paper posits sparse coding as a compelling alternative to MRL for achieving adaptive representation. Instead of truncating representation length, CSR employs sparse vectors and sparse matrix factorization to attain computational efficiency. The core idea is to sparsify a full representation, activating only a subset of neurons (K) relevant to the task. The authors claim that even a small number of activated neurons (e.g., 4 to 16) can preserve the performance of a much longer dense representation (e.g., 2048 dimensions), contrasting with MRL embeddings that exhibit substantial quality deterioration at such short lengths.
- Efficiency and Fidelity: By using sparse vector formats, CSR facilitates efficient storage and retrieval, with a complexity order of O(K), where K is the number of activated neurons. The authors argue that MRL requires a longer representation length to attain similar accuracy, resulting in slower inference speeds.
- No Retraining Requirement: A notable advantage of CSR is that it eliminates the need to retrain the entire network. CSR leverages recent advances in training sparse autoencoders (SAEs) to train a lightweight 2-layer Multilayer Perceptron (MLP) module for sparsifying pre-trained embeddings. The paper states this process can be completed in a fraction of the time required by MRL.
- Contrastive Sparse Representation Learning: The method combines contrastive retrieval and reconstructive autoencoding objectives to preserve feature semantics and optimize performance for retrieval tasks.
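The O(K) storage-and-retrieval point above can be made concrete with a minimal standalone sketch (not the paper's implementation) of a dot product between two TopK-sparse embeddings stored as index/value pairs; all dimensions and values below are illustrative:

```python
import numpy as np

# Toy sketch: each embedding keeps only its K active neurons as parallel
# index/value arrays, so a similarity score touches at most K entries
# instead of all h hidden dimensions -- O(K) work per comparison.
def sparse_dot(idx_a, val_a, idx_b, val_b):
    """Dot product of two sparse vectors given as index/value pairs."""
    lookup = dict(zip(idx_a.tolist(), val_a.tolist()))  # K entries
    return sum(lookup.get(j, 0.0) * v
               for j, v in zip(idx_b.tolist(), val_b.tolist()))

# Two sparse codes in a hypothetical hidden space of h = 8192 with K = 4.
idx_a, val_a = np.array([3, 100, 2048, 7000]), np.array([0.5, 1.0, 0.2, 0.8])
idx_b, val_b = np.array([100, 512, 7000, 8000]), np.array([2.0, 0.3, 0.1, 0.9])
score = sparse_dot(idx_a, val_a, idx_b, val_b)  # shared indices: 100 and 7000
```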
- Methodology:
- The CSR framework projects a pre-trained embedding v ∈ R^d into a sparse representation space R^h, where d is the dimension of the pre-trained embedding and h is the dimension of the hidden space.
- The hidden space is regularized using a reconstruction-based sparse compression loss.
- A non-negative contrastive loss is introduced to expand model capacity, supported by theoretical motivations.
- The method uses sparse autoencoders (SAEs) to reduce data size by preserving only the most essential components.
The encoder and decoder of the SAE are defined as:
z_k = TopK(W_enc (f(x) − b_pre) + b_enc)
f̂(x)_k = W_dec z_k + b_pre
where:
- z_k is the sparse representation that keeps only the top K activations.
- W_enc ∈ R^{h×d} is the encoder weight matrix.
- f(x) ∈ R^d is the input embedding vector.
- b_pre ∈ R^d is a bias vector.
- b_enc ∈ R^h is the encoder bias vector.
- W_dec ∈ R^{d×h} is the decoder weight matrix.
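The encoder and decoder can be sketched in a few lines of numpy; this is a hedged toy implementation (the dimensions, the random initialization, and the ReLU applied to the surviving activations are illustrative assumptions, not the paper's exact recipe):

```python
import numpy as np

# Toy TopK SAE matching the definitions above. Dimensions and the random
# initialization are illustrative assumptions.
d, h, K = 16, 64, 4
rng = np.random.default_rng(0)
W_enc = rng.standard_normal((h, d)) * 0.1  # encoder weights, R^{h x d}
W_dec = rng.standard_normal((d, h)) * 0.1  # decoder weights, R^{d x h}
b_pre = np.zeros(d)                        # pre-bias b_pre in R^d
b_enc = np.zeros(h)                        # encoder bias b_enc in R^h

def encode(fx, k=K):
    """z_k = TopK(W_enc (f(x) - b_pre) + b_enc): keep the k largest activations."""
    pre_act = W_enc @ (fx - b_pre) + b_enc
    z = np.zeros_like(pre_act)
    top = np.argsort(pre_act)[-k:]          # indices of the k largest entries
    z[top] = np.maximum(pre_act[top], 0.0)  # assumed non-negative latents
    return z

def decode(z):
    """f_hat(x)_k = W_dec z_k + b_pre."""
    return W_dec @ z + b_pre

z = encode(rng.standard_normal(d))          # at most K non-zero entries
```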
- A loss function is formulated as:
L(k) = ||f(x) − f̂(x)_k||_2^2
where:
- L(k) is the reconstruction error measured with the top k latents.
- f̂(x)_k is the embedding reconstructed from the top k latents.
- An auxiliary loss L_aux and a Multi-TopK loss are proposed to mitigate the problem of "dead latents", where an increasing number of latent dimensions remain inactive during training. The overall reconstruction loss is:
L_recon = L(k) + L(k_aux)/8 + β·L_aux
where:
- L_aux = ||e − ê||_2^2
- e = f(x) − f̂(x) is the reconstruction error of the main pass.
- ê = W_dec z_aux is its reconstruction using the top-k_aux dead latents.
- k_aux = 512 by default.
- β = 1/32 by default.
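A minimal sketch of how the terms of the reconstruction loss combine, assuming the three reconstructions (the TopK pass, the k_aux pass, and the dead-latent reconstruction of the residual) have already been computed by the SAE; the function below only illustrates the arithmetic:

```python
import numpy as np

# Illustrative arithmetic of L_recon = L(k) + L(k_aux)/8 + beta * L_aux.
# The three reconstructions are assumed to come from the SAE's TopK pass,
# its k_aux pass, and the dead latents respectively.
def recon_loss(fx, fx_hat_k, fx_hat_kaux, e_hat_aux, beta=1 / 32):
    L_k = np.sum((fx - fx_hat_k) ** 2)        # L(k): main TopK reconstruction
    L_kaux = np.sum((fx - fx_hat_kaux) ** 2)  # L(k_aux): Multi-TopK term
    e = fx - fx_hat_k                         # residual of the main pass
    L_aux = np.sum((e - e_hat_aux) ** 2)      # dead latents fit the residual
    return L_k + L_kaux / 8 + beta * L_aux
```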
- A contrastive loss objective is formulated as:
L_ncl = (1/B) Σ_{i=1}^{B} −log [ exp(z_i · z_i) / Σ_{j≠i} exp(z_i · z_j) ]
where:
- L_ncl is the non-negative contrastive loss and B is the batch size.
- z_i is the sparse latent of the i-th sample in the batch.
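The contrastive term can be sketched in numpy as follows; this is an illustrative reading of the formula above, not the paper's training code:

```python
import numpy as np

# Illustrative numpy reading of L_ncl: each latent's self-similarity
# exp(z_i . z_i) is contrasted against its similarity to the other
# latents in the batch.
def ncl_loss(Z):
    """Z has shape (B, h): one sparse latent per row."""
    B = Z.shape[0]
    S = Z @ Z.T                          # pairwise similarities z_i . z_j
    pos = np.exp(np.diag(S))             # exp(z_i . z_i)
    neg = (np.exp(S) * (1.0 - np.eye(B))).sum(axis=1)  # sum over j != i
    return float(np.mean(-np.log(pos / neg)))
```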
- The final loss function is:
L_total = L_recon + γ·L_ncl
where:
- γ is a hyperparameter that balances the two loss components and is set to 1 by default.
- Experimental Results and Claims: The paper presents extensive experiments on image, text, and multimodal benchmarks.
- ImageNet Classification: CSR is reported to outperform MRL by 17% on ImageNet classification under the same compute budget. It also delivers a 69x speedup on ImageNet-1k 1-Nearest-Neighbor (1-NN) retrieval without compromising performance relative to quantization-based approaches.
- MTEB Text Retrieval: CSR outperforms MRL by 15% on the MTEB benchmark under the same compute budget.
- MS COCO Retrieval: CSR outperforms MRL by 7% on MS COCO retrieval under the same compute budget.
- Retrieval Time Analysis: The experiments analyze the impact of hidden dimension, database size, and sparsity on retrieval efficiency. The results suggest that higher sparsity enables more effective utilization of sparse matrix operations, especially for large-scale embeddings.
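The sparse-matrix-multiplication point above can be illustrated with scipy; the database size, hidden dimension, and sparsity level below are arbitrary toy values, not settings from the paper:

```python
import numpy as np
from scipy import sparse

# Toy retrieval over a sparse database: N embeddings in an h-dimensional
# hidden space, K active neurons each. All sizes are arbitrary.
N, h, K = 1000, 4096, 8
rng = np.random.default_rng(0)

rows = np.repeat(np.arange(N), K)
cols = rng.integers(0, h, size=N * K)
vals = rng.random(N * K)
db = sparse.csr_matrix((vals, (rows, cols)), shape=(N, h))

q_cols = rng.integers(0, h, size=K)
query = sparse.csr_matrix((rng.random(K), (np.zeros(K, dtype=int), q_cols)),
                          shape=(1, h))

# One sparse matrix-vector product scores every database item; the cost
# scales with the number of stored non-zeros, not with N * h.
scores = (db @ query.T).toarray().ravel()
top5 = np.argsort(scores)[-5:][::-1]     # best-matching database indices
```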
- Effect of Input Embedding Dimension: The paper finds that as the input embedding dimension increases, performance degradation diminishes.
- Effect of Hidden Representation Dimension: The experiments reveal that performance peaks at a hidden dimension roughly 4x the input dimension (h=4d), with performance degrading beyond this point.
- Ablation Studies: Ablation studies demonstrate the impact of different loss functions on model capacity, particularly in addressing the dead latents problem.
- Benchmark Results: The paper evaluates CSR across vision, language, and vision+language modalities, comparing it against state-of-the-art efficient embedding models. The results indicate that CSR consistently outperforms MRL and its variants in terms of accuracy and efficiency.
- For vision representation, CSR matches the full representation of ResNet50 with only a slight decrease in 1-NN accuracy, demonstrating that CSR compresses pre-trained embeddings effectively while exploiting sparse matrix multiplication.
- For text representation, CSR retains the strong performance of the pre-trained model and surpasses baselines under varying resource constraints, achieving a 61x speedup at matched retrieval performance and a 15% performance improvement at matched computational cost.
- For multimodal representation, CSR achieves average performance gains of 4.6% and 6.8% on image-to-text retrieval, and 9.1% and 6.5% on text-to-image retrieval compared to MRL across the MS COCO and Flickr30K datasets.
In conclusion, the paper introduces CSR as a promising alternative to MRL for adaptive representation learning, offering advantages in fidelity, retrieval cost, and training cost. The method combines sparse coding with contrastive learning and reconstructive autoencoding objectives, achieving competitive performance across different tasks and modalities with significantly lower computational costs.