General Embedding Technique
- General embedding technique is a framework that unifies heterogeneous data into a common representation space to enable robust cross-task performance.
- It employs model merging by fine-tuning task-specific models and combining their parameter displacements through convex optimization to mitigate negative transfer and data imbalance.
- The Self Positioning algorithm optimizes interpolation coefficients on a balanced probe dataset, leading to improved retrieval and semantic similarity outcomes on benchmark evaluations.
A general embedding technique refers to a methodology or computational framework that allows a system or model to map data from heterogeneous sources or tasks into a shared representation space, typically with the goal of supporting high transferability, robust performance across diverse tasks, or efficient multitask or multimodal integration. The development of such techniques is considered fundamental to scalable and universal AI systems, with notable recent focus on general-purpose text embedding models, their training recipes, architectural designs, and approaches for overcoming common obstacles such as task conflict and data imbalance.
1. Task Conflict and Data Imbalance in General-Purpose Embedding Training
A central challenge in general embedding technique design arises when attempting to produce universal representations by jointly training a model on diverse tasks or datasets. In the context of text embedding, this is exemplified by two issues:
- Task Conflict (Negative Transfer): When performing joint contrastive training on multiple tasks, the gradients from different tasks may point in opposing directions, causing destructive interference in parameter updates (illustrated in the sketch below). Empirically, a model trained jointly on, e.g., STS (semantic textual similarity) and retrieval underperforms single-task baselines on both tasks.
- Data Imbalance: When tasks have disproportionate dataset sizes (e.g., one task with 10× more examples than another), gradients from the larger task dominate. This skews the learned parameters, reflected in task vectors that are longer and more aligned with high-data tasks (as measured by vector norms and angles in parameter space), thus introducing a bias that degrades performance on underrepresented tasks.
On the Massive Text Embedding Benchmark (MTEB)—comprising 56 datasets and seven scenario types—these effects manifest as a 1–2 point average loss on retrieval and STS sub-benchmarks for joint-trained models compared to task-specialized models.
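The gradient-collision effect can be inspected directly. The following minimal PyTorch sketch uses a toy linear encoder and placeholder losses (not the paper's models or data) to measure the cosine similarity between per-task gradients; persistently negative values correspond to the destructive interference described above.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(16, 8)  # toy stand-in for an embedding encoder

def flat_grad(loss, model):
    """Flatten the gradients of `loss` w.r.t. all model parameters into one vector."""
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

# Placeholder per-task losses standing in for STS and retrieval contrastive losses.
x_sts, x_ret = torch.randn(32, 16), torch.randn(32, 16)
loss_sts = model(x_sts).pow(2).mean()
loss_ret = (model(x_ret) - 1.0).pow(2).mean()

g_sts = flat_grad(loss_sts, model)
g_ret = flat_grad(loss_ret, model)

# Negative cosine similarity indicates conflicting (destructively interfering) updates.
cos = torch.nn.functional.cosine_similarity(g_sts, g_ret, dim=0)
print(f"gradient cosine similarity: {cos.item():+.3f}")
```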
2. Model Merging: Formal Framework and Interpolation Space
To address these challenges, the model merging framework proceeds as follows:
- Backbone Initialization: Start with a common pretrained backbone with parameters $\theta_0$.
- Task-Specific Training: Independently fine-tune $N$ copies of the model on their respective tasks $T_1, \dots, T_N$, producing final weights $\theta_1, \dots, \theta_N$.
- Task Vectors: For each task, the displacement vector in parameter space is $v_i = \theta_i - \theta_0$.
- Merged Model Parameterization: The merged model is an affine combination within the convex hull of the task vectors (a code sketch follows this subsection):

$$\theta_{\text{merged}} = \theta_0 + \lambda \sum_{i=1}^{N} \alpha_i v_i,$$

with $\alpha_i \ge 0$ and $\sum_{i=1}^{N} \alpha_i = 1$ (so $\alpha$ lies on the $(N-1)$-simplex), and $\lambda$ a positive scaling parameter (regularized to stay near 1 for stability).
The dual-tower encoder uses these merged weights for downstream tasks.
This interpolation space forms a low-dimensional polytope whose geometry reflects the directions and magnitudes of transfer for each task, and is exploited to find an optimal fused model.
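A minimal sketch of this parameterization, assuming all fine-tuned checkpoints share the backbone architecture; the tiny `torch.nn.Linear` stands in for the actual dual-tower encoder, and the "fine-tuned" weights are simulated with random perturbations.

```python
import torch

def task_vector(theta_0, theta_i):
    """Displacement v_i = theta_i - theta_0, computed per parameter tensor."""
    return {k: theta_i[k] - theta_0[k] for k in theta_0}

def merge(theta_0, task_vectors, alpha, lam=1.0):
    """theta_merged = theta_0 + lam * sum_i alpha_i * v_i, with alpha on the simplex."""
    assert all(a >= 0 for a in alpha) and abs(sum(alpha) - 1.0) < 1e-6
    merged = {k: v.clone() for k, v in theta_0.items()}
    for a, v in zip(alpha, task_vectors):
        for k in merged:
            merged[k] += lam * a * v[k]
    return merged

# Toy usage with a small stand-in encoder (two tasks, e.g., STS and retrieval).
backbone = torch.nn.Linear(16, 8)
theta_0 = {k: v.detach().clone() for k, v in backbone.state_dict().items()}
theta_sts = {k: v + 0.01 * torch.randn_like(v) for k, v in theta_0.items()}  # simulated fine-tune
theta_ret = {k: v + 0.01 * torch.randn_like(v) for k, v in theta_0.items()}  # simulated fine-tune

vs = [task_vector(theta_0, theta_sts), task_vector(theta_0, theta_ret)]
backbone.load_state_dict(merge(theta_0, vs, alpha=[0.6, 0.4], lam=1.0))
```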
3. Self Positioning Algorithm: Optimization in Model Space
The Self Positioning algorithm is designed to efficiently solve for coefficients that place the merged model in an empirically optimal region. The steps are:
- Probe Dataset Selection: Assemble a small, balanced probe set $D_{\text{probe}}$ (e.g., split 50% STS / 50% retrieval).
- Objective Function: Minimize the probe-set contrastive loss under the interpolated weights,

$$\min_{\alpha \in \Delta^{N-1},\ \lambda > 0}\ \mathcal{L}\big(D_{\text{probe}};\ \theta_0 + \lambda \textstyle\sum_{i=1}^{N} \alpha_i v_i\big) + \beta\,(\lambda - 1)^2,$$

with a small regularization weight $\beta$ ($0.00$–$0.10$) to prevent $\lambda$ from drifting far from 1.
- Optimization Procedure:
- Initialize $\alpha$ at the simplex center ($\alpha_i = 1/N$) and $\lambda = 1$.
- For $T$ steps: sample a batch from $D_{\text{probe}}$, compute the contrastive in-batch softmax loss under the interpolated weights, compute gradients with respect to $\alpha$ and $\lambda$, and take an Adam/SGD step (sketched below).
- After each update, project $\alpha$ back onto the simplex ($\alpha_i \ge 0$, $\sum_i \alpha_i = 1$).
This procedure (1,000 steps, batch size 32, optimized with Adam) is computationally negligible compared to the cost of model training.
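A runnable sketch of the search loop under stated assumptions: a toy linear encoder replaces the dual tower, random tensors replace probe batches, the simplex projection is a simple clamp-and-renormalize, and the regularizer is assumed to take the form $\beta(\lambda-1)^2$; the paper's exact loss, projection, and hyperparameters (including the learning rate) may differ.

```python
import torch
from torch.func import functional_call  # requires PyTorch >= 2.0

def project_to_simplex(alpha):
    """Clamp to non-negative entries and renormalize so that sum(alpha) = 1."""
    alpha = alpha.clamp(min=0.0)
    return alpha / alpha.sum().clamp(min=1e-8)

def merged_params(theta_0, task_vectors, alpha, lam):
    """theta_0 + lam * sum_i alpha_i * v_i, differentiable w.r.t. alpha and lam."""
    return {k: theta_0[k] + lam * sum(a * v[k] for a, v in zip(alpha, task_vectors))
            for k in theta_0}

def info_nce(q, d, temperature=0.05):
    """In-batch softmax contrastive loss: each query matches its own document."""
    logits = (q @ d.t()) / temperature
    return torch.nn.functional.cross_entropy(logits, torch.arange(q.size(0)))

encoder = torch.nn.Linear(16, 8)  # toy stand-in for the dual-tower encoder
theta_0 = {k: v.detach() for k, v in encoder.state_dict().items()}
task_vectors = [{k: 0.01 * torch.randn_like(v) for k, v in theta_0.items()} for _ in range(2)]

alpha = torch.full((2,), 0.5, requires_grad=True)  # start at the simplex center
lam = torch.tensor(1.0, requires_grad=True)
opt = torch.optim.Adam([alpha, lam], lr=1e-2)      # learning rate is an assumption

for step in range(1000):
    q, d = torch.randn(32, 16), torch.randn(32, 16)        # stand-in probe batch
    theta = merged_params(theta_0, task_vectors, alpha, lam)
    loss = info_nce(functional_call(encoder, theta, (q,)),
                    functional_call(encoder, theta, (d,)))
    loss = loss + 0.05 * (lam - 1.0) ** 2                   # assumed regularizer form
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        alpha.copy_(project_to_simplex(alpha))              # keep alpha on the simplex
```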
4. Implementation Architecture and Data Regime
- Backbones and Architectures: For controlled ablation, the approach is demonstrated on dual-tower BERT encoders for monolingual and bilingual tasks (STS/retrieval, English and Chinese). For large-scale application, a T5-large encoder is used, instruction-tuned on 330 tasks with a temperature-scaled contrastive loss.
- Task Data: STS (AllNLI for English, SimCLUE for Chinese) and retrieval (FEVER, HotpotQA, MS MARCO, NQ, etc.), with batches drawn per task for the joint-training baselines. The probe set for Self Positioning is always balanced between the critical tasks (see the sketch after this list).
- Computational Cost: Training the task-specific models has runtime and compute comparable to $N$ independent fine-tunes run in parallel. The Self Positioning search requires about 30 minutes (1,000 steps) and is thus negligible in total expense compared to multi-task joint training (which can require multiple days on A100 GPUs).
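A small sketch of assembling the balanced probe set; the per-task sample count, the in-memory pair format, and the helper name are illustrative assumptions, not values from the paper.

```python
import random

def balanced_probe(sts_pairs, retrieval_pairs, per_task=512, seed=0):
    """Draw the same number of examples from each task so that neither task
    dominates the Self Positioning objective, regardless of raw dataset sizes."""
    rng = random.Random(seed)
    probe = rng.sample(sts_pairs, per_task) + rng.sample(retrieval_pairs, per_task)
    rng.shuffle(probe)
    return probe

# Usage with placeholder data: lists of (query, positive) text pairs per task.
sts = [(f"sts query {i}", f"sts positive {i}") for i in range(10_000)]
ret = [(f"ret query {i}", f"ret positive {i}") for i in range(100_000)]
probe = balanced_probe(sts, ret, per_task=512)  # 50% STS / 50% retrieval
```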
5. Quantitative Performance and Comparison to Baselines
- MTEB Benchmark Evaluation: The merged model, when positioned via Self Positioning, yields an absolute +0.7 improvement over joint-training baselines (e.g., 60.5% vs 60.1% average English score; 52.9% vs 51.2% Chinese).
- Alternative Merging Strategies: Using SLERP (spherical interpolation) or hand-crafted coefficients achieves slightly smaller improvements.
- Comparison to Resampling: The model merging approach outperforms simple data resampling by 0.5–1.0 points in average multi-task score and is 5–10× faster.
| Approach | Avg EN Score | Avg ZH Score | Computational Cost |
|---|---|---|---|
| Joint Training | 60.1% | 51.2% | High (days on A100) |
| SLERP/tuned-α merging | ~60.5% | ~52.9% | Moderate |
| Self Positioning (full) | 60.5% | 52.9% | Low (minutes) |
| Resampling (simple) | ~59.5–60.0% | ~52% | High |
The improvement is consistent across both languages and major subtasks, with no observed regressions relative to single-task training, mitigating both negative transfer and data imbalance.
6. Theoretical and Practical Rationale
- Independent Task Training Avoids Gradient Collision: By never mixing gradients between tasks, the procedure eliminates the source of negative transfer.
- Convex Combination for Data Balance: The simplex coordinates $\alpha_i$ reweight task influence post hoc, removing bias without dropping any training examples (an exact projection sketch follows this list).
- Low-Dimensional Interpolation: The number of merged degrees of freedom equals the number of tasks, so the optimization problem is small-scale and converges quickly.
- Efficiency: Separate task training and fast simplex search decouple performance from training-set scale and mitigate expensive joint optimization dynamics.
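The simplex constraint referenced above can also be enforced with an exact Euclidean projection rather than a simple renormalization. The sketch below implements the standard sort-and-threshold projection onto the probability simplex; whether the paper uses this exact variant is not specified here, so treat it as one reasonable choice.

```python
import torch

def project_simplex(v: torch.Tensor) -> torch.Tensor:
    """Exact Euclidean projection of a 1-D tensor v onto the probability simplex:
    argmin_{a >= 0, sum(a) = 1} ||a - v||_2 (sort-and-threshold algorithm)."""
    n = v.numel()
    u, _ = torch.sort(v, descending=True)
    cssv = torch.cumsum(u, dim=0) - 1.0
    idx = torch.arange(1, n + 1, dtype=v.dtype)
    rho = int((u - cssv / idx > 0).nonzero()[-1]) + 1  # largest index meeting the threshold
    theta = cssv[rho - 1] / rho
    return torch.clamp(v - theta, min=0.0)

# Example: three task coefficients nudged off the simplex by a gradient step.
print(project_simplex(torch.tensor([0.7, 0.5, -0.1])))  # -> tensor([0.6, 0.4, 0.0])
```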
7. Extensions, Generalizability, and Open Questions
- Vision and Multimodal Merging: The general merging procedure is not limited to text; it can be used to combine CNN/ViT encoders trained on generic and specialized visual domains or to fuse encoders across different modalities (e.g., text and image).
- Beyond Linear Interpolation: Future work may focus on non-linear merging strategies (e.g., low-rank adapters, manifold-based or group-theoretic interpolations), and automatic selection or dynamic adaptation of the probe set for maximal cross-task transfer.
- Landscape Characterization: Theoretical study of the convexity and topology of the model-interpolation landscape in deep parameter spaces remains an important open direction.
Summary
A general embedding technique, as instantiated by model merging with Self Positioning (Li et al., 19 Oct 2024), leverages task-specific independence and principled parameter interpolation to produce robust, general-purpose embeddings. By balancing contributions from each task in the merged model through low-dimensional convex optimization, it circumvents the fundamental limitations of multi-task joint training—namely, gradient interference and data imbalance—while improving both computational efficiency and empirical performance on standardized benchmarks. This paradigm admits natural extensions to computer vision and multimodal settings and raises a set of important theoretical and practical questions for future research.