General Embedding Technique
- General embedding technique is a framework that unifies heterogeneous data into a common representation space to enable robust cross-task performance.
- It employs model merging by fine-tuning task-specific models and combining their parameter displacements through convex optimization to mitigate negative transfer and data imbalance.
- The Self Positioning algorithm optimizes interpolation coefficients on a balanced probe dataset, leading to improved retrieval and semantic similarity outcomes on benchmark evaluations.
A general embedding technique refers to a methodology or computational framework that allows a system or model to map data from heterogeneous sources or tasks into a shared representation space, typically with the goal of supporting high transferability, robust performance across diverse tasks, or efficient multitask or multimodal integration. The development of such techniques is considered fundamental to scalable and universal AI systems, with notable recent focus on general-purpose text embedding models, their training recipes, architectural designs, and approaches for overcoming common obstacles such as task conflict and data imbalance.
1. Task Conflict and Data Imbalance in General-Purpose Embedding Training
A central challenge in general embedding technique design arises when attempting to produce universal representations by jointly training a model on diverse tasks or datasets. In the context of text embedding, this is exemplified by two issues:
- Task Conflict (Negative Transfer): When performing joint contrastive training on multiple tasks, the gradients from different tasks may point in opposing directions, causing destructive interference in parameter updates (illustrated in the sketch below). Empirically, a model trained jointly on, e.g., STS (semantic textual similarity) and retrieval underperforms single-task baselines on both tasks.
- Data Imbalance: When tasks have disproportionate dataset sizes (e.g., one task with 10× more examples than another), gradients from the larger task dominate. This skews the learned parameters, reflected in task vectors that are longer and more aligned with high-data tasks (as measured by vector norms and angles in parameter space), thus introducing a bias that degrades performance on underrepresented tasks.
On the Massive Text Embedding Benchmark (MTEB)—comprising 56 datasets and seven scenario types—these effects manifest as a 1–2 point average loss on retrieval and STS sub-benchmarks for joint-trained models compared to task-specialized models.
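The gradient-collision effect can be inspected directly. The following minimal PyTorch sketch uses a toy linear encoder and placeholder losses (not the paper's models or data) to measure the cosine similarity between per-task gradients; persistently negative values correspond to the destructive interference described above.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(16, 8)  # toy stand-in for an embedding encoder

def flat_grad(loss, model):
    """Flatten the gradients of `loss` w.r.t. all model parameters into one vector."""
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

# Placeholder per-task losses standing in for STS and retrieval contrastive losses.
x_sts, x_ret = torch.randn(32, 16), torch.randn(32, 16)
loss_sts = model(x_sts).pow(2).mean()
loss_ret = (model(x_ret) - 1.0).pow(2).mean()

g_sts = flat_grad(loss_sts, model)
g_ret = flat_grad(loss_ret, model)

# Negative cosine similarity indicates conflicting (destructively interfering) updates.
cos = torch.nn.functional.cosine_similarity(g_sts, g_ret, dim=0)
print(f"gradient cosine similarity: {cos.item():+.3f}")
```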
2. Model Merging: Formal Framework and Interpolation Space
To address these challenges, the model merging framework proceeds as follows:
- Backbone Initialization: Start with a common pretrained backbone with parameters $\theta_0$.
- Task-Specific Training: Independently fine-tune $N$ copies of the model on their respective tasks $T_1, \dots, T_N$, producing final weights $\theta_1, \dots, \theta_N$.
- Task Vectors: For each task, the displacement vector in parameter space is $v_i = \theta_i - \theta_0$.
- Merged Model Parameterization: The merged model is an affine combination within the convex hull of the task vectors (a code sketch follows this subsection):

$$\theta_{\text{merged}} = \theta_0 + \lambda \sum_{i=1}^{N} \alpha_i v_i,$$

with $\alpha_i \ge 0$ and $\sum_{i=1}^{N} \alpha_i = 1$ (so $\alpha$ lies on the $(N-1)$-simplex), and $\lambda$ a positive scaling parameter (regularized to stay near 1 for stability).
The dual-tower encoder uses these merged weights for downstream tasks.
This interpolation space forms a low-dimensional polytope whose geometry reflects the directions and magnitudes of transfer for each task, and is exploited to find an optimal fused model.
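A minimal sketch of this parameterization, assuming all fine-tuned checkpoints share the backbone architecture; the tiny `torch.nn.Linear` stands in for the actual dual-tower encoder, and the "fine-tuned" weights are simulated with random perturbations.

```python
import torch

def task_vector(theta_0, theta_i):
    """Displacement v_i = theta_i - theta_0, computed per parameter tensor."""
    return {k: theta_i[k] - theta_0[k] for k in theta_0}

def merge(theta_0, task_vectors, alpha, lam=1.0):
    """theta_merged = theta_0 + lam * sum_i alpha_i * v_i, with alpha on the simplex."""
    assert all(a >= 0 for a in alpha) and abs(sum(alpha) - 1.0) < 1e-6
    merged = {k: v.clone() for k, v in theta_0.items()}
    for a, v in zip(alpha, task_vectors):
        for k in merged:
            merged[k] += lam * a * v[k]
    return merged

# Toy usage with a small stand-in encoder (two tasks, e.g., STS and retrieval).
backbone = torch.nn.Linear(16, 8)
theta_0 = {k: v.detach().clone() for k, v in backbone.state_dict().items()}
theta_sts = {k: v + 0.01 * torch.randn_like(v) for k, v in theta_0.items()}  # simulated fine-tune
theta_ret = {k: v + 0.01 * torch.randn_like(v) for k, v in theta_0.items()}  # simulated fine-tune

vs = [task_vector(theta_0, theta_sts), task_vector(theta_0, theta_ret)]
backbone.load_state_dict(merge(theta_0, vs, alpha=[0.6, 0.4], lam=1.0))
```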
3. Self Positioning Algorithm: Optimization in Model Space
The Self Positioning algorithm is designed to efficiently solve for coefficients that place the merged model in an empirically optimal region. The steps are:
- Probe Dataset Selection: Assemble a small, balanced probe set $D_{\text{probe}}$ (e.g., split 50% STS / 50% retrieval).
- Objective Function: Minimize the probe-set contrastive loss under the interpolated weights,

$$\min_{\alpha \in \Delta^{N-1},\ \lambda > 0}\ \mathcal{L}\big(D_{\text{probe}};\ \theta_0 + \lambda \textstyle\sum_{i=1}^{N} \alpha_i v_i\big) + \beta\,(\lambda - 1)^2,$$

with a small regularization weight $\beta$ ($0.00$–$0.10$) to prevent $\lambda$ from drifting far from 1.
- Optimization Procedure:
- Initialize $\alpha$ at the simplex center ($\alpha_i = 1/N$) and $\lambda = 1$.
- For $T$ steps: sample a batch from $D_{\text{probe}}$, compute the contrastive in-batch softmax loss under the interpolated weights, compute gradients with respect to $\alpha$ and $\lambda$, and take an Adam/SGD step (sketched below).
- After each update, project $\alpha$ back onto the simplex ($\alpha_i \ge 0$, $\sum_i \alpha_i = 1$).
This procedure (1,000 steps, batch size 32, optimized with Adam) is computationally negligible compared to the cost of model training.
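A runnable sketch of the search loop under stated assumptions: a toy linear encoder replaces the dual tower, random tensors replace probe batches, the simplex projection is a simple clamp-and-renormalize, and the regularizer is assumed to take the form $\beta(\lambda-1)^2$; the paper's exact loss, projection, and hyperparameters (including the learning rate) may differ.

```python
import torch
from torch.func import functional_call  # requires PyTorch >= 2.0

def project_to_simplex(alpha):
    """Clamp to non-negative entries and renormalize so that sum(alpha) = 1."""
    alpha = alpha.clamp(min=0.0)
    return alpha / alpha.sum().clamp(min=1e-8)

def merged_params(theta_0, task_vectors, alpha, lam):
    """theta_0 + lam * sum_i alpha_i * v_i, differentiable w.r.t. alpha and lam."""
    return {k: theta_0[k] + lam * sum(a * v[k] for a, v in zip(alpha, task_vectors))
            for k in theta_0}

def info_nce(q, d, temperature=0.05):
    """In-batch softmax contrastive loss: each query matches its own document."""
    logits = (q @ d.t()) / temperature
    return torch.nn.functional.cross_entropy(logits, torch.arange(q.size(0)))

encoder = torch.nn.Linear(16, 8)  # toy stand-in for the dual-tower encoder
theta_0 = {k: v.detach() for k, v in encoder.state_dict().items()}
task_vectors = [{k: 0.01 * torch.randn_like(v) for k, v in theta_0.items()} for _ in range(2)]

alpha = torch.full((2,), 0.5, requires_grad=True)  # start at the simplex center
lam = torch.tensor(1.0, requires_grad=True)
opt = torch.optim.Adam([alpha, lam], lr=1e-2)      # learning rate is an assumption

for step in range(1000):
    q, d = torch.randn(32, 16), torch.randn(32, 16)        # stand-in probe batch
    theta = merged_params(theta_0, task_vectors, alpha, lam)
    loss = info_nce(functional_call(encoder, theta, (q,)),
                    functional_call(encoder, theta, (d,)))
    loss = loss + 0.05 * (lam - 1.0) ** 2                   # assumed regularizer form
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        alpha.copy_(project_to_simplex(alpha))              # keep alpha on the simplex
```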
4. Implementation Architecture and Data Regime
- Backbones and Architectures: For controlled ablation, the approach is demonstrated on dual-tower BERT encoders for monolingual and bilingual tasks (STS/retrieval, English and Chinese). For large-scale application, a T5-large encoder is used, instruction-tuned on 330 tasks with a temperature-scaled contrastive loss.
- Task Data: STS (AllNLI for English, SimCLUE for Chinese) and retrieval (FEVER, HotpotQA, MS MARCO, NQ, etc.), with batches drawn per task for the joint-training baselines. The probe set for Self Positioning is always balanced between the critical tasks (see the sketch after this list).
- Computational Cost: Training the task-specific models has runtime and compute comparable to $N$ independent fine-tunes run in parallel. The Self Positioning search requires about 30 minutes (1,000 steps) and is thus negligible in total expense compared to multi-task joint training (which can require multiple days on A100 GPUs).
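A small sketch of assembling the balanced probe set; the per-task sample count, the in-memory pair format, and the helper name are illustrative assumptions, not values from the paper.

```python
import random

def balanced_probe(sts_pairs, retrieval_pairs, per_task=512, seed=0):
    """Draw the same number of examples from each task so that neither task
    dominates the Self Positioning objective, regardless of raw dataset sizes."""
    rng = random.Random(seed)
    probe = rng.sample(sts_pairs, per_task) + rng.sample(retrieval_pairs, per_task)
    rng.shuffle(probe)
    return probe

# Usage with placeholder data: lists of (query, positive) text pairs per task.
sts = [(f"sts query {i}", f"sts positive {i}") for i in range(10_000)]
ret = [(f"ret query {i}", f"ret positive {i}") for i in range(100_000)]
probe = balanced_probe(sts, ret, per_task=512)  # 50% STS / 50% retrieval
```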
5. Quantitative Performance and Comparison to Baselines
- MTEB Benchmark Evaluation: The merged model, when positioned via Self Positioning, yields an absolute +0.7 improvement over joint-training baselines (e.g., 60.5% vs 60.1% average English score; 52.9% vs 51.2% Chinese).
- Alternative Merging Strategies: Using SLERP (spherical interpolation) or hand-crafted coefficients achieves slightly smaller improvements.
- Comparison to Resampling: The model merging approach outperforms simple data resampling by 0.5–1.0 points in average multi-task score and is 5–10× faster.
| Approach | Avg EN Score | Avg ZH Score | Computational Cost |
|---|---|---|---|
| Joint Training | 60.1% | 51.2% | High (days on A100) |
| SLERP/tuned-α merging | ~60.5% | ~52.9% | Moderate |
| Self Positioning (full) | 60.5% | 52.9% | Low (minutes) |
| Resampling (simple) | ~59.5–60.0% | ~52% | High |
The improvement is consistent across both languages and major subtasks, with no observed regressions relative to single-task training, mitigating both negative transfer and data imbalance.
6. Theoretical and Practical Rationale
- Independent Task Training Avoids Gradient Collision: By never mixing gradients between tasks, the procedure eliminates the source of negative transfer.
- Convex Combination for Data Balance: The simplex coordinates $\alpha_i$ reweight task influence post hoc, removing bias without dropping any training examples (an exact projection sketch follows this list).
- Low-Dimensional Interpolation: The number of merged degrees of freedom equals the number of tasks, so the optimization problem is small-scale and converges quickly.
- Efficiency: Separate task training and fast simplex search decouple performance from training-set scale and mitigate expensive joint optimization dynamics.
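The simplex constraint referenced above can also be enforced with an exact Euclidean projection rather than a simple renormalization. The sketch below implements the standard sort-and-threshold projection onto the probability simplex; whether the paper uses this exact variant is not specified here, so treat it as one reasonable choice.

```python
import torch

def project_simplex(v: torch.Tensor) -> torch.Tensor:
    """Exact Euclidean projection of a 1-D tensor v onto the probability simplex:
    argmin_{a >= 0, sum(a) = 1} ||a - v||_2 (sort-and-threshold algorithm)."""
    n = v.numel()
    u, _ = torch.sort(v, descending=True)
    cssv = torch.cumsum(u, dim=0) - 1.0
    idx = torch.arange(1, n + 1, dtype=v.dtype)
    rho = int((u - cssv / idx > 0).nonzero()[-1]) + 1  # largest index meeting the threshold
    theta = cssv[rho - 1] / rho
    return torch.clamp(v - theta, min=0.0)

# Example: three task coefficients nudged off the simplex by a gradient step.
print(project_simplex(torch.tensor([0.7, 0.5, -0.1])))  # -> tensor([0.6, 0.4, 0.0])
```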
7. Extensions, Generalizability, and Open Questions
- Vision and Multimodal Merging: The general merging procedure is not limited to text; it can be used to combine CNN/ViT encoders trained on generic and specialized visual domains or to fuse encoders across different modalities (e.g., text and image).
- Beyond Linear Interpolation: Future work may focus on non-linear merging strategies (e.g., low-rank adapters, manifold-based or group-theoretic interpolations), and automatic selection or dynamic adaptation of the probe set for maximal cross-task transfer.
- Landscape Characterization: Theoretical study of the convexity and topology of the model-interpolation landscape in deep parameter spaces remains an important open direction.
Summary
A general embedding technique, as instantiated by model merging with Self Positioning (Li et al., 19 Oct 2024), leverages task-specific independence and principled parameter interpolation to produce robust, general-purpose embeddings. By balancing contributions from each task in the merged model through low-dimensional convex optimization, it circumvents the fundamental limitations of multi-task joint training—namely, gradient interference and data imbalance—while improving both computational efficiency and empirical performance on standardized benchmarks. This paradigm admits natural extensions to computer vision and multimodal settings and raises a set of important theoretical and practical questions for future research.