General Embedding Technique

Updated 16 November 2025
  • General embedding technique is a framework that unifies heterogeneous data into a common representation space to enable robust cross-task performance.
  • It employs model merging by fine-tuning task-specific models and combining their parameter displacements through convex optimization to mitigate negative transfer and data imbalance.
  • The Self Positioning algorithm optimizes interpolation coefficients on a balanced probe dataset, leading to improved retrieval and semantic similarity outcomes on benchmark evaluations.

A general embedding technique refers to a methodology or computational framework that allows a system or model to map data from heterogeneous sources or tasks into a shared representation space, typically with the goal of supporting high transferability, robust performance across diverse tasks, or efficient multitask or multimodal integration. The development of such techniques is considered fundamental to scalable and universal AI systems, with notable recent focus on general-purpose text embedding models, their training recipes, architectural designs, and approaches for overcoming common obstacles such as task conflict and data imbalance.

1. Task Conflict and Data Imbalance in General-Purpose Embedding Training

A central challenge in general embedding technique design arises when attempting to produce universal representations by jointly training a model on diverse tasks or datasets. In the context of text embedding, this is exemplified by two issues:

  • Task Conflict (Negative Transfer): When performing joint contrastive training on multiple tasks, the gradients $\nabla_\theta L^{CL}_{\text{task}\,i}$ and $\nabla_\theta L^{CL}_{\text{task}\,j}$ from different tasks may point in opposing directions, causing destructive interference in parameter updates. Empirically, a model trained jointly on, e.g., STS (semantic textual similarity) and retrieval underperforms single-task baselines on both tasks.
  • Data Imbalance: When tasks have disproportionate dataset sizes (e.g., one task with 10$\times$ more examples), gradients from the larger task dominate. This skews the learned parameters, reflected in task vectors $V_i = \theta_i - \theta_0$ that are longer and more aligned with high-data tasks (as measured by vector norms and angles in parameter space), thus introducing bias that negatively affects underrepresented tasks.

On the Massive Text Embedding Benchmark (MTEB)—comprising 56 datasets and seven scenario types—these effects manifest as a 1–2 point average loss on retrieval and STS sub-benchmarks for joint-trained models compared to task-specialized models.
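
As a minimal illustration (toy data, not code from the cited paper), the snippet below measures how two hypothetical task vectors compare in norm and direction; a large norm ratio is the signature of data imbalance, while a large angle between the vectors indicates that the tasks pull the parameters in different directions.

```python
import numpy as np

def task_vector(theta_task, theta_0):
    # Displacement V_i = theta_i - theta_0 in flattened parameter space.
    return theta_task - theta_0

def diagnose(V_i, V_j):
    # Compare two task vectors by norm and by the angle between them (degrees).
    cos = V_i @ V_j / (np.linalg.norm(V_i) * np.linalg.norm(V_j) + 1e-12)
    return {
        "norm_i": float(np.linalg.norm(V_i)),
        "norm_j": float(np.linalg.norm(V_j)),
        "angle_deg": float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))),
    }

# Toy stand-ins for flattened checkpoints; real usage would load model weights.
rng = np.random.default_rng(0)
theta_0 = rng.normal(size=1000)
theta_sts = theta_0 + 0.1 * rng.normal(size=1000)   # small update (low-data task)
theta_ret = theta_0 + 1.0 * rng.normal(size=1000)   # large update (high-data task)
print(diagnose(task_vector(theta_sts, theta_0), task_vector(theta_ret, theta_0)))
```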

2. Model Merging: Formal Framework and Interpolation Space

To address these challenges, the model merging framework proceeds as follows:

  • Backbone Initialization: Start with a common pretrained model with parameters $\theta_0$.
  • Task-Specific Training: Independently fine-tune $N$ copies of the model $\{M_i\}_{i=1}^N$ on their respective tasks $\{D_i\}_{i=1}^N$, producing final weights $\{\theta_i\}$.
  • Task Vectors: For each task, the displacement vector in parameter space is $V_i = \theta_i - \theta_0$.
  • Merged Model Parameterization: The merged model is an affine combination within the convex hull of the task vectors:

$$\theta(\alpha, \lambda) = \theta_0 + \lambda \sum_{i=1}^N \alpha_i V_i,$$

with $\alpha_i \ge 0$, $\sum_{i=1}^N \alpha_i = 1$ (so $\alpha$ lies on the $(N-1)$-simplex), and $\lambda$ is a positive scaling parameter (regularized to stay near 1 for stability).

The dual-tower encoder $M(\alpha, \lambda) = f(\theta(\alpha, \lambda))$ uses these merged weights for downstream tasks.

This interpolation space forms a low-dimensional polytope whose geometry reflects the directions and magnitudes of transfer for each task, and is exploited to find an optimal fused model.
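
Assuming each checkpoint has been flattened into a single parameter vector (the per-tensor bookkeeping of a real encoder is omitted, and the function name is illustrative), the merged parameterization reduces to a few lines:

```python
import numpy as np

def merge_parameters(theta_0, task_thetas, alpha, lam=1.0):
    # theta(alpha, lambda) = theta_0 + lambda * sum_i alpha_i * (theta_i - theta_0)
    alpha = np.asarray(alpha, dtype=float)
    assert np.all(alpha >= 0) and np.isclose(alpha.sum(), 1.0), "alpha must lie on the simplex"
    V = np.stack([theta_i - theta_0 for theta_i in task_thetas])  # task vectors V_i
    return theta_0 + lam * (alpha @ V)

# Example: merge two fine-tuned checkpoints with equal weights.
rng = np.random.default_rng(0)
theta_0 = rng.normal(size=1000)
task_thetas = [theta_0 + 0.1 * rng.normal(size=1000) for _ in range(2)]
theta_merged = merge_parameters(theta_0, task_thetas, alpha=[0.5, 0.5], lam=1.0)
```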

3. Self Positioning Algorithm: Optimization in Model Space

The Self Positioning algorithm is designed to efficiently solve for coefficients $(\hat{\alpha}, \hat{\lambda})$ that place the merged model in an empirically optimal region. The steps are:

  • Probe Dataset $\mathcal{D}_t$ Selection: Assemble a small, balanced set of examples (e.g., 32,000 items, 50% STS / 50% retrieval).
  • Objective Function:

$$(\hat{\alpha}, \hat{\lambda}) = \arg\min_{\alpha, \lambda} \frac{1}{|\mathcal{D}_t|} \sum_{(q, p, P^n) \in \mathcal{D}_t} L^{CL}\Big(\theta_0 + \lambda \sum_{i=1}^N \alpha_i V_i;\; q, p, P^n\Big) + \mu \lambda,$$

with $\mu$ a small regularizer (0.00–0.10) to prevent $\lambda \to \infty$.

  • Optimization Procedure:
    • Initialize $\alpha_i \gets 1/N$, $\lambda \gets 1$.
    • For $T$ steps: sample a batch from $\mathcal{D}_t$, compute the contrastive in-batch softmax loss under the interpolated weights, compute gradients with respect to $\alpha_i$ and $\lambda$, and take an Adam/SGD step.
    • After each update, project $\alpha$ onto the simplex ($\alpha_i \geq 0$, $\sum_i \alpha_i = 1$).

This procedure (1,000 steps, batch size 32, using Adam with learning rate $5 \times 10^{-3}$) is computationally negligible compared to the cost of model training.
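
A minimal runnable sketch of this loop is shown below, with a toy linear encoder and random stand-in batches in place of the real probe set $\mathcal{D}_t$; the simplex projection is the standard sorting-based Euclidean projection, and none of the names or choices beyond the hyperparameters quoted above are taken from the cited paper.

```python
import torch

torch.manual_seed(0)
D, N = 64, 2                                          # embedding dim, number of tasks
theta_0 = torch.randn(D, D)                           # toy "backbone": one linear projection
task_thetas = [theta_0 + 0.1 * torch.randn(D, D) for _ in range(N)]
V = torch.stack([t - theta_0 for t in task_thetas])   # task vectors V_i

def encode(theta, x):
    # Shared-weight toy tower: linear map followed by L2 normalization.
    return torch.nn.functional.normalize(x @ theta, dim=-1)

def contrastive_loss(theta, q, p, tau=0.02):
    # In-batch softmax (InfoNCE): the i-th passage is the positive for the i-th query.
    logits = encode(theta, q) @ encode(theta, p).T / tau
    return torch.nn.functional.cross_entropy(logits, torch.arange(len(q)))

def project_simplex(v):
    # Euclidean projection onto {alpha >= 0, sum(alpha) = 1} (sorting-based algorithm).
    u, _ = torch.sort(v, descending=True)
    css = torch.cumsum(u, 0) - 1.0
    ks = torch.arange(1, len(v) + 1, dtype=v.dtype)
    rho = torch.nonzero(u * ks > css).max()
    return torch.clamp(v - css[rho] / (rho + 1), min=0.0)

alpha = torch.full((N,), 1.0 / N, requires_grad=True)  # alpha_i <- 1/N
lam = torch.tensor(1.0, requires_grad=True)            # lambda <- 1
opt = torch.optim.Adam([alpha, lam], lr=5e-3)
mu = 0.05                                              # small regularizer on lambda

for step in range(1000):
    q, p = torch.randn(32, D), torch.randn(32, D)      # stand-in probe batch of (q, p) pairs
    theta = theta_0 + lam * torch.tensordot(alpha, V, dims=1)  # interpolated weights
    loss = contrastive_loss(theta, q, p) + mu * lam
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        alpha.copy_(project_simplex(alpha))            # project alpha back onto the simplex
        lam.clamp_(min=1e-3)                           # keep the scale positive
```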

4. Implementation Architecture and Data Regime

  • Backbones and Architectures: For controlled ablation, the approach is demonstrated on dual-tower BERT$_\text{base}$ for monolingual and bilingual tasks (STS/retrieval, English and Chinese). For large-scale application, a T5-large encoder is used, instruction-tuned on 330 tasks with a contrastive loss (temperature $\tau = 0.02$); see the encoder sketch after this list.
  • Task Data: STS (AllNLI for English, SimCLUE for Chinese), Retrieval (FEVER, HotpotQA, MS MARCO, NQ, etc.) with batches drawn per task for joint training baselines. The probe set for Self Positioning is always balanced between critical tasks.
  • Computational Cost: Training $N$ task-specific models has runtime and compute comparable to $N$ parallel fine-tunes. The Self Positioning search requires about 30 minutes (1,000 steps) and is thus negligible in total expense compared to multi-task joint training (which can require multiple days on A100 GPUs).
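
For concreteness, the sketch below shows one plausible dual-tower configuration with mean pooling and the in-batch softmax loss at $\tau = 0.02$; the shared-weight towers, the pooling choice, and the helper names are illustrative assumptions and may differ from the exact setup used in the paper.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Hypothetical dual-tower setup: both towers share one BERT-base encoder (a common
# weight-tying choice; the configuration in the cited paper may differ).
name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
encoder = AutoModel.from_pretrained(name)

def embed(texts):
    # Mean-pool token states into sentence embeddings, then L2-normalize.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state                 # (B, T, 768)
    mask = batch["attention_mask"].unsqueeze(-1).float()
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
    return torch.nn.functional.normalize(pooled, dim=-1)

def in_batch_contrastive_loss(q_texts, p_texts, tau=0.02):
    # InfoNCE over in-batch negatives at the temperature quoted above.
    logits = embed(q_texts) @ embed(p_texts).T / tau            # (B, B) similarities
    return torch.nn.functional.cross_entropy(logits, torch.arange(len(q_texts)))
```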

5. Quantitative Performance and Comparison to Baselines

  • MTEB Benchmark Evaluation: The merged model, when positioned via Self Positioning, yields an absolute +0.7 improvement over joint-training baselines (e.g., 60.5% vs 60.1% average English score; 52.9% vs 51.2% Chinese).
  • Alternative Merging Strategies: Using SLERP (spherical interpolation) or hand-crafted $\alpha$ coefficients achieves slightly smaller improvements.
  • Comparison to Resampling: The model merging approach outperforms simple data resampling by 0.5–1.0 in average multi-task score and is 5–10$\times$ faster.
Approach                 | Avg EN Score | Avg ZH Score | Computational Cost
Joint Training           | 60.1%        | 51.2%        | High (days on A100)
SLERP / tuned-α merging  | ~60.5%       | ~52.9%       | Moderate
Self Positioning (full)  | 60.5%        | 52.9%        | Low (minutes)
Resampling (simple)      | ~59.5–60.0%  | ~52%         | High

The improvement is consistent across both languages and the major subtasks, with no observed regressions relative to single-task training, indicating that both negative transfer and data imbalance are mitigated.

6. Theoretical and Practical Rationale

  • Independent Task Training Avoids Gradient Collision: By never mixing gradients between tasks, the procedure eliminates the source of negative transfer.
  • Convex Combination for Data Balance: The $\alpha$ simplex coordinates reweight task influence post hoc, removing bias without dropping any training examples (see the worked example after this list).
  • Low-Dimensional Interpolation: The number of merged degrees of freedom equals the number of tasks, so the optimization problem is small-scale and converges quickly.
  • Efficiency: Separate task training and fast simplex search decouple performance from training-set scale and mitigate expensive joint optimization dynamics.
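
As an illustrative numerical instance of this reweighting (numbers chosen for exposition, not reported in the paper): suppose $N = 2$ and task 1 was fine-tuned on $10\times$ more data than task 2, so $V_1$ is much longer than $V_2$. Choosing

$$\theta = \theta_0 + \lambda\,(\alpha_1 V_1 + \alpha_2 V_2), \qquad \alpha = (0.4,\ 0.6), \quad \lambda = 1,$$

gives the low-data task the larger coefficient, counteracting the dominance of $V_1$ without discarding or resampling any training examples.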

7. Extensions, Generalizability, and Open Questions

  • Vision and Multimodal Merging: The general merging procedure is not limited to text; it can be used to combine CNN/ViT encoders trained on generic and specialized visual domains or to fuse encoders across different modalities (e.g., text and image).
  • Beyond Linear Interpolation: Future work may focus on non-linear merging strategies (e.g., low-rank adapters, manifold-based or group-theoretic interpolations), and automatic selection or dynamic adaptation of the probe set $\mathcal{D}_t$ for maximal cross-task transfer.
  • Landscape Characterization: Theoretical study of the convexity and topology of the model interpolation landscape in deep parameter spaces remains an important open direction.

Summary

A general embedding technique, as instantiated by model merging with Self Positioning (Li et al., 19 Oct 2024), leverages task-specific independence and principled parameter interpolation to produce robust, general-purpose embeddings. By balancing contributions from each task in the merged model through low-dimensional convex optimization, it circumvents the fundamental limitations of multi-task joint training—namely, gradient interference and data imbalance—while improving both computational efficiency and empirical performance on standardized benchmarks. This paradigm admits natural extensions to computer vision and multimodal settings and raises a set of important theoretical and practical questions for future research.
