Siamese Sharing Dual Encoder (SDE)

Updated 26 March 2026

Siamese Sharing (SDE) is a dual encoder architecture featuring complete weight sharing across both towers to create a unified semantic space.
It employs a shared Transformer encoder and projection layer to achieve symmetric query and context representations, enhancing retrieval performance.
Empirical evaluations on QA and IR tasks show that SDE outperforms asymmetric dual encoders on Precision@1, MRR, and top-k accuracy.

Siamese Sharing (SDE), formally the Siamese Dual Encoder architecture, is a dual-tower neural retrieval model characterized by strict parameter sharing across both encoder towers. SDE is used in question-answering (QA) and information retrieval (IR) systems, where it has generally outperformed dual-encoder variants with asymmetric or partially shared parameters. The SDE applies the same token embedding layer, Transformer encoder stack, and projection layer weights to both the query (e.g., a question) and the context (e.g., a candidate answer or passage), enforcing a unified embedding space well suited to tasks that require semantic alignment between two different kinds of text input (Dong et al., 2022).

The SDE model consists of two identical towers, each processing distinct inputs but sharing all model parameters:

- Encoder Backbone: Both towers use a Transformer encoder (T5 1.1 in small, base, or large configurations). The hidden-state dimension $D_h$ varies with model size: 512 (small), 768 (base), or 1024 (large).
- Pooling and Projection: Final hidden states from the Transformer are averaged, producing representations $h_q, h_a \in \mathbb{R}^{D_h}$. These are projected to the retrieval embedding space of dimension $D_e$ (typically $D_e = D_h$) via a shared linear layer: $v_x = W_{\mathrm{proj}} h_x + b_{\mathrm{proj}}$, with $W_{\mathrm{proj}} \in \mathbb{R}^{D_e \times D_h}$ and $b_{\mathrm{proj}} \in \mathbb{R}^{D_e}$.
- Parameter Sharing: All components are shared:
  - Token embeddings
  - Transformer weights
  - Projection layer ($W_{\mathrm{proj}}, b_{\mathrm{proj}}$)

This strict sharing ensures that the question and answer towers compute the same function, so their representations lie in one aligned embedding space.
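The pooling, projection, and sharing steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_encoder` is a hypothetical stand-in that only looks up token embeddings in place of the T5 1.1 encoder, and the dimensions, vocabulary size, and random initialization are arbitrary. The point it shows is that a single `params` dictionary serves both towers.

```python
import random

def mean_pool(hidden_states):
    """Average token-level hidden states (a list of D_h-dim vectors) into one vector."""
    n = len(hidden_states)
    return [sum(column) / n for column in zip(*hidden_states)]

def project(h, W_proj, b_proj):
    """Shared linear projection: v = W_proj h + b_proj."""
    return [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(W_proj, b_proj)]

def toy_encoder(token_ids, embedding_table):
    """Hypothetical stand-in for the shared Transformer: an embedding lookup only."""
    return [embedding_table[t] for t in token_ids]

def sde_tower(token_ids, params):
    """One tower. Both query and context go through this same function and params."""
    hidden = toy_encoder(token_ids, params["embedding_table"])
    return project(mean_pool(hidden), params["W_proj"], params["b_proj"])

# Toy sizes chosen for illustration only (assumptions, not values from the source).
rng = random.Random(0)
D_h, D_e, VOCAB = 4, 4, 10
params = {
    "embedding_table": [[rng.uniform(-1, 1) for _ in range(D_h)] for _ in range(VOCAB)],
    "W_proj": [[rng.uniform(-1, 1) for _ in range(D_h)] for _ in range(D_e)],
    "b_proj": [0.0] * D_e,
}

v_q = sde_tower([1, 2, 3], params)       # query embedding
v_a = sde_tower([3, 4, 5, 6], params)    # context embedding, same parameters
```

Because both calls read the same `params`, identical token sequences receive identical embeddings regardless of which tower they are fed to.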

2. Formal Mathematical Description

Let $q$ denote a query and $a$ a candidate context. The towers compute:

Encoding:

$h_q = \mathrm{Encoder}(q) \in \mathbb{R}^{D_h}$

$h_a = \mathrm{Encoder}(a) \in \mathbb{R}^{D_h}$

Projection:

$v_q = W_{\mathrm{proj}} h_q + b_{\mathrm{proj}} \in \mathbb{R}^{D_e}$

$v_a = W_{\mathrm{proj}} h_a + b_{\mathrm{proj}} \in \mathbb{R}^{D_e}$

Resulting scoring towers:

$v_x = W_{\mathrm{proj}}\,\mathrm{Encoder}(x) + b_{\mathrm{proj}}, \quad x \in \{q, a\}$

Because the towers use fully shared parameters, the query and the context are mapped by the same function into the same embedding space, which makes the two representations symmetric by construction (Dong et al., 2022).

3. Similarity Scoring and Objective Function

The SDE employs similarity functions for ranking:

Dot product:

$s_{\mathrm{dot}}(q, a) = v_q^{\top} v_a$

Cosine similarity:

$s_{\cos}(q, a) = \dfrac{v_q^{\top} v_a}{\lVert v_q \rVert_2 \, \lVert v_a \rVert_2}$
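The two scoring functions can be sketched as follows. This is an illustrative example only: the function names and the toy vectors are assumptions, and `rank_candidates` is a hypothetical helper that sorts candidate contexts by score. The toy data is chosen so that the two functions produce different rankings, since dot product rewards vector length while cosine similarity ignores it.

```python
import math

def dot_score(v_q, v_a):
    """Dot-product similarity between a query and a context embedding."""
    return sum(x * y for x, y in zip(v_q, v_a))

def cosine_score(v_q, v_a):
    """Cosine similarity: the dot product of the L2-normalized embeddings."""
    norm_q = math.sqrt(dot_score(v_q, v_q))
    norm_a = math.sqrt(dot_score(v_a, v_a))
    return dot_score(v_q, v_a) / (norm_q * norm_a)

def rank_candidates(v_q, candidates, score_fn=dot_score):
    """Return candidate indices sorted from highest to lowest score."""
    scores = [score_fn(v_q, v_a) for v_a in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)

# Toy embeddings (illustrative values only).
v_q = [1.0, 0.0]
candidates = [[0.0, 1.0], [3.0, 4.0], [2.0, 0.0]]

print(rank_candidates(v_q, candidates, dot_score))     # [1, 2, 0]
print(rank_candidates(v_q, candidates, cosine_score))  # [2, 1, 0]
```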

Training utilizes in-batch negatives with a softmax-contrastive loss:

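$\mathcal{L} = -\dfrac{1}{B} \sum_{i=1}^{B} \log \dfrac{\exp\left(s(q_i, a_i)/\tau\right)}{\sum_{j=1}^{B} \exp\left(s(q_i, a_j)/\tau\right)}$

where $s$ is the chosen similarity function, $B$ is the batch size, $(q_i, a_i)$ are the positive pairs in the batch, and $\tau$ is a temperature hyperparameter set empirically.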
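A softmax-contrastive loss with in-batch negatives can be sketched as below. This is a minimal sketch under assumptions: `scores[i][j]` is taken to hold the similarity between query `i` and context `j`, the diagonal holds the positive pairs, and the temperature used in the example (`0.5`) is an arbitrary illustrative value, not the one used in the source experiments.

```python
import math

def in_batch_softmax_loss(scores, tau):
    """Mean negative log-likelihood of each query's own context among all
    contexts in the batch. scores[i][j] = similarity(q_i, a_j)."""
    batch_size = len(scores)
    total = 0.0
    for i in range(batch_size):
        logits = [s / tau for s in scores[i]]
        peak = max(logits)  # subtract the max before exp for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]  # -log softmax of the positive pair
    return total / batch_size

# Toy score matrices (illustrative values only).
aligned = [[5.0, 1.0], [0.5, 4.0]]      # positives on the diagonal score highest
misaligned = [[1.0, 5.0], [4.0, 0.5]]   # negatives score highest

loss_aligned = in_batch_softmax_loss(aligned, tau=0.5)
loss_misaligned = in_batch_softmax_loss(misaligned, tau=0.5)
```

The loss is small when each query scores its own context above the other contexts in the batch, and large otherwise; with all scores equal it reduces to the log of the batch size.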

4. Empirical Evaluation and Comparative Results

Across QA and IR tasks (MS MARCO, open-domain Natural Questions (NQ), MultiReQA), the SDE consistently outperforms Asymmetric Dual Encoder (ADE) baselines and most parameter-sharing ablations when the training protocol and model size are held constant.

Summary of Key Results (T5-base, $D_h = 768$):

Task	Metric	SDE	ADE	ADE-SPL
MS MARCO (dev)	P@1	15.92%	14.20%	15.46%
MS MARCO (dev)	MRR@10	28.49%	26.31%	28.20%
Open-domain NQ	Top-5 acc.	62.2%	57.6%	62.7%
Open-domain NQ	Top-20 acc.	77.0%	73.2%	76.4%
Open-domain NQ	Top-100 acc.	84.6%	82.7%	84.4%
MultiReQA (SQuAD)	P@1	70.13%	60.39%	69.39%
MultiReQA (SQuAD)	MRR	78.44%	70.33%	77.65%

Here, P@1 denotes Precision@1; MRR denotes Mean Reciprocal Rank.
ADE-SPL (ADE with a shared projection layer) approaches SDE's performance on most tasks and slightly exceeds it on NQ Top-5 accuracy (62.7% vs. 62.2%).
Sharing only the token embedder or freezing it (ADE-STE, ADE-FTE) yields small incremental improvements over ADE.
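For readers unfamiliar with the metrics in the table, the sketch below computes Precision@1, MRR@k, and top-k accuracy from ranked candidate lists. It is a simplified illustration that assumes exactly one gold item per query (real benchmarks may allow several relevant passages), and the function names and toy data are hypothetical.

```python
def precision_at_1(ranked_lists, gold):
    """Fraction of queries whose top-ranked candidate is the gold item."""
    hits = sum(1 for ranking, g in zip(ranked_lists, gold) if ranking and ranking[0] == g)
    return hits / len(gold)

def mrr_at_k(ranked_lists, gold, k=10):
    """Mean reciprocal rank of the gold item, counting 0 if it is outside the top k."""
    total = 0.0
    for ranking, g in zip(ranked_lists, gold):
        top_k = ranking[:k]
        if g in top_k:
            total += 1.0 / (top_k.index(g) + 1)
    return total / len(gold)

def top_k_accuracy(ranked_lists, gold, k):
    """Fraction of queries whose gold item appears anywhere in the top k."""
    hits = sum(1 for ranking, g in zip(ranked_lists, gold) if g in ranking[:k])
    return hits / len(gold)

# Three toy queries; the third query's gold item "x" is never retrieved.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
gold = ["a", "a", "x"]

p1 = precision_at_1(rankings, gold)      # 1 of 3 queries has gold at rank 1
mrr = mrr_at_k(rankings, gold, k=10)     # (1 + 1/2 + 0) / 3 = 0.5
top2 = top_k_accuracy(rankings, gold, 2) # 2 of 3 queries have gold in the top 2
```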

5. Analysis of Embedding Spaces

t-SNE visualization of embedding spaces provides additional evidence:

Without shared projection (ADE, ADE-STE, ADE-FTE), question and answer embeddings form two nearly disjoint clusters in 2D t-SNE space. This suggests a significant semantic misalignment between towers, reducing retrieval quality.
With shared projection (SDE, ADE-SPL), question and answer embeddings overlap and intermingle in the t-SNE space, reflecting a common semantic space and directly corresponding to improved retrieval metrics (Dong et al., 2022).

6. Strengths, Limitations, and Practical Recommendations

Strengths of SDE:

Parameter sharing tightly enforces a joint semantic embedding space.
It outperforms the ADE baseline across the reported tasks and model sizes.

Limitations:

Cross-tower interactions are limited to scoring (dot/cosine); the architecture does not model richer cross-input attention or late interaction.
The analysis is restricted to dual-encoder models; hybrid or late-interaction methods are not addressed.

Practical Guidance:

SDE, with total weight sharing, is the empirically preferred design for homogeneous dual-encoder QA and IR settings.
If strict symmetry is impractical (e.g., for highly dissimilar input modalities), sharing at least the projection layer (ADE-SPL) can recover most of the performance gains of full SDE at minimal implementation cost (Dong et al., 2022).
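The difference between the two recommended designs comes down to which objects the towers share. The sketch below is a hypothetical illustration, not the authors' code: `Tower` uses a plain embedding lookup as a stand-in for the token embedder plus Transformer encoder, and all sizes and random weights are arbitrary. In the SDE configuration both towers hold the same encoder and projection; in the ADE-SPL configuration each tower has its own encoder but both hold the same projection.

```python
import random
from dataclasses import dataclass

@dataclass
class Tower:
    embedding_table: list  # stand-in for token embedder + Transformer encoder
    projection: dict       # {"W": matrix, "b": bias}, possibly shared

    def encode(self, token_ids):
        rows = [self.embedding_table[t] for t in token_ids]
        h = [sum(column) / len(rows) for column in zip(*rows)]  # mean pooling
        W, b = self.projection["W"], self.projection["b"]
        return [sum(w * x for w, x in zip(row, h)) + bias for row, bias in zip(W, b)]

def random_matrix(rng, n_rows, n_cols):
    return [[rng.uniform(-1, 1) for _ in range(n_cols)] for _ in range(n_rows)]

rng = random.Random(0)
D, VOCAB = 4, 10  # toy sizes, chosen for illustration only
shared_projection = {"W": random_matrix(rng, D, D), "b": [0.0] * D}

# SDE: one encoder and one projection, used by both towers.
shared_encoder = random_matrix(rng, VOCAB, D)
sde_query = Tower(shared_encoder, shared_projection)
sde_context = Tower(shared_encoder, shared_projection)

# ADE-SPL: separate encoders, but the projection object is still shared.
spl_query = Tower(random_matrix(rng, VOCAB, D), shared_projection)
spl_context = Tower(random_matrix(rng, VOCAB, D), shared_projection)
```

In a training framework the same idea would apply to parameter objects: gradients from both towers would update the single shared projection, while the ADE-SPL encoders would be updated independently.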

References

Dong et al. (2022). Exploring Dual Encoder Architectures for Question Answering.