
Shared Transformer Encoder

Updated 5 February 2026
  • The Shared Transformer Encoder is a neural architecture that uses a single set of Transformer layers to process inputs from multiple tasks or modalities.
  • It employs techniques such as hard parameter sharing, modality-specific token augmentation, and task-specific output heads to guide learning.
  • Empirical studies show improvements in F1 scores, reduced false positives, and significant parameter reductions in multi-task and multimodal applications.

A shared Transformer encoder is a neural architecture in which a single set of Transformer layers is used jointly by multiple tasks, data modalities, or problem criteria, rather than allocating a separate encoder stack to each. This approach offers parameter efficiency, improved sample efficiency, and the ability to transfer or regularize across heterogeneous tasks or modalities by leveraging shared inductive biases in the learned representations. Variants include hard parameter sharing for multi-task learning, architectural unification for cross-modal models, and intra-layer weight-sharing strategies for compression.

1. Core Principles and Architectural Patterns

All shared Transformer encoder systems instantiate a common mechanism: a single stack of Transformer layers receives input from one or multiple sources, projecting these through shared parameters to yield context-aware representations. Task- or modality-specific distinctions are typically injected via either additional input tokens/embeddings or by lightweight task/modality-specific output heads.

  • Hard parameter sharing: All L Transformer layers in the encoder are parameter-shared across different tasks or modalities. For example, in multi-task hate speech and emotion detection, all tasks share \theta_{\text{shared}} in the contextualizer (Mnassri et al., 2023).
  • Modality/task-differentiated input augmentation: Approaches prepend special tokens (e.g., criterion, modality) or concatenate modality-specific vectors to the token embeddings, conditioning the shared encoder's function appropriately (Qiu et al., 2019, Roy et al., 3 Mar 2025).
  • Task/modality-specific output heads: Dedicated classification or regression heads (typically linear layers and softmax) operate atop a shared hidden state (e.g., [CLS] vector or mean pooled embedding), enabling downstream specialization (Mnassri et al., 2023, Qiu et al., 2019).
  • Intra-encoder weight sharing for compression: Rather than sharing across tasks or modalities, blocks of consecutive Transformer layers may share a full-rank weight matrix, with low-rank per-layer residuals, to reduce overall model size, as in ResidualTransformer (Wang et al., 2023).
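
The hard-parameter-sharing pattern with task-specific heads can be sketched in a few lines of PyTorch. This is a minimal illustration, not any paper's released code; the vocabulary size, layer counts, and the two task names ("hate", "emotion") are hypothetical placeholders:

```python
import torch
import torch.nn as nn

class SharedEncoderMultiTask(nn.Module):
    """One Transformer encoder (theta_shared) used by every task,
    plus a lightweight linear head per task."""

    def __init__(self, vocab_size=1000, d_model=64, n_layers=2,
                 task_classes={"hate": 2, "emotion": 6}):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)  # shared
        self.heads = nn.ModuleDict(
            {t: nn.Linear(d_model, c) for t, c in task_classes.items()})

    def forward(self, x, task):
        h = self.encoder(self.embed(x))   # shared contextualizer
        return self.heads[task](h[:, 0])  # [CLS]-style first-token pooling

model = SharedEncoderMultiTask()
tokens = torch.randint(0, 1000, (3, 10))  # batch of 3 sequences, length 10
logits = model(tokens, "emotion")         # shape (3, 6)
```

Every forward pass, regardless of task, routes through the same encoder parameters; only the final projection differs.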

2. Mathematical Formalizations and Loss Structures

Formalization follows standard Transformer notation but with key parameter-tying or multi-task objectives.

Shared Encoder Stack

Given input X = \{x_1, \ldots, x_n\}, after tokenization/embedding,

H = \operatorname{Transformer}(X; \theta_{\text{shared}}) \in \mathbb{R}^{n \times d}.

For task t, the output head computes logits for classification/regression:

z^{(t)} = W^{(t)} h_{\text{[CLS]}} + b^{(t)}; \quad \hat{y}^{(t)} = \operatorname{softmax}(z^{(t)}).

For multi-task setups,

\mathcal{L}_{\text{total}} = \sum_t \lambda_t\, \mathcal{L}_{\text{CE}}(y^{(t)}, \hat{y}^{(t)}),

where \lambda_t are task weights (Mnassri et al., 2023).
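
A minimal sketch of this weighted multi-task objective, assuming per-task logits and labels are collected in dictionaries (the task names and weights here are illustrative, not from any cited paper):

```python
import torch
import torch.nn.functional as F

def multitask_loss(logits_per_task, labels_per_task, weights):
    """L_total = sum_t lambda_t * CE(y_t, y_hat_t); the lambda_t are
    treated as fixed hyperparameters in this sketch."""
    return sum(weights[t] * F.cross_entropy(logits_per_task[t], labels_per_task[t])
               for t in logits_per_task)

logits = {"hate": torch.randn(4, 2), "emotion": torch.randn(4, 6)}
labels = {"hate": torch.randint(0, 2, (4,)), "emotion": torch.randint(0, 6, (4,))}
loss = multitask_loss(logits, labels, {"hate": 1.0, "emotion": 0.5})
```

Backpropagating this single scalar accumulates gradients from every active head into the shared encoder.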

Multimodal Shared Encoders

Input tokens for each modality:

  • Modality vector (append): h^0_M = [[e_M^1; v_M], \ldots, [e_M^s; v_M]] \in \mathbb{R}^{s \times (d+f)} (Roy et al., 3 Mar 2025).
  • Modality token (prepend): h^0_M = [e_M, e_M^1, \ldots, e_M^s] \in \mathbb{R}^{(s+1) \times d}.
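
Both input-augmentation schemes are simple tensor operations; a sketch under the shapes given above (s tokens of width d, a modality vector of width f):

```python
import torch

def prepend_modality_token(tokens, modality_emb):
    """Modality token variant: (s, d) -> (s+1, d), e_M goes first."""
    return torch.cat([modality_emb.unsqueeze(0), tokens], dim=0)

def append_modality_vector(tokens, modality_vec):
    """Modality vector variant: concatenate v_M to every token,
    (s, d) -> (s, d+f)."""
    s = tokens.shape[0]
    return torch.cat([tokens, modality_vec.expand(s, -1)], dim=1)

e = torch.randn(5, 8)                                 # s=5 tokens, d=8
h_tok = prepend_modality_token(e, torch.randn(8))     # (6, 8)
h_vec = append_modality_vector(e, torch.randn(1, 3))  # (5, 11), f=3
```

The prepend variant keeps the model width unchanged; the append variant widens every token, so the encoder's input projection must be sized d+f.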

Contrastive losses (CLIP-style) operate on the joint output to align modalities:

\mathcal{L}_{\mathrm{con}} = - \frac{1}{N} \sum_{i=1}^{N} \left[ \log \frac{e^{\langle z_I^i, z_T^i\rangle/\tau}}{\sum_{j=1}^N e^{\langle z_I^i, z_T^j\rangle/\tau}} + \log \frac{e^{\langle z_T^i, z_I^i\rangle/\tau}}{\sum_{j=1}^N e^{\langle z_T^i, z_I^j\rangle/\tau}} \right]

(Roy et al., 3 Mar 2025).
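Each of the two log-softmax terms above is exactly a cross-entropy over the N x N similarity matrix with the diagonal as targets, so the loss can be sketched as (a generic CLIP-style implementation, not the cited papers' code; tau = 0.07 is an assumed value):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z_img, z_txt, tau=0.07):
    """Symmetric InfoNCE: image->text and text->image cross-entropies
    over the cosine-similarity matrix, matched pairs on the diagonal."""
    z_img = F.normalize(z_img, dim=1)
    z_txt = F.normalize(z_txt, dim=1)
    sim = z_img @ z_txt.t() / tau              # <z_I^i, z_T^j> / tau
    targets = torch.arange(z_img.shape[0])
    return F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets)

z_i, z_t = torch.randn(8, 16), torch.randn(8, 16)
loss = contrastive_loss(z_i, z_t)
```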

Weight-Sharing Within Encoder Layers

Each linear projection in the Transformer encoder is parameterized as

W^{(l)} = S^{(g)} + A^{(l)} B^{(l)} + D^{(l)}, \quad \text{where } g = \lfloor l/K \rfloor.

S^{(g)} is a full-rank matrix shared by a block of K consecutive layers, A^{(l)} B^{(l)} is a low-rank per-layer residual, and D^{(l)} is an optional small diagonal term (Wang et al., 2023).
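
A minimal sketch of this parameterization (omitting the optional diagonal term; sizes d=64, K=3, rank r=4 are illustrative, and this is not the ResidualTransformer release code):

```python
import torch
import torch.nn as nn

class SharedResidualLinear(nn.Module):
    """W^(l) = S^(g) + A^(l) B^(l): one full-rank matrix per block of K
    layers, plus a rank-r residual that is private to each layer."""

    def __init__(self, d=64, n_layers=6, K=3, r=4):
        super().__init__()
        self.K = K
        n_blocks = (n_layers + K - 1) // K
        self.S = nn.ParameterList(
            [nn.Parameter(torch.randn(d, d) * 0.02) for _ in range(n_blocks)])
        self.A = nn.ParameterList(
            [nn.Parameter(torch.randn(d, r) * 0.02) for _ in range(n_layers)])
        self.B = nn.ParameterList(
            [nn.Parameter(torch.randn(r, d) * 0.02) for _ in range(n_layers)])

    def weight(self, l):
        return self.S[l // self.K] + self.A[l] @ self.B[l]

m = SharedResidualLinear()
W0 = m.weight(0)  # layer 0 and layer 2 share the same S^(0)
```

With these sizes, the module stores 2 shared 64x64 matrices plus 6 rank-4 residuals, well under the 6 full 64x64 matrices an unshared stack would need.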

3. Application Modalities: Multi-Task, Multimodal, Multicriteria, and Compression

The shared encoder pattern appears in a spectrum of application areas:

  • Multi-task learning: Simultaneous classification over distinct label spaces—e.g., hate/offensive detection + emotion recognition via a shared BERT/mBERT encoder improves F1 scores (up to +3 points) and substantially reduces false positives by leveraging shared affective representations (Mnassri et al., 2023).
  • Multimodal representation learning: Unified encoders for both text and images, such as in MoMo and specialized medical retrieval, use positional, modality, and token embeddings to successfully align semantic spaces and improve performance on both data-rich and data-constrained benchmarks (Chada et al., 2023, Roy et al., 3 Mar 2025).
  • Multi-criteria tagging: In Chinese word segmentation, a criterion-token is prepended to each input, conditioning a shared encoder. This supports fast transfer to new annotation criteria and enables handling of mixed-script data with negligible F1 degradation (<0.05) (Qiu et al., 2019).
| Research Area  | Model Design              | Principal Dataset(s)                |
|----------------|---------------------------|-------------------------------------|
| Multi-task NLP | BERT/mBERT shared encoder | Davidson, GoEmotions                |
| Multimodal     | Single ViT-style encoder  | ImageNet, Wikibooks, PMD, MIMIC-CXR |
| MCCWS          | Shared Transformer, CRF   | Eight CWS corpora                   |
| Compression    | ResidualTransformer       | Speech ASR/ST (10k h)               |

4. Empirical Outcomes and Parameter Efficiency

Empirical studies consistently show that shared encoder architectures yield gains in data efficiency, memory/computational cost, and sometimes even absolute performance, especially under limited data scenarios.

  • In multi-task hate speech and emotion detection, multi-task shared-encoder models (BERT/mBERT) achieve an F1 macro score improvement up to +3 points for hate detection and reduce false positive rates (e.g., BERT-STL: 14.4% vs. BERT-MTL: 1.06%) (Mnassri et al., 2023).
  • Multimodal shared encoders (MoMo) rival larger systems (FLAVA, CLIP) using two-fifths of the parameters and one-third of the paired data, with up to +3.1% gains on multimodal benchmarks (Chada et al., 2023). In medical settings, shared encoders with a tiny modality vector yield up to a 94% relative gain in Recall@200 in the lowest-data regime compared with separate encoders (Roy et al., 3 Mar 2025).
  • For speech recognition and translation, weight-sharing across encoder layers in ResidualTransformer achieves a ≈3× parameter reduction with only 1.8% relative increase in WER (13.28% → 13.52%) and ≤1.4 BLEU drop (Wang et al., 2023).
  • Joint multilingual training with shared encoders, as in spoken term detection, shows stabilizing effects and increases maximum term-weighted value (MTWV) in cross-lingual tasks (Švec et al., 2022).

5. Regularization, Adaptation, and Training Procedures

Parameter sharing acts as an implicit regularizer by constraining representational freedom, thereby reducing overfitting to individual tasks or modalities. Gradient updates accumulate from all active heads into the shared layers, regularizing the encoder and improving generalization (Mnassri et al., 2023, Chada et al., 2023). When new criteria or modalities are encountered, rapid adaptation can be achieved via lightweight fine-tuning of embedded tokens or vectors (e.g., criterion-embeddings for new Chinese word segmentation criteria) (Qiu et al., 2019, Roy et al., 3 Mar 2025).
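
The rapid-adaptation recipe — freeze the shared encoder, train only a new lightweight embedding — can be sketched as follows (a generic illustration with hypothetical sizes, not any cited paper's code; the objective is a placeholder):

```python
import torch
import torch.nn as nn

# Frozen shared encoder; only a new criterion/modality embedding adapts.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(32, nhead=4, batch_first=True), num_layers=2)
for p in encoder.parameters():
    p.requires_grad = False            # shared weights stay fixed

new_criterion = nn.Parameter(torch.randn(1, 1, 32))  # the only trainable tensor
opt = torch.optim.Adam([new_criterion], lr=1e-3)

x = torch.randn(2, 7, 32)                            # batch of token embeddings
h = encoder(torch.cat([new_criterion.expand(2, -1, -1), x], dim=1))
loss = h[:, 0].pow(2).mean()                         # placeholder objective
loss.backward()                                      # grads reach only new_criterion
opt.step()
```

Because the optimizer sees a single small tensor, adaptation to a new criterion costs a negligible fraction of full fine-tuning.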

In training, best practices include:

  • weighting per-task losses with coefficients \lambda_t and tuning them jointly with the other hyperparameters (Mnassri et al., 2023);
  • scheduling or gradient balancing when mixing modalities, so that modality-specific performance does not degrade in early training (Chada et al., 2023);
  • adapting to new criteria or modalities by fine-tuning only lightweight criterion/modality embeddings while keeping the shared encoder frozen (Qiu et al., 2019, Roy et al., 3 Mar 2025).

6. Limitations, Trade-offs, and Variations

Performance benefits of shared encoders are generally robust but not universal. For example, multi-task gains in offensive language detection are less significant than for hate speech (Mnassri et al., 2023). In medical multimodal retrieval, performance gains of shared encoders over separate encoders are most pronounced in the low-data regime; with abundant data, improvements become marginal (Roy et al., 3 Mar 2025). When merging all data types in early training stages, modality-specific performance can degrade unless proper scheduling or gradient balancing is enforced (Chada et al., 2023).

Some designs admit lightweight modality- or task-specific layers before or after the shared encoder stack, balancing inductive sharing with limited specialization. Ablations show that early insertion (before the shared encoder) provides modest gains (Roy et al., 3 Mar 2025).
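
A sketch of this "early insertion" variant, with a per-modality projection ahead of the shared stack and a head after it (module names and sizes are hypothetical):

```python
import torch
import torch.nn as nn

class LightSpecialization(nn.Module):
    """Per-modality linear projection -> shared Transformer encoder ->
    shared classification head; only the first stage is specialized."""

    def __init__(self, d=32, n_classes=10):
        super().__init__()
        self.pre = nn.ModuleDict({"image": nn.Linear(d, d),
                                  "text": nn.Linear(d, d)})
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.shared = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d, n_classes)

    def forward(self, x, modality):
        h = self.shared(self.pre[modality](x))  # shared layers dominate cost
        return self.head(h[:, 0])

m = LightSpecialization()
out = m(torch.randn(2, 5, 32), "text")  # shape (2, 10)
```

The specialized projections add only d*d parameters per modality, so the parameter-sharing benefit of the backbone is largely preserved.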

7. Outlook and Impact Across Domains

The shared Transformer encoder paradigm has enabled models to efficiently generalize across tasks (multi-task learning), criteria (multi-criteria tagging), and modalities (vision-language). It enhances sample efficiency—critical for low-resource and data-scarce domains (especially in biomedical applications)—and compresses model size for deployment in resource-constrained environments. These results have reoriented many pipeline architectures from multi-stream and dual-encoder patterns toward unified, parameter-shared backbone models in a variety of deployment contexts (Mnassri et al., 2023, Chada et al., 2023, Roy et al., 3 Mar 2025, Wang et al., 2023).
