Shared Encoder Architecture
- Shared Encoder Architecture is a framework where one encoder processes diverse tasks and modalities using hard parameter sharing.
- It improves transfer learning, reduces overfitting, and lowers memory usage by consolidating feature extraction across domains.
- Applications include multilingual NLP, multimodal learning, and multi-task vision, achieving robust performance especially in low-data regimes.
A shared encoder architecture is an organizational paradigm in machine learning wherein a single parameterized encoder network processes multiple tasks, modalities, or data streams, either alone or in conjunction with task-specific decoders or heads. This construct enables parameter efficiency, inductive transfer, and regularization via hard parameter sharing, often yielding superior performance, reduced overfitting, and greatly improved compute/memory efficiency—particularly in data-limited regimes or multi-domain/multimodal tasks. The shared encoder principle manifests in multilingual NLP, multimodal representation learning, multi-task vision, speech, low-level signal inference, and compressed context modeling in LLM-based systems.
1. Foundational Principles and Canonical Architectures
A shared encoder is defined as a parameterized feature extraction module—most commonly a stack of Transformers, ResNets, or MLPs—employed identically across multiple tasks, data modalities, or streams. The central attribute is hard parameter sharing: all relevant inputs are processed through an identical parameter set, with no or only minimal adaptation per task or modality. This enables the architecture to capture universal, task-independent representations and constrains total parameter count.
Variants include:
- Encoder–multi-decoder: Shared encoder feeding multiple task- or modality-specific decoders as in multi-task or multimodal models (Jiang et al., 2019, Merizzi et al., 12 Jun 2025).
- Encoder–encoder with parameter sharing: Two or more “branches” employing the identical encoder parameters (e.g., for cross-modal or cross-stream matching) (Švec et al., 2022).
- Unified modality encoder: A single encoder handling both text and image (or more) streams, often with modality-identifying embeddings (Chada et al., 2023, Roy et al., 3 Mar 2025).
- Modular decoders: A shared encoder combined with a set of discrete, often independently trainable decoders (e.g., for explicit specialization or parallelism) (Rangarajan et al., 29 Jun 2025).
- Multi-task classification with shared encoder + separate heads: Classical in NLP, where a single BERT encodes all inputs, and small MLPs perform task-specific prediction (Mnassri et al., 2023).
In contrast, language-specific or modality-specific architectures instantiate dedicated encoders per task or language, trading off greater capacity for reduced transfer and increased parameter budget (Escolano et al., 2020).
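The hard-parameter-sharing pattern common to these variants can be sketched in a few lines. Below is a minimal NumPy illustration (all dimensions, weights, and names are invented for the example, not taken from any cited system): one encoder parameter set serves every task, and only small task-specific heads differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not from any cited model).
D_IN, D_HID, N_TASKS = 8, 16, 3

# Hard parameter sharing: a single encoder weight matrix for all tasks,
# plus one small head per task.
W_enc = rng.standard_normal((D_IN, D_HID)) * 0.1
heads = [rng.standard_normal((D_HID, 2)) * 0.1 for _ in range(N_TASKS)]

def encode(x):
    """Shared encoder: the same parameters process every task's input."""
    return np.maximum(x @ W_enc, 0.0)  # linear layer + ReLU

def predict(x, task_id):
    """Task-specific head on top of the shared representation."""
    return encode(x) @ heads[task_id]

x = rng.standard_normal((4, D_IN))  # a batch of 4 inputs
logits_a = predict(x, 0)            # task 0 prediction
logits_b = predict(x, 1)            # task 1: same encoder, different head
```

A language- or task-specific design would instead instantiate one `W_enc` per task, multiplying the encoder parameter budget by the number of tasks.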
2. Cross-domain and Application-specific Realizations
Multimodal and Multitask Domains
Multimodal Representation Learning: Shared encoders with modality-type embeddings or tokens process disparate input types in medical (Roy et al., 3 Mar 2025), vision-language (Chada et al., 2023), or general multimodal regimes. These exploit modality embeddings (e.g., learnable vectors concatenated to or prepended in the sequence) and sometimes shallow modality-specific towers for specialization, but the core feature extraction is universally shared. Notably, in MoMo, both image patches and text tokens are projected via the same ViT backbone, demonstrating robust transfer and parameter economy (Chada et al., 2023).
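The modality-embedding mechanism described above can be sketched as follows in NumPy (dimensions and names are illustrative, not from MoMo or any cited model): a learnable per-modality vector is prepended to the token or patch sequence, after which all inputs pass through the same encoder stack.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 12  # shared model width (illustrative)

# One learnable modality-identifying embedding per input type.
modality_emb = {"text": rng.standard_normal(D), "image": rng.standard_normal(D)}

def with_modality_token(seq, modality):
    """Prepend the modality embedding so the shared encoder can condition
    on input type while keeping every deep layer's weights shared."""
    return np.vstack([modality_emb[modality][None, :], seq])

text_seq = rng.standard_normal((5, D))   # 5 token embeddings
image_seq = rng.standard_normal((9, D))  # 9 image-patch embeddings

x_text = with_modality_token(text_seq, "text")     # shape (6, D)
x_image = with_modality_token(image_seq, "image")  # shape (10, D)
# Both sequences now flow through the same encoder stack unchanged.
```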
Multi-task Vision: Architectures such as SENSE for scene flow (Jiang et al., 2019) and SwinMTL for joint depth/segmentation (Taghavi et al., 2024) use a shared CNN or transformer backbone, with task-specific decoders for optical flow, stereo, occlusions, segmentation, depth, or similar outputs. In both, efficiency (reduced parameters, memory) and performance gains are observed due to shared low-level features, with empirical ablations confirming improvements over separate-task baselines.
Encoder-Encoder Models for Matching Tasks: In spoken term detection, the architecture involves two parallel pipelines—with all Transformer sub-blocks fully shared—processing hypothesis and query inputs, and projecting both into a joint embedding space for dot-product matching (Švec et al., 2022). Only input-specific embedding and convolutional layers are unshared; all deep layers share parameters.
Multilingual and Domain-general NLP
Multilingual Sentence Encoders: Enforcing a single encoder for all languages, combined with decoder-side language indication tokens, enables transfer across high- and low-resource languages. The shared encoder, pre-trained on translation and denoising autoencoding, provides cross-lingual generalization and parameter efficiency for STS and translation (Tang et al., 2018).
Compressed Context for LLMs: ARC-Encoder employs a shared encoder to produce compressed vector sequences substituting for full text token embeddings in a variety of frozen LLM decoders. The same set of encoder parameters, combined with small adapter MLPs per decoder and special tokens, enables context packing and portable adaptation to multiple LLMs with minimal additional tuning (Pilchen et al., 23 Oct 2025).
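The data flow of a shared compressor plus a small adapter per frozen decoder can be caricatured as follows. In this NumPy sketch, simple mean-pooling stands in for the real encoder, and every name and size is invented purely to show the structure.

```python
import numpy as np

rng = np.random.default_rng(2)
D_ENC, D_LLM, N_COMPRESSED = 16, 32, 4  # all sizes invented

# Shared encoder-side weights, plus one small adapter per frozen decoder.
W_shared = rng.standard_normal((D_ENC, D_ENC)) * 0.1
adapters = {
    "decoder_a": rng.standard_normal((D_ENC, D_LLM)) * 0.1,
    "decoder_b": rng.standard_normal((D_ENC, D_LLM)) * 0.1,
}

def compress_context(token_embs, decoder_name):
    """Compress a long context into N_COMPRESSED vectors that stand in
    for token embeddings in a frozen decoder. Mean-pooling replaces the
    real encoder here; all names are illustrative."""
    chunks = np.array_split(token_embs, N_COMPRESSED)    # coarse chunking
    pooled = np.stack([c.mean(axis=0) for c in chunks])  # (N_COMPRESSED, D_ENC)
    h = np.maximum(pooled @ W_shared, 0.0)               # shared transform
    return h @ adapters[decoder_name]                    # map into that LLM's space

ctx = rng.standard_normal((40, D_ENC))       # a 40-token context
vecs_a = compress_context(ctx, "decoder_a")  # (4, 32) for decoder A
vecs_b = compress_context(ctx, "decoder_b")  # same encoder, different adapter
```

Only the small adapter differs per decoder, which is what makes the compressed representations portable across frozen LLMs.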
3. Mathematical Formalization and Training Objectives
A canonical shared encoder system processes an input $x_i$ (from task or modality $i$) as:

$$z_i = f_\theta(x_i \oplus e_i)$$

Here, $f_\theta$ is shared across all tasks and modalities, and $e_i$ is a small, learnable modality- or task-identifying vector (possibly omitted), with $\oplus$ denoting its concatenation with or addition to the input representation.
The downstream objective may be:
- Multi-task classification: $\mathcal{L} = \sum_{t} \lambda_t \, \mathcal{L}_t\big(g_{\phi_t}(f_\theta(x_t)), y_t\big)$, where each task's loss is computed via a small head $g_{\phi_t}$ atop the shared representation $f_\theta(x_t)$ (Mnassri et al., 2023, Taghavi et al., 2024).
- Contrastive multimodal learning: CLIP-style symmetric loss for paired inputs $(z_i^a, z_i^b)$ (Roy et al., 3 Mar 2025):

$$\mathcal{L} = -\frac{1}{2N}\sum_{i=1}^{N}\left[\log\frac{\exp(z_i^a \cdot z_i^b/\tau)}{\sum_{j}\exp(z_i^a \cdot z_j^b/\tau)} + \log\frac{\exp(z_i^b \cdot z_i^a/\tau)}{\sum_{j}\exp(z_i^b \cdot z_j^a/\tau)}\right]$$
- Task-specific objectives: regression for depth, cross-entropy for segmentation (Taghavi et al., 2024), or reconstructive or dot-product scoring for alignment (Švec et al., 2022).
Training proceeds with mini-batches drawn from all tasks/modalities, backpropagating gradients jointly through the shared encoder but only through the head or tower relevant to each example.
Parameter sharing allows reduction in the number of parameters that require estimation, especially crucial in low-data regimes (Roy et al., 3 Mar 2025).
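The CLIP-style symmetric contrastive objective can be written out concretely. The following is a minimal NumPy sketch (the temperature value and helper names are illustrative):

```python
import numpy as np

def symmetric_clip_loss(z_a, z_b, temperature=0.07):
    """CLIP-style symmetric InfoNCE loss over a batch of paired,
    L2-normalized embeddings z_a, z_b, each of shape (B, D)."""
    logits = (z_a @ z_b.T) / temperature  # (B, B) pairwise similarities

    def xent_diagonal(l):
        # Cross-entropy with the matching pair (the diagonal) as target.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Symmetrize over both matching directions (a -> b and b -> a).
    return 0.5 * (xent_diagonal(logits) + xent_diagonal(logits.T))

# Orthonormal, perfectly aligned pairs give a near-zero loss,
# while mismatched pairings are heavily penalized.
z = np.eye(4)
aligned = symmetric_clip_loss(z, z)
shuffled = symmetric_clip_loss(z, z[::-1])
```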
4. Implementation, Scalability, and Efficiency Considerations
Parameter and Memory Efficiency: Sharing the encoder drastically reduces parameter count and memory footprint versus maintaining separate encoders. For instance, SENSE’s four-task system uses 13.4 M parameters (shared backbone + heads) vs. 17.1+ M for separate models, and FlowNet3 (separate models) uses 234 M (Jiang et al., 2019). MoMo’s base multimodal encoder achieves competitive performance on vision/language tasks with 110 M parameters versus 241 M in FLAVA (Chada et al., 2023).
Inference Speed and Resource Utilization: Shared encoder models such as 1EMD for multi-variable climate downscaling achieve ~25% faster inference per variable due to running the transformer only once per input (Merizzi et al., 12 Jun 2025).
Flexible Specialization: Some architectures provide a purely shared encoder; others add minimal per-task or per-modality towers (often just one or two transformer layers) for further specialization—yielding the best sample efficiency/performance trade-off in limited data settings (Roy et al., 3 Mar 2025).
Dynamic Model Size Extraction: The unified cascaded encoder for ASR allows extraction of sub-models of different depths, all reusing the same parameter set but with separate decoders, for deployment scenarios with varying compute/latency constraints. This results in 36-37% total size reduction with negligible quality loss versus separately trained models (Ding et al., 2022).
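Sub-model extraction of this kind reduces to executing a prefix of the shared layer stack and attaching a depth-specific decoder. A minimal NumPy sketch, with all layer types, depths, and sizes invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
D, N_LAYERS = 8, 6  # illustrative width and depth

# One shared stack of encoder layers (plain matrices here; in the ASR
# system these would be full transformer/conformer blocks).
layer_ws = [rng.standard_normal((D, D)) * 0.1 for _ in range(N_LAYERS)]
# Separate decoders attached at different depths (depths illustrative).
heads = {3: rng.standard_normal((D, 4)), 6: rng.standard_normal((D, 4))}

def run_submodel(x, depth):
    """Execute only the first `depth` shared layers, then apply that
    depth's decoder. Every extracted sub-model reuses the full model's
    encoder parameters, so no extra encoder weights are stored."""
    h = x
    for w in layer_ws[:depth]:
        h = np.maximum(h @ w, 0.0)
    return h @ heads[depth]

x = rng.standard_normal((2, D))
y_small = run_submodel(x, 3)  # shallow sub-model for low-latency devices
y_full = run_submodel(x, 6)   # full-depth model for best quality
```

Because the shallow sub-model is a strict prefix of the full stack, deployment only needs to ship one copy of the encoder weights plus the per-depth decoders.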
5. Empirical Outcomes and Evaluative Studies
Shared encoder variants consistently demonstrate:
- Performance Gains in Multi-task and Multimodal Regimes: For SENSE, joint training with a shared encoder improves both in-domain accuracy and enables semi-supervised/teacher-distillation learning when ground-truth for certain tasks is absent (Jiang et al., 2019). In SwinMTL, both depth and segmentation tasks achieve higher metrics when trained together versus separately (Taghavi et al., 2024).
- Superior Generalization in Low-data Regimes: In medical multimodal retrieval, shared encoder architectures deliver substantial improvements as training data decreases, with relative recall@200 improvement of +81% at 0.66 M samples versus a modality-specific baseline (Roy et al., 3 Mar 2025).
- Cross-task/variable Transfer: In 1EMD, joint cross-variable climate downscaling achieved lower MSE and MAE and higher SSIM across all variables, showing the efficacy of a jointly learned spatial representation (Merizzi et al., 12 Jun 2025).
- Regularization and Overfitting Reduction: Emotion-aware abusive language detection reduced false positives and improved F1 by 2-3% over single-task baselines, as the auxiliary emotion task supplied regularizing gradients (Mnassri et al., 2023).
- Portability and Adaptation: ARC-Encoder demonstrates that a single encoder can be adapted to multiple frozen LLM decoders across instruction and base LLMs with only small adapter MLPs per decoder, avoiding retraining or modification of LLM weights (Pilchen et al., 23 Oct 2025).
6. Limitations and Variations
Despite the benefits, shared encoder architectures exhibit notable trade-offs:
- Reduced Zero-shot Transfer in Modular Systems: In modular multilingual translation (language-specific encoders and decoders aligned only through a trained "interlingua"), lifelong extension to new languages is enabled, but zero-shot performance can lag the universal shared-encoder baseline by 1.4 BLEU points (Escolano et al., 2020).
- Potential for Negative Transfer: In multi-task settings, if task signals are not sufficiently aligned, shared encoders may cause negative transfer, requiring careful selection or weighting of tasks (Mnassri et al., 2023).
- Limited Specialization: Fully shared encoders may underperform dedicated ones for tasks requiring highly specialized features; lightweight per-task towers can ameliorate this in some contexts (Roy et al., 3 Mar 2025).
7. Notable Directions and Open Questions
Current shared encoder research trends encompass:
- Highly scalable multimodal encoders: Extension to vision, text, audio, and structured data, exploiting learnable modality tokens, synthetic supervision, or scheduled multi-dataset training (Chada et al., 2023, Roy et al., 3 Mar 2025).
- Fine-grained adaptation/adapters: Per-decoder or per-modality heads/MLPs enable efficient adaptation to frozen LLMs or specialized downstream tasks (Pilchen et al., 23 Oct 2025).
- Architectures supporting on-the-fly model sizing: E.g., ASR super-nets that allow rapid deployment for diverse devices by partial execution of the shared encoder (Ding et al., 2022).
- Self-supervised and distillation-based multi-task hybrids: Incorporation of distillation losses and combinations of labeled and unlabeled objectives to fill in annotation gaps (Jiang et al., 2019, Taghavi et al., 2024).
Fundamental open issues include the optimal allocation of shared vs. specialized parameters, strategies for aligning modalities and tasks with maximally positive transfer, and mitigating interference when tasks are only weakly coupled.
References
- (Švec et al., 2022) Transformer-based encoder-encoder architecture for Spoken Term Detection
- (Merizzi et al., 12 Jun 2025) Vision Transformers for Multi-Variable Climate Downscaling: Emulating Regional Climate Models with a Shared Encoder and Multi-Decoder Architecture
- (Rangarajan et al., 29 Jun 2025) SIEDD: Shared-Implicit Encoder with Discrete Decoders
- (Pilchen et al., 23 Oct 2025) ARC-Encoder: learning compressed text representations for LLMs
- (Escolano et al., 2020) Multilingual Machine Translation: Closing the Gap between Shared and Language-specific Encoder-Decoders
- (Tang et al., 2018) Improving Multilingual Semantic Textual Similarity with Shared Sentence Encoder for Low-resource Languages
- (Taghavi et al., 2024) SwinMTL: A Shared Architecture for Simultaneous Depth Estimation and Semantic Segmentation from Monocular Camera Images
- (Jiang et al., 2019) SENSE: a Shared Encoder Network for Scene-flow Estimation
- (Chada et al., 2023) MoMo: A shared encoder Model for text, image and multi-Modal representations
- (Roy et al., 3 Mar 2025) A Shared Encoder Approach to Multimodal Representation Learning
- (Ding et al., 2022) A Unified Cascaded Encoder ASR Model for Dynamic Model Sizes
- (Lopes et al., 2019) Unbabel's Submission to the WMT2019 APE Shared Task: BERT-based Encoder-Decoder for Automatic Post-Editing
- (Mnassri et al., 2023) Hate Speech and Offensive Language Detection using an Emotion-aware Shared Encoder
- (Li et al., 2020) Pixel-Semantic Revise of Position Learning A One-Stage Object Detector with A Shared Encoder-Decoder