Sharing ViT (SV) Design
- Sharing ViT Design is a family of methods that unify and reuse transformer components, such as backbones and attention maps, across tasks to reduce redundancy.
- It employs techniques such as token pooling, dynamic routing, and adapter modules to enhance computational efficiency and performance.
- The design improves accuracy and robustness in applications such as video instance segmentation by sharing features between tasks, for example depth estimation and instance segmentation.
The Sharing ViT (SV) Design refers to a family of architectural and methodological strategies for Vision Transformers where network components—such as feature extraction backbones, attention computations, or token mechanisms—are expressly reused, unified, or shared across tasks, inference configurations, or modalities. Originally arising from the need to address redundancies, inefficiencies, and the scaling challenges of deep transformers, SV design schemes are increasingly adopted to improve computational efficiency, representation robustness, transferability, and resource adaptation in a broad array of computer vision applications.
1. Architectural Principles of Sharing ViT
The concept of a "shared" Vision Transformer backbone is anchored in two observations: (1) that many transformer model computations—especially in attention, patch embedding, and intermediate feature encoding—encode broadly useful representations beyond any single task; (2) that different vision tasks, or even different tasks and modalities, can benefit from early or mid-level features extracted by a well-trained or pre-trained ViT. In practice, SV architectures implement this sharing at several levels:
- Unified Backbone Usage: A single ViT (or ViT-derived) backbone is used to extract features for multiple tasks or branches, such as segmentation and monocular depth estimation (Niu et al., 8 Jul 2025), or across downstream branches for detection, retrieval, or video processing.
- Parameter and Computation Sharing: Attention weights, token embeddings, or intermediate representations (such as multi-scale feature maps) are explicitly reused or "stitched" between different branches and inference heads, minimizing redundancy and aligning representational spaces.
Such shared designs may freeze the backbone (to prevent overfitting or loss of generalization), use dedicated adapters for each task head, or dynamically allocate compute via expert selection, width slicing, or token pooling.
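The pattern above can be made concrete with a minimal PyTorch sketch, given below. The tiny encoder, dimensions, and names (`SharedViTBackbone`, `seg_head`, `depth_head`) are illustrative placeholders chosen for this example, not components of any specific published model.

```python
import torch
import torch.nn as nn

class SharedViTBackbone(nn.Module):
    """Toy ViT-style encoder whose features are shared across task heads."""
    def __init__(self, img_size=224, patch=16, dim=192, depth=4, heads=3):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        num_patches = (img_size // patch) ** 2
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        return self.encoder(tokens + self.pos_embed)

class MultiTaskViT(nn.Module):
    """One frozen backbone feeding several lightweight task heads."""
    def __init__(self, dim=192, num_classes=10):
        super().__init__()
        self.backbone = SharedViTBackbone(dim=dim)
        for p in self.backbone.parameters():      # freeze the shared weights
            p.requires_grad = False
        self.seg_head = nn.Linear(dim, num_classes)  # per-token class logits
        self.depth_head = nn.Linear(dim, 1)          # per-token depth value

    def forward(self, x):
        feats = self.backbone(x)   # computed once, reused by all heads
        return self.seg_head(feats), self.depth_head(feats)

model = MultiTaskViT()
seg_logits, depth = model(torch.randn(2, 3, 224, 224))
print(seg_logits.shape, depth.shape)  # torch.Size([2, 196, 10]) torch.Size([2, 196, 1])
```

Because the backbone is frozen, only the lightweight heads receive gradients, mirroring the freezing strategy described above.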
2. Technical Mechanisms for Sharing
SV architectures achieve efficient sharing using several technical innovations:
- Shared Frozen Backbones: As in the robust video instance segmentation work, a single pre-trained ViT (e.g., ViT-L with DINOv2 pretraining) is reused for both the depth estimation and instance segmentation branches. Depth and RGB-based features pass through the same backbone, aligning geometric and appearance cues at every layer without duplication of heavy computation (Niu et al., 8 Jul 2025).
- Token Pooling and Attention Sharing: Methods like PSViT dynamically adjust the number of active tokens and reuse attention maps across layers, reducing computation by exploiting redundancy in spatial encoding and attention similarity between adjacent layers (Chen et al., 2021).
- Mixture-of-Experts and Conditional Routing: SV designs may embed sparse Mixture-of-Experts (MoE) modules within the transformer blocks, with only task-specific experts activated per branch or inference context. This yields substantial FLOPs reductions and enables efficient multi-task deployment (Liang et al., 2022); a minimal routing sketch follows this list.
- Subnetwork Slicing and Stitching: Frameworks such as SN-Netv2 and Scala introduce dynamic sharing by slicing across width (channels) or depth (blocks), or by "stitching" between pre-trained networks of different sizes. This allows one "supernet" to represent a continuum of efficiency/performance variants, which can be selected or instantiated on-the-fly, supporting flexible deployment under resource constraints (Zhang et al., 6 Dec 2024, Pan et al., 2023).
- Micro-level Feature Fusion: In multi-modal or multi-task scenarios, downstream adapters or task-specific heads ingest shared ViT features along with task-targeted inputs (e.g., combining a computed depth map with RGB features for segmentation (Niu et al., 8 Jul 2025)).
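As referenced in the Mixture-of-Experts item above, the sketch below hard-routes each forward pass to a single task-specific expert MLP. The expert count, dimensions, and gating by an explicit task id are assumptions made for illustration and do not reproduce the routing scheme of any particular paper.

```python
import torch
import torch.nn as nn

class TaskRoutedMoE(nn.Module):
    """Feed-forward block in which only the expert assigned to the current
    task runs, so per-inference FLOPs stay close to a single-expert model
    while the parameter pool covers several tasks."""
    def __init__(self, dim=192, hidden=768, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        ])

    def forward(self, tokens, task_id):
        # Hard routing: select one expert per task / inference context.
        return tokens + self.experts[task_id](tokens)

block = TaskRoutedMoE()
tokens = torch.randn(2, 196, 192)
seg_out = block(tokens, task_id=0)    # expert used by the segmentation branch
depth_out = block(tokens, task_id=1)  # expert used by the depth branch
print(seg_out.shape, depth_out.shape)
```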
3. Performance, Robustness, and Efficiency
The hallmark of SV designs is their ability to deliver improved accuracy, increased robustness, and greater computational and memory efficiency:
- Improved Robustness in Perceptual Tasks: On video instance segmentation benchmarks (e.g., OVIS), jointly sharing ViT backbone features between geometric (depth) and appearance (RGB) tasks leads to more robust tracking and segmentation, especially under occlusion and motion blur. For instance, a shared ViT backbone improved AP from 49.6 to 50.1 and up to 55.8 AP when combined with additional refiners, compared to non-sharing baselines (Niu et al., 8 Jul 2025).
- Resource Efficiency: Sharing allows model weights and activations to be reused for multiple heads, yielding substantial savings in inference time and memory. For example, MoE-based sharing yields up to 88% reduction in FLOPs for single-task inference in multi-task models (Liang et al., 2022), while quantization-aware sharing (QSViT) achieves more than 20% memory and power savings with negligible accuracy drop (Putra et al., 1 Apr 2025).
- Deployment Flexibility: Architectures supporting weight slicing and dynamic routing can be "scaled" to meet runtime constraints—enabling a single pre-trained ViT to represent a spectrum of smaller (or faster) submodels without retraining (Zhang et al., 6 Dec 2024, Pan et al., 2023).
- Cross-task/Domain Transfer: Sharing design supports transfer learning, enabling pre-trained ViTs to be adapted efficiently for new tasks, including few-shot or privacy-preserving scenarios (e.g., split adaptation with protected, noise-perturbed front-ends (Wang et al., 1 Mar 2025)).
4. Mathematical Formulation of SV Feature Sharing
Underlying most SV designs is the principle of a common feature function $F = f_{\theta}(x)$, where the shared feature map $F$ is used as input to multiple task-specific heads, such as $\hat{D} = h_{\text{depth}}(F)$ and $\hat{M} = h_{\text{seg}}(F)$ for depth and segmentation (Niu et al., 8 Jul 2025). When fusing geometric and appearance information, the concatenation of the tasks' outputs becomes $F' = [\,F;\ \hat{D}\,]$, which is fed to the relevant downstream head, leveraging the shared embedding space.
In computational sharing, stochastic selection or slicing determines which subset of weights or experts to activate per inference instance, e.g., $W^{(r)} = W_{[\,:\,\lceil r \cdot C \rceil\,]}$, where $r \in (0, 1]$ is the width ratio chosen at runtime and $C$ is the full channel width, enabling flexible model scaling (Zhang et al., 6 Dec 2024).
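A minimal sketch of this slicing rule, assuming a plain linear layer and leading-channel selection (not the exact procedure of SN-Netv2 or Scala), could look as follows.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlicableLinear(nn.Module):
    """Linear layer whose output width can be sliced at inference time.
    The full weight matrix is stored once; a width ratio r selects the
    leading ceil(r * C_out) rows, yielding a smaller submodel on demand."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x, r=1.0):
        c = math.ceil(r * self.weight.shape[0])        # active output channels
        return F.linear(x, self.weight[:c], self.bias[:c])

layer = SlicableLinear(192, 768)
x = torch.randn(2, 196, 192)
print(layer(x, r=1.0).shape)   # full width:   torch.Size([2, 196, 768])
print(layer(x, r=0.25).shape)  # sliced width: torch.Size([2, 196, 192])
```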
5. Case Study: Sharing ViT for Video Instance Segmentation
In video instance segmentation under occlusion and motion blur, sharing ViT designs leverage a pre-trained backbone for both monocular depth estimation and segmentation (Niu et al., 8 Jul 2025):
- An incoming RGB frame passes through the frozen ViT backbone, resulting in a feature tensor $F$.
- A depth prediction head uses $F$ to predict a depth map $\hat{D}$.
- The segmentation head (e.g., using ViT-Adapter) ingests both $F$ and $\hat{D}$; the combined RGBD representation enables the model to associate instance boundaries in 3D space, which is crucial for resolving ambiguities in complex video scenes.
- This sharing avoids duplicating heavy feature extractors for each modality; both depth and segmentation tasks directly benefit from the aligned, high-level features available through the unified ViT.
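A compact, self-contained sketch of this pipeline under toy assumptions is given below; a frozen linear layer stands in for the pre-trained ViT backbone, and plain linear heads stand in for the actual depth and ViT-Adapter segmentation heads.

```python
import torch
import torch.nn as nn

dim, num_tokens, num_classes = 192, 196, 10

# Stand-in for the frozen, pre-trained ViT backbone (a DINOv2 ViT-L in the
# referenced work); a single frozen linear layer is used here for brevity.
backbone = nn.Linear(dim, dim)
for p in backbone.parameters():
    p.requires_grad = False

depth_head = nn.Linear(dim, 1)              # predicts a per-token depth map D
seg_head = nn.Linear(dim + 1, num_classes)  # ingests the fused [F ; D] input

tokens = torch.randn(2, num_tokens, dim)    # patch tokens of an RGB frame

F_shared = backbone(tokens)                 # shared features F, computed once
D = depth_head(F_shared)                    # depth predicted from the same features
fused = torch.cat([F_shared, D], dim=-1)    # concatenate appearance and geometry
seg_logits = seg_head(fused)                # segmentation consumes the RGBD-style input
print(D.shape, seg_logits.shape)            # torch.Size([2, 196, 1]) torch.Size([2, 196, 10])
```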
This approach yields measurable improvements, providing consistent AP gains over non-shared backbones; notably, depth supervision used only as an auxiliary loss yields marginal value, which underscores the importance of direct feature sharing for robust scene understanding.
6. Practical Implementation and Deployment Considerations
Implementing an SV design requires attention to several deployment aspects:
- Frozen vs. Fine-tuned Backbones: In scenarios where the shared backbone is frozen (as in depth+segmentation), computational savings are maximized, and representational drift is avoided. Fine-tuning only adapters or task heads preserves generalization while leveraging shared features.
- Task Head Modularity: Downstream branches must be adapted to utilize multi-modal features or coordinate with sampling/routing strategies for efficient sharing. Adapter modules are commonly used to bridge between generic and task-specific representation needs.
- Quantization and Precision Tuning: Sharing can be further optimized for embedded or resource-constrained environments by applying per-layer quantization (e.g., QSViT), reducing memory and power usage with minimal accuracy loss. However, sensitivity analysis is advised to avoid over-quantizing critical layers (Putra et al., 1 Apr 2025); a minimal sketch follows this list.
- Multi-device and Distributed Training: Horizontally scalable designs (e.g., HSViT) distribute feature extraction and attention computation across multiple devices, aggregating predictions via voting or averaging. This supports collaborative/federated deployment (Xu et al., 8 Apr 2024).
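To illustrate the per-layer sensitivity analysis mentioned in the quantization item above, the sketch below applies symmetric 8-bit fake quantization to one layer at a time and measures the resulting output deviation. The bit-width, rounding scheme, and deviation proxy are illustrative assumptions, not the QSViT procedure itself.

```python
import torch
import torch.nn as nn

def fake_quantize_(module, bits=8):
    """Symmetric per-tensor weight quantization applied in place.
    Quantizing one layer at a time lets the downstream output (or accuracy)
    drop be attributed to that layer, i.e., a simple sensitivity analysis."""
    qmax = 2 ** (bits - 1) - 1
    with torch.no_grad():
        for p in module.parameters():
            scale = p.abs().max().clamp(min=1e-8) / qmax
            p.copy_((p / scale).round().clamp(-qmax, qmax) * scale)
    return module

# Example: quantize only the first layer of a toy head and compare its output
# against the full-precision reference.
head = nn.Sequential(nn.Linear(192, 768), nn.GELU(), nn.Linear(768, 10))
x = torch.randn(4, 192)
ref = head(x)

fake_quantize_(head[0], bits=8)            # quantize the first layer's weights
err = (head(x) - ref).abs().mean().item()  # proxy for that layer's sensitivity
print(f"mean output deviation after 8-bit weights: {err:.6f}")
```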
7. Challenges and Future Directions
Challenges in applying SV designs include:
- Managing Interference: Weight sharing can cause negative transfer if the tasks are not sufficiently aligned or if one task dominates gradients. Mitigation strategies include isolated activation for the smallest subnets and progressive knowledge transfer among sliced variants (Zhang et al., 6 Dec 2024).
- Determining Sharing Granularity: The optimal degree of parameter sharing (at the token, block, or head level) and the appropriate bounds for adaptive depth/width slicing remain open questions, especially as hardware and use cases diversify.
- Transfer and Robustness: While SV models show strong promise across classification, detection, and segmentation, broader validation across modalities (e.g., text, audio), task distributions, and adversarial settings remains an active area of research.
A plausible implication is that as SV designs mature, incorporating advanced quantization, more sophisticated mixture-of-experts routing, and robust privacy-preservation mechanisms, they could provide a universal, adaptive backbone for multi-task, multi-modal vision-language models, bridging the gap between efficient deployment and high-capacity representation.
In summary, Sharing ViT designs represent a broad set of architectural and methodological patterns for unifying, reusing, and efficiently scaling Vision Transformers across tasks, inference scales, and resource budgets. Through strategies such as backbone sharing, token pooling, dynamic expert activation, pipeline slicing, and quantization, SV approaches deliver measurable gains in efficiency, robustness, flexibility, and practical deployability in demanding real-world computer vision applications.