MobileViM: Efficient Lightweight Vision
- MobileViM is a family of lightweight neural modules designed for resource-efficient vision tasks such as human pose estimation and 3D medical segmentation.
- The MobileViM Block fuses local convolutional processing with a non-self-attention global modeling mechanism, reducing parameters and FLOPs while improving predictive accuracy.
- Its dimension-independent design and cross-scale bridging enable high-throughput volumetric segmentation with significant speed and Dice score improvements.
MobileViM refers to a family of lightweight, efficient neural network modules and architectures targeting vision-related tasks under resource or real-time constraints. The term currently appears in two distinct research strands: (1) a block-level module for global modeling in lightweight single-branch convolutional structures (notably within the LGM-Pose system for human pose estimation) (Guo et al., 5 Jun 2025), and (2) a dimension-independent Mamba-based framework for volumetric 3D medical image analysis designed for high-throughput segmentation (Dai et al., 19 Feb 2025). Both usages emphasize parameter and compute efficiency, as well as competitive or superior predictive performance compared to established lightweight baselines.
1. MobileViM Block in Lightweight Global Modeling Networks
Within the LGM-Pose architecture, the MobileViM Block is introduced to address the fundamental difficulty of modeling complex spatial dependencies in lightweight convolutional networks, which typically rely on multi-branch parallelism and struggle to capture global context efficiently. The MobileViM Block integrates global and local information at negligible parameter and FLOP cost by combining standard convolutions with a non-self-attention-based global modeling submodule (Guo et al., 5 Jun 2025).
Structural Overview
The MobileViM Block sequentially applies:
- A local convolution (capturing spatial locality), followed by batch normalization and an activation.
- A convolution that lifts the channel dimension.
- The Lightweight Attentional Representation Module (LARM), which provides global context integration using only MLPs and non-parametric tensor rearrangements (without dot-product attention).
- A convolution projecting back to the original channel count.
- Residual addition of the block input and output.
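The sequential structure above can be sketched as a toy NumPy forward pass. The 3×3/1×1 kernel sizes, the ReLU activation, and the omission of batch normalization are illustrative assumptions (the text does not specify them), and the global-modeling submodule is passed in as a callable placeholder:

```python
import numpy as np

def conv3x3(x, w):
    # naive same-padding 3x3 convolution; x: (H, W, Cin), w: (3, 3, Cin, Cout)
    H, W, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((H, W, w.shape[-1]))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.einsum('abc,abcd->d', xp[i:i + 3, j:j + 3], w)
    return out

def conv1x1(x, w):
    # pointwise convolution = per-pixel channel mixing; x: (H, W, Cin), w: (Cin, Cout)
    return x @ w

def relu(x):
    return np.maximum(x, 0)

def mobilevim_block(x, params, larm):
    # sequential structure from the text: local conv + norm/act (norm omitted here),
    # channel-lift conv, LARM global mixing, projection conv, residual addition
    h = relu(conv3x3(x, params['w_local']))   # local spatial features
    h = relu(conv1x1(h, params['w_lift']))    # lift channel dimension C -> D
    h = larm(h)                               # global context (LARM, or any callable)
    h = conv1x1(h, params['w_proj'])          # project back D -> C channels
    return x + h                              # residual connection
```

A production implementation would use a deep-learning framework (the papers report PyTorch); this sketch only mirrors the data flow.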
This module is interleaved with MobileNetV2 bottleneck blocks for downsampling in the LGM-Pose network and is followed by decoder, fusion, and prediction stages.
2. Lightweight Attentional Representation Module (LARM) and the Non-Parametric Transformation Operation
LARM replaces traditional attention with two parallel MLPs that exchange information both across (inter-patch) and within (intra-patch) patches by pure permutation and reshape operations.
Given input $X \in \mathbb{R}^{H \times W \times C}$:
- Patch Decomposition (NPT-Op.1): Reshape $X$ into $X_P \in \mathbb{R}^{P \times N \times C}$, where $P$ is the patch area and $N$ is the number of patches.
- Inter-patch Interaction: For each pixel position $p$, $1 \le p \le P$, apply a two-layer MLP along the $N$ dimension.
- Permutation (NPT-Op.2): Permute $X_P$ to shape $N \times P \times C$.
- Intra-patch Interaction: For each patch $n$, $1 \le n \le N$, apply another MLP along the $P$ dimension.
- Fold Back (NPT-Op.3): Reshape and permute back to $H \times W \times C$.
The result approximates global self-attention with no quadratic computational overhead and near-zero additional parameters.
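The three NPT operations and two MLPs can be sketched in NumPy. Only the tensor rearrangements follow the steps above; the MLP weight shapes and the ReLU nonlinearity are assumptions for illustration:

```python
import numpy as np

def mlp(x, w1, w2):
    # two-layer MLP applied along the last axis (ReLU assumed)
    return np.maximum(x @ w1, 0) @ w2

def larm(x, h, w, w_inter, w_intra):
    # x: (H, W, C); h, w: patch height/width, so P = h*w pixels per patch
    H, W, C = x.shape
    P, N = h * w, (H // h) * (W // w)
    # NPT-Op.1: decompose into patches -> (P, N, C)
    xp = (x.reshape(H // h, h, W // w, w, C)
            .transpose(1, 3, 0, 2, 4)
            .reshape(P, N, C))
    # inter-patch interaction: MLP mixes information along the N axis
    xp = mlp(xp.transpose(0, 2, 1), *w_inter).transpose(0, 2, 1)
    # NPT-Op.2: permute (P, N, C) -> (N, P, C)
    xp = xp.transpose(1, 0, 2)
    # intra-patch interaction: MLP mixes information along the P axis
    xp = mlp(xp.transpose(0, 2, 1), *w_intra).transpose(0, 2, 1)
    # NPT-Op.3: fold back to (H, W, C)
    xp = xp.transpose(1, 0, 2)  # back to (P, N, C)
    return (xp.reshape(h, w, H // h, W // w, C)
              .transpose(2, 0, 3, 1, 4)
              .reshape(H, W, C))
```

Note that all global mixing happens through `transpose`/`reshape` plus plain matrix multiplies; no pairwise attention scores are ever materialized, which is where the quadratic cost of self-attention is avoided.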
3. Quantitative Efficiency and Performance
Empirical evaluation on human pose estimation tasks demonstrates substantial computational advantages. For instance, the MobileViM Block in LGM-Pose achieves:
| Block | Params (M) | GFLOPs | [email protected] (%) | FPS-CPU | FPS-GPU |
|---|---|---|---|---|---|
| MobileViT Block [18] | 2.5 | 1.4 | 87.9 | 20.1 | 69.4 |
| MobileViM Block (ours) | 1.1 | 0.9 | 88.4 | 24.4 | 79.9 |
This configuration reduces parameters by more than 50% and FLOPs by 35%, with a 0.5 percentage point accuracy gain; runtimes improve by 20–30% on both CPU and GPU (Guo et al., 5 Jun 2025). When deployed in the full LGM-Pose network, the backbone with MobileViM achieves 68.6–71.0 AP on COCO and 88.4 PCKh on MPII, outperforming MobileNetV2 and ShuffleNetV2 with only ~12% of their parameter count.
4. Dimension-Independent Vision Mamba for 3D Medical Image Analysis
A separate line of work leverages the Vision Mamba (ViMamba) paradigm, combining SSM-based Mamba modules with convolutional layers for volumetric segmentation. The MobileViM architecture introduces dimension-independent mechanisms ("Dimin"), dual-direction traversing, and cross-scale bridging, facilitating tractable and high-velocity volumetric modeling (Dai et al., 19 Feb 2025).
Major Technical Components
- Dimension-Independent Mechanism (Dimin): Each 3D input volume is split into patches, flattened, and processed along each axis separately. Scanning per-axis sequences rather than a single flattened volume-length sequence sharply reduces the SSM's sequence lengths, enabling practical processing of gigavoxel medical images.
- Dual-Direction Mamba: Bidirectional filtering along each patch sequence via Mamba modules enhances contextual encoding, especially at anatomical boundaries.
- Cross-Scale Bridging: Early-stage high-resolution features are up-sampled and injected into deeper layers, offsetting compression artifacts and preserving detail.
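The per-axis traversal and dual-direction scanning can be sketched as follows. A toy scalar linear recurrence stands in for Mamba's selective SSM, and summing the three per-axis results is an assumed fusion rule; channel dimensions and cross-scale bridging are omitted for brevity:

```python
import numpy as np

def ssm_scan(x, a=0.9):
    # toy 1D state-space recurrence h_t = a*h_{t-1} + x_t along the last axis
    # (stand-in for a Mamba module; 'a' is an arbitrary decay constant)
    h = np.zeros_like(x[..., 0])
    out = np.empty_like(x)
    for t in range(x.shape[-1]):
        h = a * h + x[..., t]
        out[..., t] = h
    return out

def dual_direction(x):
    # dual-direction filtering: forward scan plus a reversed scan flipped back
    return ssm_scan(x) + ssm_scan(x[..., ::-1])[..., ::-1]

def dimin(volume):
    # dimension-independent traversal: scan along each axis of a (D, H, W)
    # volume separately, instead of flattening it into one D*H*W sequence
    d = dual_direction(volume.transpose(1, 2, 0)).transpose(2, 0, 1)  # along D
    h = dual_direction(volume.transpose(0, 2, 1)).transpose(0, 2, 1)  # along H
    w = dual_direction(volume)                                        # along W
    return d + h + w  # fusion by summation is an assumption of this sketch
```

The efficiency argument is visible in the sketch: each scan runs over sequences of length D, H, or W, never over the full D·H·W voxel sequence.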
5. Empirical Benchmarks and Deployment
Performance on 3D medical image datasets demonstrates MobileViM's efficiency:
| Model | Params (M) | MACs (B) | FPS | Dice PENGWIN | Dice BraTS2024 | Dice ATLAS | Dice ToothFairy2 |
|---|---|---|---|---|---|---|---|
| MobileViM_s | 6.29 | 195.6 | 91 | 92.72% | 86.69% | 80.46% | 77.43% |
| MobileViM_xs | 2.89 | 131.6 | 94 | 89.97% | 86.18% | 79.65% | 75.54% |
Compared to seven state-of-the-art methods, MobileViM_s is over 20 FPS faster, with Dice improvements up to +7% over SegMamba on BraTS2024. Training and inference are performed in PyTorch on an NVIDIA RTX 4090. The cross-scale bridging and Dimin mechanism jointly contribute to the observed gains in both speed and segmentation quality (Dai et al., 19 Feb 2025).
6. Limitations and Future Directions
MobileViM, in both its forms, relies on sufficient data to prevent bias in learned modules, particularly in Mamba-based architectures where training data size can impact generalization. The current dual-scale design may be less suitable under extreme hardware limitations, and future variants may explore further adaptive or quantized schemas (Q-Mamba). Research directions include the development of Foundation Mamba models for transfer learning on large unlabeled medical corpora and block-allocation strategies to optimize accuracy–latency trade-offs per instance (Dai et al., 19 Feb 2025).
A plausible implication is that MobileViM-style modules, which combine global context modeling, resource regularization, and deployment-friendly design, will become central components in efficient, domain-adapted real-time vision systems where model compactness and latency are imperative.