Domain-Adapted MobileNetV2 & V3
- Domain-adapted MobileNetV2 and MobileNetV3 are lightweight CNN architectures optimized for cross-domain visual tasks using transfer learning, quantization, and adaptive architecture search.
- They employ strategies like early-exit adaptation, selective fine-tuning, and width multiplier adjustments to balance detection performance and efficiency in diverse environments.
- Unsupervised adversarial frameworks and multi-path NAS enable these models to achieve state-of-the-art accuracy while lowering computational cost and energy consumption on resource-constrained hardware.
Domain-adapted MobileNetV2 and MobileNetV3 architectures extend baseline MobileNets to address the specific challenges of cross-domain deployment: data scarcity, resource constraints, and diverse visual recognition tasks. They do so through targeted transfer learning, pipeline-level domain adaptation, multi-path network design, unsupervised adversarial frameworks, and post-training quantization, achieving state-of-the-art efficiency and enabling practical inference on device and in real-time settings.
1. Architectural Modifications and Pipeline Adaptation
Domain adaptation of MobileNetV2 and MobileNetV3 is accomplished through minimal architectural changes tailored to preserve lightweight properties. For MobileNetV2, Narduzzi et al. (Narduzzi et al., 2022) implemented three complementary strategies for a face detection task (a minimal Keras sketch follows the list):
- Early-Exit Adaptation: The canonical classification head is truncated at selected backbone layers (“breakpoints”), and replaced with a fully convolutional head for bounding-box regression and object presence classification. Three output strategies (OutA, OutB, OutC) attach detection heads at layers with larger spatial resolutions (e.g., 32×32), preserving fine-scale features for improved localization of small faces.
- Freezing and Fine-Tuning: To retain generalizable image features, only higher layers are fine-tuned on the detection task, with earlier blocks frozen (e.g., freezing layers ≤98 yielded optimal trade-offs).
- Width Multiplier Tuning: The width multiplier α scales network capacity: even α=0.5 yields competitive performance at a fraction of the parameters (0.839 AP, 0.42M params, 1.6 MB).
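The sketch below illustrates all three strategies in Keras. It is a hedged reconstruction, not the authors' code: the breakpoint layer (block_6_expand_relu), head widths, and freeze index are illustrative choices, and at the 224×224 input used here the breakpoint yields 28×28 feature maps (the paper's 32×32 maps correspond to a larger input resolution).

```python
# Sketch of early-exit adaptation, layer freezing, and width-multiplier tuning
# on MobileNetV2. Breakpoint layer, head widths, and freeze index are
# illustrative assumptions, not the paper's exact configuration.
import tensorflow as tf

ALPHA = 0.5        # width multiplier α (paper reports 0.839 AP at α=0.5)
FREEZE_UPTO = 98   # freeze layers up to this index (paper: layers ≤ 98)

backbone = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), alpha=ALPHA, include_top=False,
    weights="imagenet")

# Early exit: truncate at an intermediate block whose feature map is still
# spatially large (28×28 here), preserving fine-scale detail for small faces.
breakpoint_out = backbone.get_layer("block_6_expand_relu").output

# Fully convolutional head replacing the classification top: per-cell
# bounding-box regression (4 channels) and object-presence score (1 channel).
x = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(breakpoint_out)
boxes = tf.keras.layers.Conv2D(4, 1, name="bbox_regression")(x)
presence = tf.keras.layers.Conv2D(1, 1, activation="sigmoid", name="objectness")(x)

detector = tf.keras.Model(backbone.input, [boxes, presence])

# Selective fine-tuning: keep early, generic feature extractors frozen.
for layer in detector.layers[:FREEZE_UPTO]:
    layer.trainable = False
```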
For ensemble and multi-domain learning, Islam et al. (Islam et al., 15 Dec 2025) employed a combination of MobileNetV2, MobileNetV3-Small, and MobileNetV3-Large, all fine-tuned on a non-target subset of the domain and repurposed as expert feature extractors for few-shot recognition. The architecture preserves depthwise separable inverted residuals, employing ReLU6 activations in MobileNetV2 and Hard-Swish plus Squeeze-and-Excitation in MobileNetV3.
2. Post-Training Quantization and Hardware Deployment
Deployment to resource-constrained environments requires aggressive quantization and hardware-aware adaptations:
- Uniform Q-format Quantization (Narduzzi et al., 2022): To ensure integer-only inference, both weights and activations are quantized post-training in a uniform, symmetric fashion with a single scale per tensor (per-layer calibration); a sketch follows this list. Experiments established that Q9 and Q8 (8–9 bit) quantization causes negligible degradation (<0.5% AP drop), while Q7 or below induces pronounced collapse (AP ≪ baseline).
- Export Pipelines: Models are exported through TensorFlow Lite (8-bit ops) and converted to hardware-specific formats (e.g., Kendryte KModel) for execution on 8-bit integer DPUs; a converter sketch appears below.
- Efficiency Metrics: On a Kendryte K210 SoC, the face detection pipeline achieves ~8 FPS at 0.8–1 W, with model sizes as low as 1.6 MB (α=0.5).
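As a concrete illustration of the per-tensor, symmetric Q-format scheme, the NumPy sketch below picks a single power-of-two scale per tensor from its maximum magnitude and quantizes to a given bit-width. The format-selection rule is one common choice, not necessarily the paper's exact calibration.

```python
# Uniform, symmetric fixed-point (Q-format) quantization with one
# power-of-two scale per tensor. Format selection is an illustrative choice.
import math
import numpy as np

def q_format_quantize(x: np.ndarray, total_bits: int):
    qmax = 2 ** (total_bits - 1) - 1                  # e.g. 127 for 8 bits
    max_abs = float(np.abs(x).max())
    int_bits = max(0, math.ceil(math.log2(max_abs)))  # integer bits to cover range
    frac_bits = total_bits - 1 - int_bits             # 1 sign bit + int + frac
    scale = 2.0 ** (-frac_bits)                       # power-of-two scale
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Reconstruction error grows as bits shrink, mirroring the reported
# accuracy cliff below 8 bits.
w = np.random.randn(64, 3, 3, 3).astype(np.float32) * 0.1
for bits in (9, 8, 7):
    q, s = q_format_quantize(w, bits)
    err = np.abs(dequantize(q, s) - w).max()
    print(f"Q{bits}: max abs reconstruction error = {err:.5f}")
```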
These procedures are directly transferable to MobileNetV3, with necessary adjustments for its modified block structure and activation functions.
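For the export step, a hedged sketch of the standard TensorFlow Lite post-training int8 path is shown below. The model and calibration generator are placeholders, and the final conversion to Kendryte's KModel format is performed by vendor tooling (not shown).

```python
# Post-training int8 export via TensorFlow Lite. Model and calibration data
# are placeholders; KModel conversion for the K210 is done by vendor tools.
import numpy as np
import tensorflow as tf

model = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), alpha=0.5, weights=None)

def representative_data():
    # Calibration samples; replace with real preprocessed images.
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype("float32")]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8   # integer-only inference
converter.inference_output_type = tf.int8

with open("detector_int8.tflite", "wb") as f:
    f.write(converter.convert())
```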
3. Domain Adaptation via Supervised Transfer and Fusion
Supervised domain adaptation is accomplished through ImageNet pretraining, followed by fine-tuning on a relevant subset:
- Full-network Fine-Tuning (Islam et al., 15 Dec 2025): In resource-efficient few-shot classification, all backbone layers are unfrozen and trained on domain-specific data (e.g., PlantVillage non-target classes), then evaluated on true target domains (e.g., unseen tomato or rice disease classes).
- Feature-Level Fusion: Outputs from MobileNetV2 and the two MobileNetV3 variants yield feature vectors (1280-dimensional for V2 and V3-Large, 576 for V3-Small), concatenated into a single representation that is passed, without alignment layers, to a Bi-LSTM classifier. This approach leverages domain-differentiated representations and handles domain shift robustly via feature aggregation.
Quantitative benchmarks show that three-way MobileNet fusion with Bi-LSTM (+Attention) achieves near-SOTA accuracy (98.23 ± 0.33% at 15-shot) on PlantVillage while remaining an order of magnitude more efficient (1.12 GFLOPs, ~40 MB) than transformer ensembles.
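A hedged sketch of this fusion head follows. The 1280/1280/576 embedding widths come from the text; how the 3136-dimensional concatenation is framed as a sequence for the Bi-LSTM is not specified here, so the 16×196 reshape, LSTM width, and omission of the optional attention layer are illustrative assumptions.

```python
# Feature-level fusion of three MobileNet embeddings into a Bi-LSTM
# classifier. Sequence framing (16 × 196) and LSTM width are assumptions.
import tensorflow as tf

def fusion_bilstm_head(num_classes: int) -> tf.keras.Model:
    f_v2 = tf.keras.Input((1280,), name="mobilenetv2_feat")
    f_l3 = tf.keras.Input((1280,), name="mobilenetv3_large_feat")
    f_s3 = tf.keras.Input((576,), name="mobilenetv3_small_feat")

    # Concatenate directly, without alignment layers (1280+1280+576 = 3136).
    concat = tf.keras.layers.Concatenate()([f_v2, f_l3, f_s3])

    # Frame the fused vector as a 16-step sequence of 196-d "tokens".
    seq = tf.keras.layers.Reshape((16, 196))(concat)

    x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128))(seq)
    out = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model([f_v2, f_l3, f_s3], out)

head = fusion_bilstm_head(num_classes=10)   # e.g. 10 tomato disease classes
head.summary()
```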
4. Multi-Domain NAS and Adaptive Path Selection
Multi-path neural architecture search has been applied to simultaneously optimize MobileNetV2/V3 backbones for multiple visual domains (Wang et al., 2020):
- Super-network Construction: The search space spans MBConv blocks with various kernel sizes, expansion ratios, SE module usage, activation types, and output channels. The result is a DAG in which competing candidate operations coexist at each block.
- Per-Domain RL Controllers: Domain-specific RNN-based “controllers” select block-wise architectures, balancing accuracy and compute via a reward of the form R(m) = Acc(m) · [FLOPs(m)/T]^w, where T is the compute target and w < 0 penalizes heavy models (see the sketch after this list).
- Adaptive Balanced Domain Prioritization (ABDP): Gradient contributions from harder (lower-accuracy) domains are adaptively amplified, with the amplification strength decaying over the course of the search so that updates relax toward uniform weighting.
- Parameter Sharing: Nodes (blocks) sampled by multiple domains share weights, while domain-specific blocks remain private, facilitating both positive transfer and avoidance of negative interference.
- Empirical Gains: On the Visual Decathlon benchmark, this approach yields a 1.9% absolute gain in mean accuracy over naively bundled single-domain models, with ~78% reduction in parameters and 32% lower compute cost.
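The sketch below illustrates the two quantities above under stated assumptions: a MnasNet-style accuracy/compute reward for the controllers, and a simple exponential prioritization weight for ABDP whose strength β decays toward 0 (uniform weighting) over the search. The exact functional forms in (Wang et al., 2020) may differ.

```python
# Hedged sketches of a controller reward and adaptive domain prioritization.
import math

def nas_reward(acc: float, flops: float, target_flops: float, w: float = -0.07) -> float:
    """MnasNet-style trade-off: reward = acc * (FLOPs / target)^w, with w < 0
    penalizing architectures that exceed the compute budget."""
    return acc * (flops / target_flops) ** w

def domain_weights(val_accs: dict, beta: float) -> dict:
    """Amplify gradient contributions from harder (lower-accuracy) domains.
    As beta decays toward 0 over the search, weights relax toward uniform."""
    scores = {d: math.exp(beta * (1.0 - a)) for d, a in val_accs.items()}
    mean = sum(scores.values()) / len(scores)
    return {d: s / mean for d, s in scores.items()}   # normalized to mean 1

print(nas_reward(acc=0.74, flops=300e6, target_flops=250e6))
print(domain_weights({"imagenet": 0.76, "aircraft": 0.55, "dtd": 0.62}, beta=4.0))
```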
5. Unsupervised Domain Adaptation Using Adversarial and Cycle Consistency Losses
MobileNets have been successfully adapted through UDA frameworks integrating cycle-consistent GANs and feature-level adversaries (Toldo et al., 2020):
- Cycle-Consistency GANs: Generators G_{S→T} and G_{T→S} translate between source and target images, with dual discriminators enforcing domain realism in each direction.
- Feature-Level Discriminators: Additional discriminators, one per domain, are attached at the output of the MobileNet encoder (e.g., in DeepLab-v3+), ensuring that feature distributions are tightly aligned across domains.
- Semantic Consistency: Unlabeled target images contribute to the loss via a pseudo-labeling scheme based on the network’s own predictions, minimizing a cross-entropy term against its high-confidence pseudo-labels (see the sketch at the end of this section).
- End-to-End Training: After supervised source-only pretraining, all components are optimized jointly with a weighted sum of the adversarial, cycle-consistency, supervised, feature-level, and semantic losses. No modifications are made to core MobileNetV2 blocks beyond integration into the UDA pipeline; for MobileNetV3, the only requirement is to update channel dimensions where needed.
This pipeline delivered up to +23.6 percentage point mIoU improvement on GTA5→Cityscapes and +14.4 pp on SYNTHIA→Cityscapes over source-only baselines.
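A hedged TensorFlow sketch of the pseudo-labeling term follows; the confidence threshold and masking rule are illustrative choices rather than the paper's exact formulation.

```python
# Semantic-consistency (self-training) loss on unlabeled target images:
# pseudo-label confident predictions and penalize disagreement with them.
# The 0.9 threshold is an illustrative assumption.
import tensorflow as tf

def semantic_consistency_loss(logits_target: tf.Tensor,
                              conf_thresh: float = 0.9) -> tf.Tensor:
    # logits_target: (B, H, W, C) segmentation logits for target images.
    probs = tf.nn.softmax(logits_target, axis=-1)
    conf = tf.reduce_max(probs, axis=-1)          # per-pixel confidence (B, H, W)
    pseudo = tf.argmax(probs, axis=-1)            # per-pixel pseudo-labels
    mask = tf.cast(conf > conf_thresh, tf.float32)

    # Cross-entropy against the network's own confident predictions; argmax
    # blocks gradients through the pseudo-labels themselves.
    ce = tf.keras.losses.sparse_categorical_crossentropy(
        pseudo, logits_target, from_logits=True)  # (B, H, W)
    return tf.reduce_sum(ce * mask) / tf.maximum(tf.reduce_sum(mask), 1.0)
```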
6. Ablative Analysis and Deployment Outcomes
Benchmarking and ablation indicate:
- Efficiency/Accuracy Tradeoffs: Quantization (Q9/FP16), width-multiplier tuning, and early-exit attachment yield substantial storage and power savings with <1% performance penalty (Narduzzi et al., 2022).
- Classifiers: Bi-LSTM consistently yields superior few-shot classification versus MLPs and unidirectional LSTMs; attention layers may marginally improve performance, especially under noisy real-world conditions (Islam et al., 15 Dec 2025).
- Multi-architecture Ensembles: Concatenating features from domain-adapted MobileNetV2/V3 variants achieves high robustness in both laboratory and field settings, outperforming much larger transformer-based ensembles in efficiency and, in low-data regimes, in accuracy.
Performance for selected benchmarks:
| Setting | Model(s) | Size (MB) | GFLOPs | 15-shot Accuracy |
|---|---|---|---|---|
| PlantVillage, 10 tomato classes | MobileNetV2/V3 Ensemble + BiLSTM | 40.4 | 1.12 | 98.23 ± 0.33% |
| PlantVillage, 6 apple/blueberry/cherry | MobileNetV2/V3 Ensemble + BiLSTM | 40.4 | 1.12 | 99.72 ± 0.12% |
| Dhan Shomadhan (field, 5 rice diseases) | MobileNetV2/V3 Ensemble + BiLSTM | 40.4 | 1.12 | 69.28 ± 1.49% |
7. Recommendations and Pitfalls in Practice
Best practices and cautions for domain-adapting MobileNet architectures include:
- Head-Swap and Early Exit: Attach detection or regression heads at earlier backbone stages with high spatial resolution to optimize the trade-off between compute and detection granularity.
- Layer-freezing: Restrict fine-tuning to upper layers to avoid catastrophic forgetting of pretrained representations, especially when adapting to single-class or low-shot domains.
- Uniform Quantization: Calibrate quantization per tensor, selecting the highest bit-width supported for minimal accuracy loss. Always validate on actual hardware to preempt quantization artifacts or runtime mismatches.
- Negative Transfer: Insufficient differentiation or over-sharing of backbone blocks across unrelated domains can result in reduced accuracy for hard tasks; partitioning domain-specific branches is critical in multi-path settings.
- Real-World Deployment: Strong generalization and efficiency have been documented across edge-device face detection, few-shot plant disease classification, and semantic segmentation for autonomous vehicles.
In summary, MobileNetV2 and MobileNetV3 exhibit high adaptability for diverse domain adaptation settings—supervised transfer, multi-path NAS, adversarial UDA, and hardware-constrained deployment—when coupled with careful architectural selection, quantization, and optimization strategies, enabling high-performance visual inference across multiple mobile and edge-computing scenarios (Narduzzi et al., 2022, Islam et al., 15 Dec 2025, Wang et al., 2020, Toldo et al., 2020).