Segment Any Crack (SAC) Model
- The paper demonstrates that tuning only LayerNorm parameters on a SAM backbone yields state-of-the-art crack segmentation with approximately 0.05% trainable weights.
- It employs efficient fine-tuning strategies that freeze most network weights, significantly reducing computational costs while maintaining high segmentation accuracy.
- SAC shows superior zero-shot generalization across diverse infrastructural domains, making it practical for real-world deployment in resource-constrained settings.
The Segment Any Crack (SAC) model is a segmentation framework designed to adapt vision foundation models, particularly the Segment Anything Model (SAM), for pixel-level automated crack detection in diverse civil-infrastructure imagery. SAC leverages efficient fine-tuning strategies, enabling robust segmentation with minimal labeled data and significantly reduced computational resources. This approach achieves high accuracy and generalization, notably in zero-shot crack segmentation tasks: segmenting cracks on previously unseen materials and under unfamiliar lighting conditions and structural scenarios. Performance claims, methodological innovations, and computational analyses are based strictly on published metrics and empirical findings (Rostami et al., 19 Apr 2025).
1. Model Architecture and Fine-Tuning Paradigm
SAC is derived from SAM and utilizes its core Vision Transformer (ViT) encoder, prompt encoder, and mask decoder, with modifications suited for binary crack segmentation:
- Backbone: SAC retains the ViT-Base pre-trained on the SA-1B dataset (≈90 M parameters).
- Segmentation Head: The original prompt-dependent mask decoder is replaced with a standard binary segmentation head, eliminating prompts and enabling direct segmentation outputs for crack predictions.
- Selective Parameter Tuning: Crucially, SAC freezes all SAM weights except the affine parameters (gain $\gamma$ and bias $\beta$) of every LayerNorm layer in both the encoder and decoder. Targeting the normalization components for adaptation addresses covariate shift between domains with dramatically fewer trainable parameters.
The adaptation strategy exploits the role of normalization in domain generalization: tuning only the LayerNorm affine parameters can recalibrate deep feature distributions for a new domain without altering the learned representational weights, as established in transfer-learning studies. The resulting trainable set for SAC comprises approximately 41,000 weights, about 0.05% of SAM's parameters (Rostami et al., 19 Apr 2025).
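As a concrete sketch, the freeze-all-but-LayerNorm recipe can be written in a few lines of PyTorch (an illustrative reconstruction, not the authors' code; the toy `block` below stands in for SAM's ViT):

```python
import torch.nn as nn

def freeze_all_but_layernorm(model: nn.Module) -> int:
    """Freeze every parameter except LayerNorm affine weights (gain/bias).

    Returns the number of trainable parameters that remain."""
    for p in model.parameters():
        p.requires_grad = False
    for m in model.modules():
        if isinstance(m, nn.LayerNorm):
            # LayerNorm with elementwise_affine=True carries .weight (gain)
            # and .bias, which SAC leaves trainable.
            for p in m.parameters():
                p.requires_grad = True
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Toy stand-in for a ViT block: Linear layers stay frozen, LayerNorms train.
block = nn.Sequential(nn.Linear(8, 8), nn.LayerNorm(8),
                      nn.Linear(8, 8), nn.LayerNorm(8))
n_trainable = freeze_all_but_layernorm(block)
# Each LayerNorm(8) contributes 8 + 8 = 16 trainable parameters.
```

Applied to SAM's ViT-Base, the same loop leaves only the ≈41 K LayerNorm gains and biases trainable.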
Mathematical Formulation
Let $\theta_{\mathrm{LN}} = \{\gamma_\ell, \beta_\ell\}$ denote the trainable set, where $\gamma_\ell, \beta_\ell \in \mathbb{R}^{d_\ell}$ for each LayerNorm layer $\ell$.
For an input $x \in \mathbb{R}^{d_\ell}$ to LayerNorm $\ell$, the output is
$$y = \gamma_\ell \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta_\ell,$$
where $\mu$ and $\sigma^2$ are the mean and variance of $x$ over the feature dimension and $\epsilon$ is a small constant. Only $\gamma_\ell$ and $\beta_\ell$ are updated during adaptation.
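A minimal pure-Python check of the LayerNorm computation (illustrative only): with identity affine parameters the output is standardized, and adaptation only moves the gain and bias, never the normalization itself.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm over a feature vector: y_i = gamma_i*(x_i - mu)/sqrt(var + eps) + beta_i."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [g * (v - mu) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

x = [1.0, 2.0, 3.0, 4.0]
# Identity affine parameters (gamma = 1, beta = 0): output has zero mean, unit variance.
y = layer_norm(x, [1.0] * 4, [0.0] * 4)
```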
2. Training Protocol and Loss Functions
Datasets
- OmniCrack30k: 22,158 training, 13,277 validation, 4,582 test images from 20 crack image subdomains spanning concrete, asphalt, masonry, and metal.
- Zero-shot Sets: Road420 (420 images), Facade390 (390 images), Concrete3k (3,000 images)—all annotated and resized for crack segmentation (Rostami et al., 19 Apr 2025).
Optimization
- Optimizer: AdamW with weight decay; batch size 2.
- Learning rate: cosine decay schedule.
- Loss function: hybrid of binary cross-entropy (BCE) and Dice loss,
$$\mathcal{L} = \mathcal{L}_{\mathrm{BCE}} + \mathcal{L}_{\mathrm{Dice}}, \qquad \mathcal{L}_{\mathrm{Dice}} = 1 - \frac{2\sum_i p_i g_i}{\sum_i p_i + \sum_i g_i},$$
where $p_i$ denotes the predicted crack probability and $g_i$ the ground-truth label at pixel $i$.
- Epochs: 4 (used for both the hyperparameter search and the main training run, following the published protocol).
- Implementation: All non-normalization SAM weights are strictly held constant throughout training.
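The objective and schedule above can be sketched as follows (an illustrative PyTorch reconstruction, not the published implementation; the 1×1-convolution head, the equal BCE/Dice weighting, and the learning-rate value are placeholder assumptions):

```python
import torch
import torch.nn as nn

def dice_loss(logits, target, eps=1e-6):
    """Soft Dice loss on sigmoid probabilities for binary masks."""
    probs = torch.sigmoid(logits)
    inter = (probs * target).sum()
    return 1 - (2 * inter + eps) / (probs.sum() + target.sum() + eps)

def hybrid_loss(logits, target):
    """BCE + Dice hybrid objective (equal weighting assumed here)."""
    bce = nn.functional.binary_cross_entropy_with_logits(logits, target)
    return bce + dice_loss(logits, target)

# Toy stand-in for the binary segmentation head; in SAC only LayerNorm
# affine parameters of the backbone would be trainable (see Section 1).
head = nn.Conv2d(3, 1, kernel_size=1)
opt = torch.optim.AdamW(head.parameters(), lr=1e-4)  # lr is a placeholder value
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=4)  # 4 epochs

x = torch.randn(2, 3, 16, 16)                       # batch size 2, as in the protocol
y = (torch.rand(2, 1, 16, 16) > 0.5).float()
loss = hybrid_loss(head(x), y)
loss.backward()
opt.step()
sched.step()
```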
3. Evaluation Protocol and Quantitative Results
Metrics
Crack segmentation is evaluated by pixel-level metrics, where TP, FP, and FN denote pixelwise true-positive, false-positive, and false-negative counts (crack vs. background):
- Precision: $\mathrm{TP} / (\mathrm{TP} + \mathrm{FP})$
- Recall: $\mathrm{TP} / (\mathrm{TP} + \mathrm{FN})$
- F1-Score: $2 \cdot \mathrm{Precision} \cdot \mathrm{Recall} / (\mathrm{Precision} + \mathrm{Recall})$
- IoU: $\mathrm{TP} / (\mathrm{TP} + \mathrm{FP} + \mathrm{FN})$
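These four metrics can be computed directly from raw pixel counts, as in the self-contained illustration below (flattened binary masks, 1 = crack):

```python
def crack_metrics(pred, gt):
    """Pixel-level precision, recall, F1, and IoU for binary masks (1 = crack)."""
    tp = sum(1 for p, g in zip(pred, gt) if p and g)
    fp = sum(1 for p, g in zip(pred, gt) if p and not g)
    fn = sum(1 for p, g in zip(pred, gt) if not p and g)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    return precision, recall, f1, iou

# 8-pixel toy example: 3 TP, 1 FP, 1 FN
pred = [1, 1, 1, 1, 0, 0, 0, 0]
gt   = [1, 1, 1, 0, 1, 0, 0, 0]
p, r, f1, iou = crack_metrics(pred, gt)
# p = 0.75, r = 0.75, f1 = 0.75, iou = 0.6
```

Note that IoU is always at most F1, and the two are monotonically related, which is why the benchmarks below rank methods identically under either metric.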
Performance Benchmarks
SAC on OmniCrack30k
- F1-Score: 61.22 %
- IoU: 44.13 %
Efficiency Comparison (ViT-Base backbone)
| Tuning Method | # Tunables | % of Backbone | F1 (%) | IoU (%) | Time (min/it) |
|---|---|---|---|---|---|
| No fine-tuning | 0 | 0% | 13.0 | 17.0 | – |
| Decoder only | 3.7 M | 4.17% | 57.97 | 40.83 | 7.9 |
| PEFT (LoRA, r=8) | 30.7 K | 0.034% | 57.95 | 40.81 | 9.9 |
| Ge et al. (PEFT+dec) | 4.0 M | 4.51% | 56.90 | 39.79 | 14.8 |
| LayerNorm tuning | 41 K | 0.046% | 61.22 | 44.13 | 12.3 |
Cross-Architecture Norm Tuning Comparison
| Model | Full-Tune F1/IoU | # Tunables | Norm-Tune F1/IoU | # Tunables |
|---|---|---|---|---|
| SegFormer (MiT-B0) | 59.98/42.85 | 3.7M | 52.82/35.91 | 7.6K |
| U-Net | 54.28/37.27 | 32.5M | 54.82/37.77 | 55K |
| DeepLabv3+ (Res50) | 55.27/38.21 | 42M | 52.93/36.01 | 57K |
| DeepLabv3+ (Res101) | 56.52/39.41 | 61M | 54.09/37.09 | 110K |
| SAC (SAM + LN tuning) | — | — | 61.22/44.13 | 41K |
Zero-Shot Generalization
| Dataset | SAC F1 | SAC IoU | DeepLabv3+ (Res101) F1 / IoU |
|---|---|---|---|
| Road420 | 64.22 | 47.30 | — |
| Facade390 | 61.74 | 44.68 | — |
| Concrete3k | 75.63 | 60.82 | — |
| Mean ± SD | 67.20 ± 6.05 | 50.93 ± 7.07 | 62.56 ± 9.60 / 46.28 ± 10.73 |
SAC displays the lowest variance across zero-shot tasks, indicating robustness and superior generalization.
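The mean ± SD row can be reproduced from the per-dataset SAC scores in the table above; note that the reported deviations match the population (divide-by-N) convention:

```python
from statistics import mean, pstdev

# SAC zero-shot scores from the table (Road420, Facade390, Concrete3k)
f1_scores  = [64.22, 61.74, 75.63]
iou_scores = [47.30, 44.68, 60.82]

# The reported 67.20 +/- 6.05 and 50.93 +/- 7.07 use the population
# standard deviation (pstdev divides by N, not N-1).
print(round(mean(f1_scores), 2), round(pstdev(f1_scores), 2))    # 67.2 6.05
print(round(mean(iou_scores), 2), round(pstdev(iou_scores), 2))  # 50.93 7.07
```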
4. Computational Efficiency and Generalization Analysis
- Parameter footprint: SAC tunes ≈41 K parameters (≈0.046% of SAM). By comparison, LoRA (r=8) tunes ≈30.7 K, combined PEFT-plus-decoder tuning ≈4 M, and decoder-only or full fine-tuning of conventional backbones ranges from 3.7 M to 61 M. The reported result is a 30–50% reduction in training time per epoch compared to non-selective adaptation.
- Generalization: SAC achieves the highest cross-domain mean F1 and the lowest standard deviation compared to all benchmarks. This suggests effective suppression of overfitting and superior capacity to segment cracks in unseen environments.
- Efficiency implication: Selective normalization tuning delivers substantial speedup and memory savings, making SAC feasible for deployment in resource-constrained settings.
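A quick arithmetic check of the quoted parameter fraction, using the approximate counts given in the text:

```python
# Back-of-envelope check of the quoted parameter fraction.
backbone_params = 90_000_000   # ViT-Base SAM backbone, approximate
tuned_params = 41_000          # LayerNorm gains and biases
fraction = 100 * tuned_params / backbone_params
print(f"{fraction:.3f}%")  # 0.046%
```

This reconciles the two figures quoted in the text: ≈0.046% exactly, rounded up to "approximately 0.05%" elsewhere.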
5. Key Methodological Innovations and Comparison with Prior Art
SAC’s distinguishing methodological characteristic is its use of LayerNorm-only fine-tuning for domain adaptation of SAM:
- Full fine-tuning, LoRA, or Adapter-based PEFT approaches tune considerably larger parameter subsets but do not outperform SAC’s normalization-only approach on large-scale and zero-shot benchmarks.
- SAC surpasses traditional segmentation networks (U-Net, DeepLabv3+, SegFormer) both in segmentation accuracy and computational cost for crack detection.
- Empirical ablation confirms that updating normalization statistics suffices to bridge domain gap and yield state-of-the-art crack segmentation (Rostami et al., 19 Apr 2025).
6. Practical Impact and Deployment Contexts
- SAC’s minimal computational requirements enable rapid retraining and deployment on real-world monitoring platforms where latency, energy and hardware constraints prohibit large-model fine-tuning.
- The model’s robustness in zero-shot tasks is demonstrated on distinct domains including asphalt, masonry, metal, and concrete.
- A plausible implication is that normalization-based adaptation strategies are especially suitable for industrial computer vision, where rapid prototyping and adaptation across diverse imaging domains are required.
7. Limitations and Future Directions
- The empirical results focus on ViT-Base; extending norm-tuning to larger backbone variants or non-Transformer architectures may require further validation.
- SAC does not alter feature representation kernels, which may limit adaptation in extreme domain shifts where structural features of cracks deviate significantly from those seen in pre-training.
- Fine-tuning normalization layers as a standalone strategy may benefit from integration with knowledge distillation or hybrid PEFT approaches for cases where further accuracy or interpretability is needed.
Summary Table: SAC Performance Comparison
| Model | # Tunables | F1 (%) | IoU (%) | Zero-Shot Mean F1 | Zero-Shot std(F1) |
|---|---|---|---|---|---|
| SAC (LayerNorm Tuning) | 41K | 61.22 | 44.13 | 67.20 | 6.05 |
| DeepLabv3+ Res101 | 61M | 56.52 | 39.41 | 62.56 | 9.60 |
| SegFormer (MiT-B0) | 3.7M | 59.98 | 42.85 | 52.82 | — |
This table demonstrates the parameter efficiency and generalization superiority of SAC relative to full and partial fine-tuning approaches in the published literature (Rostami et al., 19 Apr 2025).