DSCformer for Crack Segmentation
- DSCformer is a dual-branch deep learning architecture that combines an enhanced Dynamic Snake Convolution module with a SegFormer branch for precise concrete crack segmentation.
- The network fuses multi-scale features using Weighted Convolutional and Spatial Attention Modules to robustly capture fine-grained, tubular crack features in noisy backgrounds.
- Experimental results show DSCformer outperforms prior CNN+Transformer methods with improved IoU, F1 scores, and reduced Hausdorff Distance on standard crack datasets.
DSCformer refers to a dual-branch deep learning architecture designed for crack segmentation in the context of construction quality monitoring. Developed to address the complementary limitations of convolutional neural networks (CNNs) and Transformers, DSCformer integrates an enhanced Dynamic Snake Convolution (DSConv) module alongside a Transformer-based SegFormer branch to segment concrete cracks with high precision and robustness, especially against noisy and complex backgrounds (Yu et al., 14 Nov 2024).
1. Network Architecture
DSCformer employs a two-branch encoder and a joint decoder design. The first branch utilizes stacked DSC blocks built on the enhanced Dynamic Snake Convolution, standard convolution, a Weighted Convolutional Attention Module (WCAM), a Spatial Attention Module (SAM), and residual connections. The second branch leverages a pre-trained SegFormer-b0 encoder, which outputs multi-scale feature maps at downsampling ratios of 4, 8, 16, and 32.
The decoder, at each spatial resolution, concatenates features from both branches followed by the application of WCAM and SAM. Two 3×3 convolutions with residual smoothing and an upsampling operation refine the features before producing the output segmentation mask.
The network data flow at each decoder level is summarized as follows:
- DSConv branch: input image → multi-scale features (s = 1/1, 1/2, 1/4, 1/8, 1/16)
- SegFormer branch: input image → multi-scale features (s = 1/4, 1/8, 1/16, 1/32)
- At each decoder level: concatenate DSConv(s), SegFormer(s), and the upsampled previous scale → WCAM → SAM → convolutions → upsampling → segmentation mask
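The per-level fusion step can be sketched at shape level in NumPy. This is an illustrative assumption, not the paper's implementation: the function names are hypothetical, and WCAM, SAM, and the two 3×3 convolutions are stood in for by an identity placeholder so only the concatenation and upsampling logic is shown.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x spatial upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def decoder_level(dsconv_feat, segformer_feat, prev_decoder_feat):
    """One DSCformer-style decoder step (shape-level sketch): concatenate
    both branch features with the upsampled previous decoder output along
    the channel axis, then refine. WCAM -> SAM -> two 3x3 convolutions
    with residual smoothing are replaced by an identity placeholder here."""
    fused = np.concatenate(
        [dsconv_feat, segformer_feat, upsample2x(prev_decoder_feat)], axis=0
    )
    refined = fused  # placeholder for WCAM -> SAM -> convs
    return refined

# Toy (C, H, W) shapes at one decoder scale.
ds = np.zeros((32, 64, 64))     # DSConv-branch features at this scale
sf = np.zeros((64, 64, 64))     # SegFormer-branch features at this scale
prev = np.zeros((128, 32, 32))  # previous (coarser) decoder output
out = decoder_level(ds, sf, prev)
print(out.shape)  # (224, 64, 64): 32 + 64 + 128 channels, spatially aligned
```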
2. Enhanced Dynamic Snake Convolution (DSConv)
The DSConv module is designed to efficiently model fine-grained, tubular crack features that are challenging for standard convolutions and Transformers.
- Pyramid Kernel Offsets: Instead of a single 3×3 offset predictor, DSCformer uses four sub-kernels of sizes 3×3, 5×5, 7×7, and 9×9 to generate 2D offsets for increasingly distant points within a 9×9 receptive field. Each sub-kernel outputs Δx and Δy offsets corresponding to its grid size.
- Bi-directional Learnable Offset Iteration: DSConv simultaneously learns both positive and negative displacement steps from each grid center, forming a “snake-chain” of deformable sampling points. For a horizontally oriented kernel with center (x_t, y_t) and chain step c, the sampled positions are
(x_{t+c}, y_{t+c}) = (x_t + c, y_t + Σ_{i=t}^{t+c} Δy_i),  (x_{t−c}, y_{t−c}) = (x_t − c, y_t + Σ_{i=t−c}^{t} Δy_i),
so each point’s vertical offset accumulates along the chain; the vertically oriented kernel accumulates Δx analogously.
- Bilinear Interpolation: Because the accumulated offsets yield fractional locations, the feature value at a deformed sampling point K is obtained by bilinear interpolation over the surrounding integer grid points K′:
K = Σ_{K′} B(K′, K) · K′,
with B(K′, K) = b(K′_x, K_x) · b(K′_y, K_y) and b(a, c) = max(0, 1 − |a − c|).
This formulation enables adaptive following of thin, curved crack structures as opposed to the rigid sampling of classical convolutions.
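The bilinear sampling used for DSConv's deformed points can be sketched as a small NumPy function; the helper names are illustrative, and only the interpolation rule b(a, c) = max(0, 1 − |a − c|) is taken from the formulation above.

```python
import numpy as np

def b(a, c):
    """1-D bilinear weight: max(0, 1 - |a - c|)."""
    return max(0.0, 1.0 - abs(a - c))

def bilinear_sample(feat, x, y):
    """Sample a (H, W) feature map at a fractional location (x=row, y=col).
    Sums the four surrounding grid values weighted by
    B(K', K) = b(K'_x, K_x) * b(K'_y, K_y), so DSConv's snake-chain
    points can land between pixels."""
    h, w = feat.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    val = 0.0
    for xi in (x0, x0 + 1):
        for yi in (y0, y0 + 1):
            if 0 <= xi < h and 0 <= yi < w:
                val += b(xi, x) * b(yi, y) * feat[xi, yi]
    return val

grid = np.arange(16, dtype=float).reshape(4, 4)
print(bilinear_sample(grid, 1.5, 2.5))  # 8.5: mean of feat[1:3, 2:4]
print(bilinear_sample(grid, 2.0, 2.0))  # 10.0: integer location, exact value
```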
3. Weighted Convolutional Attention Module (WCAM)
The WCAM is a refined channel attention mechanism designed to improve on CBAM’s CAM component:
- Pooling and MLPs: Given a feature map F ∈ ℝ^{C×H×W}, global average pooling and global max pooling over the spatial dimensions produce channel descriptors F_avg and F_max. Each is processed independently by a two-layer MLP.
- Learnable Channel Weights: The channel attention response is the sigmoid of a weighted sum of the MLP outputs,
M_c(F) = σ(w₁ · MLP(F_avg) + w₂ · MLP(F_max)),  F′ = M_c(F) ⊗ F,
where w₁ and w₂ are learnable weights, σ is the sigmoid function, and ⊗ denotes element-wise multiplication across channels.
WCAM is placed both within each DSC block (after feature concatenation) and in the decoder after feature fusion from both branches, focusing the network on crack-relevant feature channels before spatial attention and further processing.
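A minimal NumPy sketch of the WCAM channel attention follows. It is an assumption-laden illustration: for brevity a single MLP is shared between the pooled descriptors (the paper processes each independently), and the weight shapes and reduction ratio are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def wcam(feat, w1, w2, mlp_w1, mlp_w2):
    """Weighted Convolutional Attention Module (sketch).
    feat: (C, H, W). Global avg- and max-pool over space, pass each
    descriptor through a two-layer bottleneck MLP, then take the sigmoid
    of a *learnably weighted* sum (w1, w2) rather than CBAM's plain sum.
    Note: one MLP is shared here for brevity; the paper uses independent MLPs."""
    f_avg = feat.mean(axis=(1, 2))            # (C,) average-pooled descriptor
    f_max = feat.max(axis=(1, 2))             # (C,) max-pooled descriptor
    def mlp(v):
        hidden = np.maximum(0.0, mlp_w1 @ v)  # ReLU bottleneck
        return mlp_w2 @ hidden
    attn = sigmoid(w1 * mlp(f_avg) + w2 * mlp(f_max))  # (C,) in (0, 1)
    return feat * attn[:, None, None]         # channel-wise reweighting

c, r = 8, 2  # channels, bottleneck width (illustrative)
rng = np.random.default_rng(0)
feat = rng.standard_normal((c, 16, 16))
out = wcam(feat, w1=0.7, w2=0.3,
           mlp_w1=rng.standard_normal((r, c)) * 0.1,
           mlp_w2=rng.standard_normal((c, r)) * 0.1)
print(out.shape)  # (8, 16, 16)
```

Because the attention values lie strictly in (0, 1), the module can only attenuate channels, steering the network toward crack-relevant ones.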
4. Training Procedure and Optimization
The model is trained end-to-end with standard segmentation loss functions:
- Loss Function: a composite of cross-entropy and Dice losses,
L = L_CE(ŷ, y) + L_Dice(ŷ, y),
where ŷ is the predicted mask and y the ground truth.
- Hyperparameters: Training is performed for 100 epochs using the Adam optimizer (learning rate=1e-4, weight decay=1e-4, batch size=8).
- Augmentation: Data augmentation includes resizing to fixed dimensions (e.g., 512×512), random flips, rotations (±15°), and affine transforms (scale 0.8–1.2, shear ±10°).
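The composite objective can be sketched for a binary mask as follows; the equal weighting of the two terms and the smoothing constant are assumptions, since the paper's exact balance is not given here.

```python
import numpy as np

def composite_loss(pred, target, eps=1e-6):
    """Cross-entropy + Dice loss for binary crack masks (sketch).
    pred: predicted foreground probabilities in (0, 1); target: {0, 1}.
    The 1:1 weighting between the terms is an assumption."""
    pred = np.clip(pred, eps, 1.0 - eps)  # avoid log(0)
    ce = -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))
    inter = np.sum(pred * target)
    dice = 1.0 - (2.0 * inter + eps) / (np.sum(pred) + np.sum(target) + eps)
    return ce + dice

target = np.array([[0, 1], [1, 0]], dtype=float)
good = np.where(target == 1, 0.95, 0.05)  # confident, mostly correct
bad = np.where(target == 1, 0.05, 0.95)   # confident, mostly wrong
print(composite_loss(good, target) < composite_loss(bad, target))  # True
```

The Dice term counteracts the foreground/background imbalance typical of thin cracks, while cross-entropy keeps per-pixel gradients well behaved.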
5. Quantitative and Qualitative Results
Experiments on the Crack3238 and FIND datasets, comparing DSCformer to seven recent baselines, demonstrate its superior segmentation performance.
| Dataset | Method | Params (M) | IoU % | F1 % | Prec % | Rec % | HD (mm) |
|---|---|---|---|---|---|---|---|
| Crack3238 | DcsNet | 15.2 | 56.85 | 70.05 | 69.83 | 73.73 | 37.17 |
| Crack3238 | DSCformer | 14.8 | 58.74 | 71.74 | 72.38 | 73.99 | 33.35 |
| FIND | UCTransNet | 66.5 | 86.05 | 92.30 | 91.15 | 93.98 | 11.90 |
| FIND | DSCformer | 14.8 | 87.31 | 93.04 | 92.52 | 94.01 | 11.14 |
DSCformer improves IoU by roughly 1.89 percentage points over the strongest listed baseline on Crack3238 (DcsNet, 58.74 vs 56.85) and by 1.26 points on FIND (UCTransNet, 87.31 vs 86.05), while also reducing Hausdorff Distance on both datasets. Qualitative analysis demonstrates precise delineation of thin cracks, superior handling of background noise and complex textures, and more accurate mapping of wide crack regions compared to competing approaches.
6. Ablation Study
Ablation experiments illustrate the incremental contributions of each DSCformer component:
- Replacing vanilla conv with DSConv and enhanced offsets improves IoU by up to 1.00%.
- WCAM outperforms standard CAM by 0.46% IoU.
- Integrating both branches (DSConv and SegFormer) yields up to 4.75% IoU gain versus single-branch baselines.
7. Significance and Broader Context
DSCformer represents a principled integration of adaptive convolutional mechanisms and transformer-based global context modeling in the domain of visual crack segmentation. Enhanced DSConv enables fine-grained, deformable filtering matched to the geometry of cracks, while transformer-based representations capture wider contextual cues, together addressing key limitations of prior purely convolutional or transformer-only methods. Quantitative and qualitative performance, along with ablation studies, validate the architectural innovations and highlight the potential of dual-branch synergistic models for fine-structure segmentation tasks in complex environments (Yu et al., 14 Nov 2024).