DSCformer for Crack Segmentation

Updated 30 December 2025
  • DSCformer is a dual-branch deep learning architecture that combines an enhanced Dynamic Snake Convolution module with a SegFormer branch for precise concrete crack segmentation.
  • The network fuses multi-scale features using Weighted Convolutional and Spatial Attention Modules to robustly capture fine-grained, tubular crack features in noisy backgrounds.
  • Experimental results show DSCformer outperforms prior CNN+Transformer methods with improved IoU, F1 scores, and reduced Hausdorff Distance on standard crack datasets.

DSCformer refers to a dual-branch deep learning architecture designed for crack segmentation in the context of construction quality monitoring. Developed to address the complementary limitations of convolutional neural networks (CNNs) and Transformers, DSCformer integrates an enhanced Dynamic Snake Convolution (DSConv) module alongside a Transformer-based SegFormer branch to segment cracks in concrete with high precision and robustness, especially against noisy and complex backgrounds (Yu et al., 14 Nov 2024).

1. Network Architecture

DSCformer employs a two-branch encoder and a joint decoder design. The first branch utilizes stacked DSC blocks built on the enhanced Dynamic Snake Convolution, standard convolution, a Weighted Convolutional Attention Module (WCAM), a Spatial Attention Module (SAM), and residual connections. The second branch leverages a pre-trained SegFormer-b0 encoder, which outputs multi-scale feature maps at downsampling ratios of 4, 8, 16, and 32.

The decoder, at each spatial resolution, concatenates features from both branches followed by the application of WCAM and SAM. Two 3×3 convolutions with residual smoothing and an upsampling operation refine the features before producing the output segmentation mask.

The network data flow at each decoder level is summarized as follows:

Input image
  ↳ DSConv branch → multi-scale features (s=1/1, 1/2, 1/4, 1/8, 1/16)
  ↳ SegFormer branch → multi-scale features (s=1/4, 1/8, 1/16, 1/32)
→ At each decoder level: concatenate DSConv(s), SegFormer(s), and upsampled previous scale
→ WCAM → SAM → convolutions → upsampling → segmentation mask
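The fusion step at one decoder level can be sketched at the level of tensor shapes. This is a minimal NumPy sketch, not the paper's implementation: the channel counts and the nearest-neighbour 2× upsampling are illustrative assumptions, and the WCAM/SAM/convolution stages are only indicated by a comment.

```python
import numpy as np

def decoder_level(dsc_feat, seg_feat, prev):
    """One decoder level: concatenate same-scale features from the DSConv
    branch, the SegFormer branch, and the upsampled previous (coarser)
    decoder output. Shapes are (C, H, W); batch dimension omitted."""
    # nearest-neighbour 2x upsampling of the coarser decoder output
    prev_up = prev.repeat(2, axis=1).repeat(2, axis=2)
    fused = np.concatenate([dsc_feat, seg_feat, prev_up], axis=0)
    return fused  # WCAM -> SAM -> two 3x3 convs -> upsampling would follow

# Illustrative channel counts (assumptions, not from the paper):
dsc = np.zeros((32, 64, 64))    # DSConv-branch features at this scale
seg = np.zeros((64, 64, 64))    # SegFormer-branch features at this scale
prev = np.zeros((96, 32, 32))   # previous, coarser decoder output
out = decoder_level(dsc, seg, prev)
# out has 32 + 64 + 96 = 192 channels at the finer 64x64 resolution
```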

2. Enhanced Dynamic Snake Convolution (DSConv)

The DSConv module is designed to efficiently model fine-grained, tubular crack features that are challenging for standard convolutions and Transformers.

  • Pyramid Kernel Offsets: Instead of a single 3×3 offset predictor, DSCformer uses four sub-kernels of sizes 3×3, 5×5, 7×7, and 9×9 to generate 2D offsets for increasingly distant points within a 9×9 receptive field. Each sub-kernel outputs Δx and Δy offsets corresponding to its grid size.
  • Bi-directional Learnable Offset Iteration: DSConv simultaneously learns both positive and negative displacement steps from each grid center, forming a “snake-chain” of deformable sampling points. For center (x_t, y_t) and layer c:

$$(x_{t+c},\, y_{t+c}) = \Bigl(x_t + \sum_{i=1}^{c} \Delta x_i,\; y_t + \sum_{i=1}^{c} \Delta y_i\Bigr)$$

$$(x_{t-c},\, y_{t-c}) = \Bigl(x_t - \sum_{i=1}^{c} \Delta x_{-i},\; y_t - \sum_{i=1}^{c} \Delta y_{-i}\Bigr)$$
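These two update rules are simply cumulative sums of the learned offsets, accumulated outward from the grid centre in each direction. A minimal NumPy sketch (the function name and the numeric offsets are illustrative assumptions, not the paper's code):

```python
import numpy as np

def snake_chain(center, pos_offsets, neg_offsets):
    """Bi-directional snake-chain sampling points.

    center:      (x_t, y_t) grid centre
    pos_offsets: (c, 2) array of (Delta x_i, Delta y_i), i = 1..c
    neg_offsets: (c, 2) array of (Delta x_{-i}, Delta y_{-i})
    Returns the forward chain (x_{t+c}, y_{t+c}) and the backward
    chain (x_{t-c}, y_{t-c}) for c = 1..len(offsets).
    """
    x_t, y_t = center
    pos = np.cumsum(pos_offsets, axis=0)   # partial sums over i <= c
    neg = np.cumsum(neg_offsets, axis=0)
    forward = np.stack([x_t + pos[:, 0], y_t + pos[:, 1]], axis=1)
    backward = np.stack([x_t - neg[:, 0], y_t - neg[:, 1]], axis=1)
    return forward, backward

fwd, bwd = snake_chain(
    (4.0, 4.0),
    np.array([[0.5, 0.2], [0.3, -0.1]]),   # (Δx_i, Δy_i)
    np.array([[0.4, 0.0], [0.2, 0.3]]),    # (Δx_{-i}, Δy_{-i})
)
# fwd: (4.5, 4.2), (4.8, 4.1); bwd: (3.6, 4.0), (3.4, 3.7)
```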

  • Bilinear Interpolation: For sampling at fractional locations, bilinear interpolation is used:

$$F(x, y) = \sum_{i, j} G(x - i,\, y - j)\; F(i, j)$$

with

$$G(\Delta x, \Delta y) = g(\Delta x)\, g(\Delta y), \qquad g(\Delta) = \max(0,\, 1 - |\Delta|)$$

This formulation enables adaptive following of thin, curved crack structures as opposed to the rigid sampling of classical convolutions.
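With the tent kernel $g$, the sum above reduces to the familiar four-neighbour bilinear weighting. A small NumPy sketch of sampling a feature map at a fractional location (an illustration under this standard formulation, not the paper's code):

```python
import numpy as np

def bilinear_sample(F, x, y):
    """Sample F at fractional (x, y) using the tent kernel
    G(dx, dy) = g(dx) g(dy), g(d) = max(0, 1 - |d|).
    Only the four integer neighbours of (x, y) have nonzero weight."""
    H, W = F.shape
    out = 0.0
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    for i in (x0, x0 + 1):
        for j in (y0, y0 + 1):
            if 0 <= i < H and 0 <= j < W:
                g = max(0.0, 1 - abs(x - i)) * max(0.0, 1 - abs(y - j))
                out += g * F[i, j]
    return out

F = np.array([[0.0, 1.0],
              [2.0, 3.0]])
v = bilinear_sample(F, 0.5, 0.5)   # equal 0.25 weight on all four neighbours
```

At integer locations the kernel collapses to a single unit weight, so the sample reproduces the stored value exactly.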

3. Weighted Convolutional Attention Module (WCAM)

The WCAM is a refined channel attention mechanism designed to improve on CBAM’s CAM component:

  • Pooling and MLPs: Given a feature map $F \in \mathbb{R}^{C \times H \times W}$, global average- and max-pooling over the spatial dimensions produce $F^c_{avg}$ and $F^c_{max}$. Each is processed independently by a two-layer MLP.
  • Learnable Channel Weights: The channel attention response is the sigmoid of a weighted sum of the MLP outputs:

$$M_c(F) = \sigma\bigl(w_{avg} \odot M_{avg}(F^c_{avg}) + w_{max} \odot M_{max}(F^c_{max})\bigr)$$

where $w_{avg}$ and $w_{max}$ are learnable weights, $\sigma$ is the sigmoid function, and $\odot$ denotes element-wise multiplication.

WCAM is placed both within each DSC block (after feature concatenation) and in the decoder after feature fusion from both branches, focusing the network on crack-relevant feature channels before spatial attention and further processing.
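The channel-attention response above can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: ReLU hidden activations, a standard reduction-ratio MLP shape, and random illustrative weights; the actual layer sizes in DSCformer are not specified here.

```python
import numpy as np

def two_layer_mlp(v, A, B):
    """Reduction-then-expansion two-layer MLP with ReLU (assumed activation)."""
    return B @ np.maximum(A @ v, 0.0)

def wcam(F, mlp_avg, mlp_max, w_avg, w_max):
    """Weighted channel attention on F of shape (C, H, W).
    mlp_avg / mlp_max are (A, B) weight pairs for the two independent MLPs;
    w_avg / w_max are the learnable per-channel weights of the weighted sum."""
    f_avg = F.mean(axis=(1, 2))                       # F^c_avg
    f_max = F.max(axis=(1, 2))                        # F^c_max
    z = (w_avg * two_layer_mlp(f_avg, *mlp_avg)
         + w_max * two_layer_mlp(f_max, *mlp_max))    # weighted sum
    m_c = 1.0 / (1.0 + np.exp(-z))                    # sigmoid gate per channel
    return F * m_c[:, None, None]                     # reweight channels

rng = np.random.default_rng(0)
C, r = 8, 2                                           # channels, reduction ratio
mlps = [(rng.standard_normal((C // r, C)), rng.standard_normal((C, C // r)))
        for _ in range(2)]
F = rng.standard_normal((C, 16, 16))
out = wcam(F, mlps[0], mlps[1], np.ones(C), np.ones(C))
```

Because the gate $M_c(F)$ lies in $(0, 1)$ per channel, WCAM can only attenuate channels, steering capacity toward crack-relevant ones.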

4. Training Procedure and Optimization

The model is trained end-to-end with standard segmentation loss functions:

  • Loss Function: Composite of cross-entropy loss and Dice loss:

$$\text{DiceLoss} = 1 - \frac{2\,|P \cap G|}{|P| + |G|}$$

where $P$ is the predicted mask and $G$ the ground truth.

  • Hyperparameters: Training is performed for 100 epochs using the Adam optimizer (learning rate=1e-4, weight decay=1e-4, batch size=8).
  • Augmentation: Data augmentation includes resizing to fixed dimensions (e.g., 512×512), random flips, rotations (±15°), and affine transforms (scale 0.8–1.2, shear ±10°).
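The composite objective can be sketched as follows. This is a minimal NumPy sketch: the paper specifies cross-entropy plus Dice loss, while the per-pixel binary form, the equal weighting of the two terms, and the smoothing constant are assumptions made here for illustration.

```python
import numpy as np

def dice_loss(pred, gt, eps=1e-6):
    """Soft Dice loss: 1 - 2|P ∩ G| / (|P| + |G|), with eps for stability."""
    inter = (pred * gt).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

def bce(pred, gt, eps=1e-7):
    """Per-pixel binary cross-entropy on predicted probabilities."""
    p = np.clip(pred, eps, 1.0 - eps)
    return -(gt * np.log(p) + (1.0 - gt) * np.log(1.0 - p)).mean()

def composite_loss(pred, gt):
    # Equal weighting of the two terms is an assumption for this sketch.
    return bce(pred, gt) + dice_loss(pred, gt)

p = np.array([0.9, 0.1, 0.8])   # predicted crack probabilities (illustrative)
g = np.array([1.0, 0.0, 1.0])   # ground-truth mask
loss = composite_loss(p, g)
```

The Dice term counteracts the extreme foreground/background imbalance typical of thin cracks, while cross-entropy provides stable per-pixel gradients.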

5. Quantitative and Qualitative Results

Experiments on the Crack3238 and FIND datasets, comparing DSCformer to seven recent baselines, demonstrate its superior segmentation performance.

Dataset   | Method     | Params (M) | IoU %  | F1 %  | Prec % | Rec % | HD (mm)
----------|------------|------------|--------|-------|--------|-------|--------
Crack3238 | DcsNet     | 15.2       | 56.85  | 70.05 | 69.83  | 73.73 | 37.17
Crack3238 | DSCformer  | 14.8       | 58.74  | 71.74 | 72.38  | 73.99 | 33.35
FIND      | UCTransNet | 66.5       | 86.05  | 92.30 | 91.15  | 93.98 | 11.90
FIND      | DSCformer  | 14.8       | 87.31  | 93.04 | 92.52  | 94.01 | 11.14

DSCformer improves IoU by 1.89 points over the strongest prior baseline on Crack3238 (DcsNet) and by 1.26 points over UCTransNet on FIND, while using fewer parameters than either. Qualitative analysis shows precise delineation of thin cracks, superior handling of background noise and complex textures, and more accurate mapping of wide crack regions than competing approaches.

6. Ablation Study

Ablation experiments illustrate the incremental contributions of each DSCformer component:

  • Replacing vanilla conv with DSConv and enhanced offsets improves IoU by up to 1.00%.
  • WCAM outperforms standard CAM by 0.46% IoU.
  • Integrating both branches (DSConv and SegFormer) yields up to 4.75% IoU gain versus single-branch baselines.

7. Significance and Broader Context

DSCformer represents a principled integration of adaptive convolutional mechanisms and transformer-based global context modeling in the domain of visual crack segmentation. Enhanced DSConv enables fine-grained, deformable filtering matched to the geometry of cracks, while transformer-based representations capture wider contextual cues, together addressing key limitations of prior purely convolutional or transformer-only methods. Quantitative and qualitative performance, along with ablation studies, validate the architectural innovations and highlight the potential of dual-branch synergistic models for fine-structure segmentation tasks in complex environments (Yu et al., 14 Nov 2024).
