LPCANet: Lightweight Pyramid Cross-Attention Network

Updated 21 January 2026
  • The paper introduces LPCANet, integrating a MobileNetV2 backbone with pyramid depth encoding and multi-head cross-attention to detect rail surface defects.
  • It leverages multi-scale fusion of RGB and depth features through a Lightweight Pyramid Module, Cross-Attention Mechanism, and Spatial Feature Extractor for precise segmentation.
  • Quantitative results demonstrate state-of-the-art improvements in metrics like IoU and MAE across rail and non-rail defect datasets while ensuring real-time performance.

The Lightweight Pyramid Cross-Attention Network (LPCANet) is a specialized multi-modal deep learning architecture designed for efficient and accurate detection of rail surface defects using RGB-D (color and depth) image data. By integrating a lightweight convolutional backbone, a pyramid-based depth feature extractor, multi-scale cross-attention fusion, and a spatial feature enhancer, LPCANet achieves high accuracy with minimal computational burden, offering a practical solution for industrial defect inspection and real-time deployment in edge scenarios (Alex et al., 14 Jan 2026).

1. Architectural Components

LPCANet’s architecture consists of four principal modules: a MobileNetV2 RGB backbone, a Lightweight Pyramid Module (LPM) for depth, a multi-scale Cross-Attention Mechanism (CAM), and a Spatial Feature Extractor (SFE). Each module operates at multiple spatial scales, culminating in a final mask head for defect localization.

MobileNetV2 Backbone: This component is initialized with ImageNet-1K weights and extracts RGB features at four resolutions (H/4×W/4 to H/32×W/32), denoted $F_1^r$, $F_2^r$, $F_3^r$, $F_4^r$, with output channels [24, 32, 64, 160], respectively. The backbone's per-stage structure, parameter counts, and FLOP counts are given below.

| Layer | Output Size | Params (K) | FLOPs (M) |
|---|---|---|---|
| conv1 | 160×160×32 | 992 | 17.7 |
| Bottleneck2 | 80×80×24 | 5,376 | 150 |
| Bottleneck3 | 40×40×32 | 9,216 | 307 |
| Bottleneck4 | 20×20×64 | 18,432 | 1,179 |
| Bottleneck6 | 10×10×160 | 69,120 | 1,152 |
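The four-stage resolution hierarchy above follows directly from the stage strides (4, 8, 16, 32). A minimal sketch of this scale arithmetic for a 320×320 input (the function name and stride tuple are illustrative, not from the paper's code):

```python
# Sketch: spatial sizes of the four MobileNetV2 feature stages for a
# 320x320 input, derived from the stated strides H/4 ... H/32.
def stage_resolutions(h, w, strides=(4, 8, 16, 32)):
    """Return the (H/s, W/s) feature-map size at each backbone stage."""
    return [(h // s, w // s) for s in strides]

channels = [24, 32, 64, 160]  # per-stage output channels from the paper
for (hh, ww), c in zip(stage_resolutions(320, 320), channels):
    print(f"{hh}x{ww}x{c}")  # 80x80x24, 40x40x32, 20x20x64, 10x10x160
```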

Lightweight Pyramid Module (LPM): The LPM processes the single-channel depth image $I^d$ and mirrors the multiscale extraction of the RGB backbone, producing $F_i^d$ at the same spatial resolutions. Its structure consists of a 4×4 projection (stride 4) followed by a cascade of 3×3 convolutions and average pooling, yielding feature maps with channel progression [64, 128, 256, 512]. The total parameter count for the LPM is ≈2.1M, with about 1.9G FLOPs.
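The LPM's pyramid of resolutions can be traced with a minimal NumPy sketch. This is a shape trace under stated assumptions, not the authors' implementation: the stride-4 projection and 2×2 average pooling are real reductions, but the 3×3 convolutions' channel expansion is stood in for by channel repetition, so only the shapes are meaningful:

```python
import numpy as np

def avg_pool2x2(x):
    """2x2 average pooling on an (H, W, C) array via reshape."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def lpm_shapes(depth_img, channels=(64, 128, 256, 512)):
    """Trace the depth pyramid: a stride-4 projection, then repeated
    pooling, mirroring the RGB backbone's H/4 ... H/32 resolutions.
    Channel growth from the 3x3 convs is mimicked by repetition here --
    this traces shapes, not learned features."""
    h, w = depth_img.shape
    x = depth_img.reshape(h // 4, 4, w // 4, 4).mean(axis=(1, 3))  # stride-4 projection
    x = np.repeat(x[..., None], channels[0], axis=-1)
    shapes = [x.shape]
    for c in channels[1:]:
        x = avg_pool2x2(x)                     # halve the spatial size
        x = np.repeat(x[..., :1], c, axis=-1)  # stand-in for channel expansion
        shapes.append(x.shape)
    return shapes

print(lpm_shapes(np.zeros((320, 320))))
# [(80, 80, 64), (40, 40, 128), (20, 20, 256), (10, 10, 512)]
```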

Cross-Attention Mechanism (CAM): At each scale $i$, CAM fuses the RGB backbone features $F_i^r$ and LPM features $F_i^d$ using multi-head cross-attention, formulated as:

$$Q_i^r = W_Q F_i^r, \quad K_i^d = W_K F_i^d, \quad V_i^d = W_V F_i^d$$

After reshaping into $N_h$ heads:

$$\text{Attn}(\hat Q_i^r, \hat K_i^d, \hat V_i^d) = \mathrm{softmax}\left(\frac{\hat Q_i^r (\hat K_i^d)^T}{\sqrt{d_z}}\right) \hat V_i^d$$

The attended features are projected and batch-normalized, yielding $F_i^{\mathrm{ca}}$. Across all scales, the CAMs total ≈4.0M parameters and ≈0.8G FLOPs.
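The equations above can be sketched in plain NumPy. This is a minimal illustration of multi-head cross-attention (queries from RGB tokens, keys/values from depth tokens), omitting the output projection and batch normalization; all sizes and weight initializations are toy assumptions:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(Fr, Fd, Wq, Wk, Wv, n_heads):
    """Multi-head cross-attention: queries from RGB tokens, keys/values
    from depth tokens. Fr, Fd: (N, C) matrices of flattened spatial tokens."""
    N, C = Fr.shape
    dz = C // n_heads
    # Project, then split the channel dimension into heads: (n_heads, N, dz)
    Q = (Fr @ Wq).reshape(N, n_heads, dz).transpose(1, 0, 2)
    K = (Fd @ Wk).reshape(N, n_heads, dz).transpose(1, 0, 2)
    V = (Fd @ Wv).reshape(N, n_heads, dz).transpose(1, 0, 2)
    attn = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(dz))  # (heads, N, N)
    return (attn @ V).transpose(1, 0, 2).reshape(N, C)      # merge heads

rng = np.random.default_rng(0)
C, N, H = 64, 100, 4  # toy sizes: 64 channels, 10x10 tokens, 4 heads
Fr, Fd = rng.normal(size=(N, C)), rng.normal(size=(N, C))
Wq, Wk, Wv = (rng.normal(size=(C, C)) * 0.1 for _ in range(3))
print(cross_attention(Fr, Fd, Wq, Wk, Wv, H).shape)  # (100, 64)
```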

Spatial Feature Extractor (SFE): Applied at scales 1–3, the SFE processes $F_i^{\mathrm{ca}}$ via a $1\times1$ convolution, splits the channels into horizontal and vertical paths with $1\times3$ and $3\times1$ convolutions (reflecting anisotropic cue extraction), fuses them, and projects the result through another $1\times1$ convolution. The SFE omits dilated and deformable convolutions, totaling ≈1.2M parameters and 0.4G FLOPs across all relevant scales.
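The channel split into horizontal and vertical strip convolutions can be sketched as follows. This is a simplified, depthwise-style NumPy illustration (the 1×1 projections are omitted and the kernels are fixed rather than learned, so it shows only the anisotropic data flow):

```python
import numpy as np

def conv1x3(x, k):
    """Horizontal 1x3 convolution (per-channel, zero-padded) on (H, W, C)."""
    p = np.pad(x, ((0, 0), (1, 1), (0, 0)))
    return k[0] * p[:, :-2] + k[1] * p[:, 1:-1] + k[2] * p[:, 2:]

def conv3x1(x, k):
    """Vertical 3x1 convolution (per-channel, zero-padded) on (H, W, C)."""
    p = np.pad(x, ((1, 1), (0, 0), (0, 0)))
    return k[0] * p[:-2] + k[1] * p[1:-1] + k[2] * p[2:]

def sfe(x, kh=(0.25, 0.5, 0.25), kv=(0.25, 0.5, 0.25)):
    """Split channels in half, run horizontal and vertical strip
    convolutions, then concatenate -- mirroring the anisotropic paths
    described above (1x1 projections omitted for brevity)."""
    c = x.shape[-1] // 2
    return np.concatenate([conv1x3(x[..., :c], kh),
                           conv3x1(x[..., c:], kv)], axis=-1)

x = np.random.default_rng(1).normal(size=(20, 20, 64))
print(sfe(x).shape)  # (20, 20, 64): channel count is preserved
```

Strip convolutions of this kind are cheap: a 1×3 plus a 3×1 kernel covers horizontal and vertical extents with 6 taps instead of the 9 of a full 3×3, which suits elongated defects such as rail scratches.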

2. Multi-Modal Fusion and Mask Prediction Pipeline

LPCANet processes an RGB-D pair, first extracting hierarchical color and depth features, then fusing them via CAM at each scale, and finally enhancing spatial cues using the SFE. The outputs are passed to upsampling/downsampling and a mask head to generate the final segmentation mask.

| Module | Input / Output | Params (M) | FLOPs (G) |
|---|---|---|---|
| Backbone | $320\times320\times3 \to \{F_i^r\}$ | 3.42 | 1.15 |
| LPM | $320\times320\times1 \to \{F_i^d\}$ | 2.10 | 0.15 |
| CAM (×4 scales) | $\{F_i^r, F_i^d\} \to F_i^{\mathrm{ca}}$ | 4.00 | 0.80 |
| SFE (scales 1–3) | $F_i^{\mathrm{ca}} \to f_i^{\mathrm{out}}$ | 1.20 | 0.40 |
| Mask Head | $320\times320 \to 320\times320\times1$ | 0.28 | 0.00 |
| Total | | ≈9.90 | 2.50 |

The total parameter count is ≈9.90M, with end-to-end FLOPs of 2.5G. The model achieves an inference speed of 162.60 fps for $320\times320$ inputs on standard hardware.

3. Training Paradigm

LPCANet is trained with binary cross-entropy loss on the final segmentation mask:

$$\mathcal{L}_\mathrm{BCE} = -\sum_{a,b} \left[ p_{ab} \log q_{ab} + (1 - p_{ab}) \log (1 - q_{ab}) \right]$$

No IoU or auxiliary losses are used. Optimization employs AdamW with an initial learning rate of $1\times10^{-4}$, momentum 0.9, and weight decay 0.05, annealed via a cosine schedule over 50 epochs; the batch size is 16. Data augmentation includes random flips, cropping, rotation ($\pm15^\circ$), Gaussian noise, and impulse noise to promote generalization.
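The stated cosine schedule can be written out explicitly. A minimal sketch, assuming the standard cosine-annealing formula with a floor of zero (the paper does not state a minimum learning rate):

```python
import math

def cosine_lr(epoch, total_epochs=50, lr0=1e-4, lr_min=0.0):
    """Cosine-annealed learning rate over the stated 50-epoch schedule."""
    t = epoch / total_epochs
    return lr_min + 0.5 * (lr0 - lr_min) * (1 + math.cos(math.pi * t))

print(cosine_lr(0))   # 1e-4: the initial rate
print(cosine_lr(25))  # 5e-5: half the initial rate at the midpoint
print(cosine_lr(50))  # 0.0: fully annealed
```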

Datasets used in supervised and unsupervised settings include NEU-RSDDS-AUG (1,500 train / 362 test), RSDD-TYPE1, and RSDD-TYPE2 for rail defects. Generalization is evaluated on DAGM2007, MT, and Kolektor-SDD2.

4. Quantitative Evaluation and Ablation

Performance metrics for segmentation include the structure measure $S_\alpha$, Intersection-over-Union (IoU), and Mean Absolute Error (MAE):

$$S_\alpha = \alpha S_o + (1-\alpha) S_r \quad (\alpha = 0.5), \qquad \mathrm{IoU} = \frac{|P \cap G|}{|P \cup G|}, \qquad \mathrm{MAE} = \frac{1}{HW} \sum_{i,j} |P(i,j) - G(i,j)|$$

On NEU-RSDDS-AUG, LPCANet achieves state-of-the-art results, improving over prior methods:

| Model | mAP | IoU | MAE | $F_{\beta=1}$ | $E_\xi$ | $S_\alpha$ |
|---|---|---|---|---|---|---|
| CSEPNet | 94.40 | 82.22 | 8.88 | 88.43 | 92.37 | 83.10 |
| LPCANet | 94.43 | 83.08 | 7.11 | 88.57 | 92.17 | 84.58 |
| $\Delta$ | +0.03 | +0.86 | –1.77 | +0.14 | –0.20 | +1.48 |

Ablation demonstrates the importance of the CAM and SFE modules:

| Config | CAM | SFE | Params (M) | IoU | $S_\alpha$ |
|---|---|---|---|---|---|
| Baseline | ✗ | ✗ | 9.20 | 80.12 | 81.80 |
| + CAM only | ✓ | ✗ | 9.55 | 82.12 | 83.62 |
| + SFE only | ✗ | ✓ | 9.75 | 81.54 | 83.01 |
| LPCANet (full) | ✓ | ✓ | 9.90 | 83.08 | 84.58 |

On non-rail datasets, LPCANet demonstrates strong generalization, often surpassing or matching the top results.

5. Design Rationale, Generalization, and Practical Utility

LPCANet’s modular design addresses competing requirements for accuracy, efficiency, and industrial deployability. The combination of MobileNetV2’s low computational cost, the LPM’s targeted depth encoding, and cross-attention fusion enables effective integration of multimodal cues within a restricted parameter and FLOP budget. The SFE’s anisotropic convolutional paths target defect structures with specific geometric orientations, common in rail inspection.

The model’s performance on out-of-domain datasets (DAGM2007, MT, Kolektor-SDD2) indicates robust generalization properties. This suggests potential for broader adoption across visual inspection domains beyond rail surfaces.

A plausible implication is that the explicit pyramid structure and absence of dilated/deformable convolutions in SFE facilitate both hardware efficiency and interpretability, which are often crucial in safety-critical industrial scenarios.

6. Limitations and Prospects for Model Compression

While LPCANet achieves notable reductions in computational requirements (9.90M params, 2.50G FLOPs), further advances are anticipated in model compression and real-time, on-device deployment. Planned directions include structured pruning (particularly targeting redundant cross-attention heads and SFE channels), knowledge distillation into even smaller architectures (e.g., MobileNetV3-based students), and 8-bit quantization for both weights and activations while minimizing accuracy loss. These strategies are expected to reduce model size below 5M parameters, enabling truly edge-resident rail inspection systems.
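The 8-bit quantization direction mentioned above can be illustrated with a minimal sketch of symmetric per-tensor post-training quantization. This is a generic technique sketch, not the paper's procedure; the weight tensor is randomly generated for demonstration:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor 8-bit quantization: map floats to int8 with a
    single scale so that the largest magnitude lands on +/-127."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Toy weight tensor standing in for one layer's weights.
w = np.random.default_rng(2).normal(scale=0.05, size=(64, 64)).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).max()
print(q.dtype)     # int8: 4x smaller than float32 storage
print(err <= s)    # True: reconstruction error bounded by one quantization step
```

Per-tensor symmetric quantization is the simplest variant; per-channel scales or quantization-aware training would typically be needed to keep segmentation accuracy loss negligible.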

7. Connections to Prior Work and Broader Impact

LPCANet narrows the functional and methodological gap between traditional symbolic vision pipelines and contemporary deep learning approaches by leveraging explicit cross-modal fusion and multiscale feature reasoning. By setting new benchmarks on both in-domain and general-purpose defect datasets, the architecture establishes a reference for scalable, accurate, and computationally efficient industrial inspection models, with expected impact across inspection, quality assurance, and autonomous monitoring sectors (Alex et al., 14 Jan 2026).
