LPCANet: Lightweight Pyramid Cross-Attention Network
- The paper introduces LPCANet, integrating a MobileNetV2 backbone with pyramid depth encoding and multi-head cross-attention to detect rail surface defects.
- It leverages multi-scale fusion of RGB and depth features through a Lightweight Pyramid Module, Cross-Attention Mechanism, and Spatial Feature Extractor for precise segmentation.
- Quantitative results demonstrate state-of-the-art improvements in metrics like IoU and MAE across rail and non-rail defect datasets while ensuring real-time performance.
The Lightweight Pyramid Cross-Attention Network (LPCANet) is a specialized multi-modal deep learning architecture designed for efficient and accurate detection of rail surface defects using RGB-D (color and depth) image data. By integrating a lightweight convolutional backbone, a pyramid-based depth feature extractor, multi-scale cross-attention fusion, and a spatial feature enhancer, LPCANet achieves high accuracy with minimal computational burden, offering a practical solution for industrial defect inspection and real-time deployment in edge scenarios (Alex et al., 14 Jan 2026).
1. Architectural Components
LPCANet’s architecture consists of four principal modules: a MobileNetV2 RGB backbone, a Lightweight Pyramid Module (LPM) for depth, a multi-scale Cross-Attention Mechanism (CAM), and a Spatial Feature Extractor (SFE). Each module operates at multiple spatial scales, culminating in a final mask head for defect localization.
MobileNetV2 Backbone: This component is initialized with ImageNet-1K weights and extracts RGB features at four resolutions (H/4×W/4 to H/32×W/32), denoted $f_1^{rgb}, f_2^{rgb}, f_3^{rgb}, f_4^{rgb}$, with output channels [24, 32, 64, 160] respectively. The backbone’s detailed structure and per-layer parameter/FLOP counts are provided below.
| Layer | Output Size | Params (K) | FLOPs (M) |
|---|---|---|---|
| conv1 | 160×160×32 | 992 | 17.7 |
| Bottleneck2 | 80×80×24 | 5,376 | 150 |
| Bottleneck3 | 40×40×32 | 9,216 | 307 |
| Bottleneck4 | 20×20×64 | 18,432 | 1,179 |
| Bottleneck6 | 10×10×160 | 69,120 | 1,152 |
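The four-scale shape flow of the backbone can be sketched directly from the strides and channel counts above. A minimal sketch, assuming a 320×320 input (consistent with the 80×80 first-scale output in the table); the function name is illustrative:

```python
# Sketch: spatial sizes and channels of the four RGB feature scales,
# at strides 4, 8, 16, 32 with channels [24, 32, 64, 160].
def backbone_feature_shapes(h, w):
    """Return (height, width, channels) for each of the four scales."""
    strides = [4, 8, 16, 32]
    channels = [24, 32, 64, 160]
    return [(h // s, w // s, c) for s, c in zip(strides, channels)]

shapes = backbone_feature_shapes(320, 320)
for fh, fw, c in shapes:
    print(fh, fw, c)  # 80 80 24 ... down to 10 10 160
```

This matches the Bottleneck2–Bottleneck6 output sizes listed in the table.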
Lightweight Pyramid Module (LPM): The LPM processes the single-channel depth image and mirrors the multiscale extraction of the RGB backbone, producing depth features $f_1^{d}, \dots, f_4^{d}$ at the same spatial resolutions. Its structure consists of a 4×4 projection (stride 4) followed by a cascade of 3×3 convolutions and average pooling, resulting in feature maps with channel progression [64, 128, 256, 512]. The total parameter count for the LPM is ≈2.1M with about 1.9G FLOPs.
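The pyramid's shape flow can be traced with a minimal numpy sketch: a 4×4 stride-4 patch projection to 64 channels, then stages that halve resolution by average pooling and double the channels. The random matrices stand in for the learned projection and 3×3 convolutions; they illustrate shapes only, not the actual operators:

```python
import numpy as np

rng = np.random.default_rng(0)
depth = rng.standard_normal((320, 320))          # single-channel depth map (H x W)

# 4x4 stride-4 patch projection: (H/4, W/4, 16) patches -> 64 channels
patches = depth.reshape(80, 4, 80, 4).transpose(0, 2, 1, 3).reshape(80, 80, 16)
f = patches @ rng.standard_normal((16, 64))       # (80, 80, 64)

pyramid = [f]
for c_out in [128, 256, 512]:
    h, w, c_in = f.shape
    # 2x2 average pooling halves the spatial resolution
    f = f.reshape(h // 2, 2, w // 2, 2, c_in).mean(axis=(1, 3))
    # channel-mixing matmul as a stand-in for the 3x3 conv that doubles channels
    f = f @ rng.standard_normal((c_in, c_out))
    pyramid.append(f)

print([p.shape for p in pyramid])
```

The resulting shapes, (80, 80, 64) down to (10, 10, 512), match the stated channel progression at H/4 through H/32.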
Cross-Attention Mechanism (CAM): At each scale $i$, CAM fuses the RGB backbone ($f_i^{rgb}$) and LPM ($f_i^{d}$) features using multi-head cross-attention, with queries drawn from one stream and keys/values from the other (written here with RGB as the query source):

$$Q_i = f_i^{rgb} W_Q, \qquad K_i = f_i^{d} W_K, \qquad V_i = f_i^{d} W_V$$

After reshaping to $h$ heads of dimension $d_h$:

$$\mathrm{Attn}(Q_i, K_i, V_i) = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_h}}\right) V_i$$

These attended features are projected and batch-normalized, yielding the fused feature at scale $i$. The total parameter count for CAMs at all scales is ≈4.0M, with ≈0.8G FLOPs.
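A minimal numpy sketch of multi-head cross-attention over flattened feature maps. The Q/K/V assignment (RGB as queries, depth as keys/values), the head count, and the random weights are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

def multi_head_cross_attention(f_rgb, f_depth, n_heads, rng):
    """Attend RGB query tokens over depth key/value tokens."""
    n, d = f_rgb.shape
    assert d % n_heads == 0
    dh = d // n_heads
    Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))
    Q, K, V = f_rgb @ Wq, f_depth @ Wk, f_depth @ Wv
    # reshape (tokens, d) -> (heads, tokens, head_dim)
    split = lambda x: x.reshape(n, n_heads, dh).transpose(1, 0, 2)
    Q, K, V = split(Q), split(K), split(V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(dh)      # scaled dot products
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)             # row-wise softmax
    out = (attn @ V).transpose(1, 0, 2).reshape(n, d)    # merge heads
    return out @ Wo                                      # output projection

rng = np.random.default_rng(0)
tokens = 10 * 10                                         # flattened H/32 x W/32 grid
f_rgb = rng.standard_normal((tokens, 64))
f_d = rng.standard_normal((tokens, 64))
fused = multi_head_cross_attention(f_rgb, f_d, n_heads=4, rng=rng)
print(fused.shape)                                       # (100, 64)
```

In the real module, the output would additionally pass through the projection and batch normalization described above.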
Spatial Feature Extractor (SFE): Applied at scales 1–3, the SFE processes the fused features with an initial convolution, splits the channels into separate horizontal and vertical convolution paths (reflecting anisotropic cue extraction), fuses the two paths, and finally projects the result via another convolution. The SFE omits dilated or deformable convolutions, yielding ≈1.2M parameters and 0.4G FLOPs over all relevant scales.
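The split-and-fuse pattern can be illustrated with a loose numpy sketch. Box-filter moving averages stand in for the learned 1D convolutions, and the kernel size `k` is an assumption (the source does not state it):

```python
import numpy as np

def sfe_sketch(x, k=3):
    """Split channels; run a horizontal (1 x k) path on one half and a
    vertical (k x 1) path on the other; concatenate to fuse."""
    h, w, c = x.shape
    a, b = x[..., : c // 2], x[..., c // 2:]
    pad = k // 2
    # horizontal 1 x k moving average with "same" padding
    ap = np.pad(a, ((0, 0), (pad, pad), (0, 0)), mode="edge")
    horiz = np.mean([ap[:, i:i + w] for i in range(k)], axis=0)
    # vertical k x 1 moving average with "same" padding
    bp = np.pad(b, ((pad, pad), (0, 0), (0, 0)), mode="edge")
    vert = np.mean([bp[i:i + h] for i in range(k)], axis=0)
    return np.concatenate([horiz, vert], axis=-1)

x = np.random.default_rng(0).standard_normal((20, 20, 64))
y = sfe_sketch(x)
print(y.shape)  # (20, 20, 64): same shape, anisotropically smoothed halves
```

Elongated rail defects (scratches, grooves) align with exactly these horizontal/vertical response patterns, which is the stated motivation for the anisotropic paths.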
2. Multi-Modal Fusion and Mask Prediction Pipeline
LPCANet processes an RGB-D pair, first extracting hierarchical color and depth features, then fusing them via CAM at each scale, and finally enhancing spatial cues using the SFE. The outputs are passed to upsampling/downsampling and a mask head to generate the final segmentation mask.
| Module | Params (M) | FLOPs (G) |
|---|---|---|
| Backbone | 3.42 | 1.15 |
| LPM | 2.10 | 0.15 |
| CAM (×4 scales) | 4.00 | 0.80 |
| SFE (scales 1–3) | 1.20 | 0.40 |
| Mask Head | 0.28 | 0.00 |
| Total | ≈9.90 | 2.50 |
The total parameter count is ≈9.90M, with end-to-end FLOPs of 2.5G. The model achieves an inference speed of 162.60 fps on standard hardware.
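The per-module FLOPs in the table sum to the reported end-to-end total, and the frame rate fixes the per-frame latency budget; a quick arithmetic check:

```python
# Per-module FLOPs (G) from the table above
flops_g = {"Backbone": 1.15, "LPM": 0.15, "CAM": 0.80, "SFE": 0.40, "Mask Head": 0.00}
total_g = sum(flops_g.values())
print(round(total_g, 2))          # 2.5 G, matching the reported total

# 162.60 fps implies the per-frame compute budget in milliseconds
ms_per_frame = 1000 / 162.60
print(round(ms_per_frame, 2))     # 6.15 ms per frame
```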
3. Training Paradigm
LPCANet is trained with binary cross-entropy loss on the final segmentation mask:

$$\mathcal{L}_{BCE} = -\frac{1}{N} \sum_{j=1}^{N} \left[ y_j \log \hat{y}_j + (1 - y_j) \log (1 - \hat{y}_j) \right]$$

where $\hat{y}_j$ is the predicted probability and $y_j$ the ground-truth label of pixel $j$. No IoU or auxiliary losses are used. Optimization employs AdamW with momentum 0.9 and weight decay 0.05, with the initial learning rate annealed via a cosine schedule over 50 epochs. Batch size is set to 16. Data augmentation includes random flips, cropping, rotation, Gaussian noise, and impulse noise to promote generalization.
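Both training ingredients are standard and can be sketched in a few lines. The initial learning rate `lr0` below is a placeholder, since the paper's value is not reproduced here:

```python
import numpy as np

def bce_loss(pred, target, eps=1e-7):
    """Binary cross-entropy over a predicted mask (probabilities in (0, 1))."""
    p = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(target * np.log(p) + (1 - target) * np.log(1 - p)))

def cosine_lr(epoch, total_epochs=50, lr0=1e-4):
    """Cosine annealing from lr0 down to 0 over the training run.
    lr0 is an illustrative placeholder, not the paper's value."""
    return 0.5 * lr0 * (1 + np.cos(np.pi * epoch / total_epochs))

pred = np.array([0.9, 0.1, 0.8])
target = np.array([1.0, 0.0, 1.0])
print(bce_loss(pred, target))      # small loss: predictions mostly agree
print(cosine_lr(0), cosine_lr(50)) # starts at lr0, decays to 0 at epoch 50
```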
Datasets used in supervised and unsupervised settings include NEU-RSDDS-AUG (1,500 train / 362 test), RSDD-TYPE1, and RSDD-TYPE2 for rail defects. Generalization is evaluated on DAGM2007, MT, and Kolektor-SDD2.
4. Quantitative Evaluation and Ablation
Performance metrics for segmentation include mean average precision (mAP), Intersection-over-Union (IoU), and Mean Absolute Error (MAE):

$$IoU = \frac{|P \cap G|}{|P \cup G|}, \qquad MAE = \frac{1}{N} \sum_{j=1}^{N} |\hat{y}_j - y_j|$$

where $P$ and $G$ are the predicted and ground-truth masks. On NEU-RSDDS-AUG, LPCANet achieves state-of-the-art results, improving over prior methods:
| Model | mAP | IoU | MAE | | | |
|---|---|---|---|---|---|---|
| CSEPNet | 94.40 | 82.22 | 8.88 | 88.43 | 92.37 | 83.10 |
| LPCANet | 94.43 | 83.08 | 7.11 | 88.57 | 92.17 | 84.58 |
| Δ | +0.03 | +0.86 | –1.77 | +0.14 | –0.20 | +1.48 |
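The two mask-level metrics above have direct definitions; a minimal sketch on a toy 4×4 example (the masks are made up for illustration):

```python
import numpy as np

def iou(pred, gt):
    """Intersection-over-Union of two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / union) if union else 1.0

def mae(pred, gt):
    """Mean absolute error between a predicted mask and the ground truth."""
    return float(np.mean(np.abs(pred.astype(float) - gt.astype(float))))

gt = np.zeros((4, 4), bool); gt[1:3, 1:3] = True       # 4-pixel defect
pred = np.zeros((4, 4), bool); pred[1:3, 1:4] = True   # over-segmented by 2 pixels
print(iou(pred, gt), mae(pred, gt))                    # 0.666... and 0.125
```

In the benchmark tables these values are reported as percentages (IoU) and scaled MAE, so higher IoU and lower MAE indicate better segmentation.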
Ablation demonstrates the importance of the CAM and SFE modules:
| Config | CAM | SFE | Params (M) | IoU | |
|---|---|---|---|---|---|
| Baseline | – | – | 9.20 | 80.12 | 81.80 |
| + CAM only | ✓ | – | 9.55 | 82.12 | 83.62 |
| + SFE only | – | ✓ | 9.75 | 81.54 | 83.01 |
| LPCANet (full) | ✓ | ✓ | 9.90 | 83.08 | 84.58 |
On non-rail datasets, LPCANet demonstrates strong generalization, often surpassing or matching the top results.
5. Design Rationale, Generalization, and Practical Utility
LPCANet’s modular design addresses competing requirements for accuracy, efficiency, and industrial deployability. The combination of MobileNetV2’s low computational cost, the LPM’s targeted depth encoding, and cross-attention fusion enables effective integration of multimodal cues within a restricted parameter and FLOP budget. The SFE’s anisotropic convolutional paths target defect structures with specific geometric orientations, common in rail inspection.
The model’s performance on out-of-domain datasets (DAGM2007, MT, Kolektor-SDD2) indicates robust generalization properties. This suggests potential for broader adoption across visual inspection domains beyond rail surfaces.
A plausible implication is that the explicit pyramid structure and absence of dilated/deformable convolutions in SFE facilitate both hardware efficiency and interpretability, which are often crucial in safety-critical industrial scenarios.
6. Limitations and Prospects for Model Compression
While LPCANet achieves notable reductions in computational requirements (9.90M params, 2.50G FLOPs), further advances are anticipated in model compression and real-time, on-device deployment. Planned directions include structured pruning (particularly targeting redundant cross-attention heads and SFE channels), knowledge distillation into even smaller architectures (e.g., MobileNetV3-based students), and 8-bit quantization for both weights and activations while minimizing accuracy loss. These strategies are expected to reduce model size below 5M parameters, enabling truly edge-resident rail inspection systems.
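Of the compression directions listed, 8-bit quantization is the most mechanical; a minimal sketch of symmetric per-tensor int8 weight quantization (a generic scheme, not the authors' planned implementation):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: w is approximated by scale * q."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.random.default_rng(0).standard_normal((64, 64)).astype(np.float32)
q, s = quantize_int8(w)
w_hat = q.astype(np.float32) * s          # dequantized reconstruction
print(q.dtype, float(np.max(np.abs(w - w_hat))))  # int8, error bounded by scale/2
```

Storing int8 weights instead of float32 cuts weight memory by 4×, which is the main lever for pushing the model toward the sub-5M-parameter edge-deployment target.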
7. Connections to Prior Work and Broader Impact
LPCANet narrows the functional and methodological gap between traditional symbolic vision pipelines and contemporary deep learning approaches by leveraging explicit cross-modal fusion and multiscale feature reasoning. By setting new benchmarks on both in-domain and general-purpose defect datasets, the architecture establishes a reference for scalable, accurate, and computationally efficient industrial inspection models, with expected impact across inspection, quality assurance, and autonomous monitoring sectors (Alex et al., 14 Jan 2026).