LPCANet: Lightweight Pyramid Cross-Attention Network
- The paper introduces LPCANet, integrating a MobileNetV2 backbone with pyramid depth encoding and multi-head cross-attention to detect rail surface defects.
- It leverages multi-scale fusion of RGB and depth features through a Lightweight Pyramid Module, Cross-Attention Mechanism, and Spatial Feature Extractor for precise segmentation.
- Quantitative results demonstrate state-of-the-art improvements in metrics like IoU and MAE across rail and non-rail defect datasets while ensuring real-time performance.
The Lightweight Pyramid Cross-Attention Network (LPCANet) is a specialized multi-modal deep learning architecture designed for efficient and accurate detection of rail surface defects using RGB-D (color and depth) image data. By integrating a lightweight convolutional backbone, a pyramid-based depth feature extractor, multi-scale cross-attention fusion, and a spatial feature enhancer, LPCANet achieves high accuracy with minimal computational burden, offering a practical solution for industrial defect inspection and real-time deployment in edge scenarios (Alex et al., 14 Jan 2026).
1. Architectural Components
LPCANet’s architecture consists of four principal modules: a MobileNetV2 RGB backbone, a Lightweight Pyramid Module (LPM) for depth, a multi-scale Cross-Attention Mechanism (CAM), and a Spatial Feature Extractor (SFE). Each module operates at multiple spatial scales, culminating in a final mask head for defect localization.
MobileNetV2 Backbone: This component is initialized with ImageNet-1K weights and extracts RGB features at four resolutions (H/4×W/4 to H/32×W/32), denoted $f_1^{rgb}, f_2^{rgb}, f_3^{rgb}, f_4^{rgb}$, with output channels [24, 32, 64, 160] respectively. The backbone’s detailed structure and per-layer parameter/FLOP counts are provided below.
| Layer | Output Size | Params (K) | FLOPs (M) |
|---|---|---|---|
| conv1 | 160×160×32 | 992 | 17.7 |
| Bottleneck2 | 80×80×24 | 5,376 | 150 |
| Bottleneck3 | 40×40×32 | 9,216 | 307 |
| Bottleneck4 | 20×20×64 | 18,432 | 1,179 |
| Bottleneck6 | 10×10×160 | 69,120 | 1,152 |
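The four-scale shape flow of the backbone can be sketched directly from the strides and channel counts above. A minimal sketch, assuming a 320×320 input (consistent with the 80×80 first-scale output in the table); the function name is illustrative:

```python
# Sketch: spatial sizes and channels of the four RGB feature scales,
# at strides 4, 8, 16, 32 with channels [24, 32, 64, 160].
def backbone_feature_shapes(h, w):
    """Return (height, width, channels) for each of the four scales."""
    strides = [4, 8, 16, 32]
    channels = [24, 32, 64, 160]
    return [(h // s, w // s, c) for s, c in zip(strides, channels)]

shapes = backbone_feature_shapes(320, 320)
for fh, fw, c in shapes:
    print(fh, fw, c)  # 80 80 24 ... down to 10 10 160
```

This matches the Bottleneck2–Bottleneck6 output sizes listed in the table.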
Lightweight Pyramid Module (LPM): The LPM processes the single-channel depth image and mirrors the multiscale extraction of the RGB backbone, producing depth features $f_1^{d}, \dots, f_4^{d}$ at the same spatial resolutions. Its structure consists of a 4×4 projection (stride 4) followed by a cascade of 3×3 convolutions and average pooling, resulting in feature maps with channel progression [64, 128, 256, 512]. The total parameter count for the LPM is ≈2.1M with about 1.9G FLOPs.
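The pyramid's shape flow can be traced with a minimal numpy sketch: a 4×4 stride-4 patch projection to 64 channels, then stages that halve resolution by average pooling and double the channels. The random matrices stand in for the learned projection and 3×3 convolutions; they illustrate shapes only, not the actual operators:

```python
import numpy as np

rng = np.random.default_rng(0)
depth = rng.standard_normal((320, 320))          # single-channel depth map (H x W)

# 4x4 stride-4 patch projection: (H/4, W/4, 16) patches -> 64 channels
patches = depth.reshape(80, 4, 80, 4).transpose(0, 2, 1, 3).reshape(80, 80, 16)
f = patches @ rng.standard_normal((16, 64))       # (80, 80, 64)

pyramid = [f]
for c_out in [128, 256, 512]:
    h, w, c_in = f.shape
    # 2x2 average pooling halves the spatial resolution
    f = f.reshape(h // 2, 2, w // 2, 2, c_in).mean(axis=(1, 3))
    # channel-mixing matmul as a stand-in for the 3x3 conv that doubles channels
    f = f @ rng.standard_normal((c_in, c_out))
    pyramid.append(f)

print([p.shape for p in pyramid])
```

The resulting shapes, (80, 80, 64) down to (10, 10, 512), match the stated channel progression at H/4 through H/32.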
Cross-Attention Mechanism (CAM): At each scale $i$, CAM fuses the RGB backbone ($f_i^{rgb}$) and LPM ($f_i^{d}$) features using multi-head cross-attention, with queries drawn from one stream and keys/values from the other (written here with RGB as the query source):

$$Q_i = f_i^{rgb} W_Q, \qquad K_i = f_i^{d} W_K, \qquad V_i = f_i^{d} W_V$$

After reshaping to $h$ heads of dimension $d_h$:

$$\mathrm{Attn}(Q_i, K_i, V_i) = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_h}}\right) V_i$$

These attended features are projected and batch-normalized, yielding the fused feature at scale $i$. The total parameter count for CAMs at all scales is ≈4.0M, with ≈0.8G FLOPs.
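A minimal numpy sketch of multi-head cross-attention over flattened feature maps. The Q/K/V assignment (RGB as queries, depth as keys/values), the head count, and the random weights are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

def multi_head_cross_attention(f_rgb, f_depth, n_heads, rng):
    """Attend RGB query tokens over depth key/value tokens."""
    n, d = f_rgb.shape
    assert d % n_heads == 0
    dh = d // n_heads
    Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))
    Q, K, V = f_rgb @ Wq, f_depth @ Wk, f_depth @ Wv
    # reshape (tokens, d) -> (heads, tokens, head_dim)
    split = lambda x: x.reshape(n, n_heads, dh).transpose(1, 0, 2)
    Q, K, V = split(Q), split(K), split(V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(dh)      # scaled dot products
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)             # row-wise softmax
    out = (attn @ V).transpose(1, 0, 2).reshape(n, d)    # merge heads
    return out @ Wo                                      # output projection

rng = np.random.default_rng(0)
tokens = 10 * 10                                         # flattened H/32 x W/32 grid
f_rgb = rng.standard_normal((tokens, 64))
f_d = rng.standard_normal((tokens, 64))
fused = multi_head_cross_attention(f_rgb, f_d, n_heads=4, rng=rng)
print(fused.shape)                                       # (100, 64)
```

In the real module, the output would additionally pass through the projection and batch normalization described above.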
Spatial Feature Extractor (SFE): Applied at scales 1–3, the SFE processes the fused features with an initial convolution, splits the channels into separate horizontal and vertical convolution paths (reflecting anisotropic cue extraction), fuses the two paths, and finally projects the result via another convolution. The SFE omits dilated or deformable convolutions, yielding ≈1.2M parameters and 0.4G FLOPs over all relevant scales.
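The split-and-fuse pattern can be illustrated with a loose numpy sketch. Box-filter moving averages stand in for the learned 1D convolutions, and the kernel size `k` is an assumption (the source does not state it):

```python
import numpy as np

def sfe_sketch(x, k=3):
    """Split channels; run a horizontal (1 x k) path on one half and a
    vertical (k x 1) path on the other; concatenate to fuse."""
    h, w, c = x.shape
    a, b = x[..., : c // 2], x[..., c // 2:]
    pad = k // 2
    # horizontal 1 x k moving average with "same" padding
    ap = np.pad(a, ((0, 0), (pad, pad), (0, 0)), mode="edge")
    horiz = np.mean([ap[:, i:i + w] for i in range(k)], axis=0)
    # vertical k x 1 moving average with "same" padding
    bp = np.pad(b, ((pad, pad), (0, 0), (0, 0)), mode="edge")
    vert = np.mean([bp[i:i + h] for i in range(k)], axis=0)
    return np.concatenate([horiz, vert], axis=-1)

x = np.random.default_rng(0).standard_normal((20, 20, 64))
y = sfe_sketch(x)
print(y.shape)  # (20, 20, 64): same shape, anisotropically smoothed halves
```

Elongated rail defects (scratches, grooves) align with exactly these horizontal/vertical response patterns, which is the stated motivation for the anisotropic paths.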
2. Multi-Modal Fusion and Mask Prediction Pipeline
LPCANet processes an RGB-D pair, first extracting hierarchical color and depth features, then fusing them via CAM at each scale, and finally enhancing spatial cues using the SFE. The outputs are passed to upsampling/downsampling and a mask head to generate the final segmentation mask.
| Module | Params (M) | FLOPs (G) |
|---|---|---|
| Backbone | 3.42 | 1.15 |
| LPM | 2.10 | 0.15 |
| CAM (×4 scales) | 4.00 | 0.80 |
| SFE (scales 1–3) | 1.20 | 0.40 |
| Mask Head | 0.28 | 0.00 |
| Total | ≈9.90 | 2.50 |
The total parameter count is ≈9.90M, with end-to-end FLOPs of 2.5G. The model achieves an inference speed of 162.60 fps on standard hardware.
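The per-module FLOPs in the table sum to the reported end-to-end total, and the frame rate fixes the per-frame latency budget; a quick arithmetic check:

```python
# Per-module FLOPs (G) from the table above
flops_g = {"Backbone": 1.15, "LPM": 0.15, "CAM": 0.80, "SFE": 0.40, "Mask Head": 0.00}
total_g = sum(flops_g.values())
print(round(total_g, 2))          # 2.5 G, matching the reported total

# 162.60 fps implies the per-frame compute budget in milliseconds
ms_per_frame = 1000 / 162.60
print(round(ms_per_frame, 2))     # 6.15 ms per frame
```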
3. Training Paradigm
LPCANet is trained with binary cross-entropy loss on the final segmentation mask:

$$\mathcal{L}_{BCE} = -\frac{1}{N} \sum_{j=1}^{N} \left[ y_j \log \hat{y}_j + (1 - y_j) \log (1 - \hat{y}_j) \right]$$

where $\hat{y}_j$ is the predicted probability and $y_j$ the ground-truth label of pixel $j$. No IoU or auxiliary losses are used. Optimization employs AdamW with momentum 0.9 and weight decay 0.05, with the initial learning rate annealed via a cosine schedule over 50 epochs. Batch size is set to 16. Data augmentation includes random flips, cropping, rotation, Gaussian noise, and impulse noise to promote generalization.
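Both training ingredients are standard and can be sketched in a few lines. The initial learning rate `lr0` below is a placeholder, since the paper's value is not reproduced here:

```python
import numpy as np

def bce_loss(pred, target, eps=1e-7):
    """Binary cross-entropy over a predicted mask (probabilities in (0, 1))."""
    p = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(target * np.log(p) + (1 - target) * np.log(1 - p)))

def cosine_lr(epoch, total_epochs=50, lr0=1e-4):
    """Cosine annealing from lr0 down to 0 over the training run.
    lr0 is an illustrative placeholder, not the paper's value."""
    return 0.5 * lr0 * (1 + np.cos(np.pi * epoch / total_epochs))

pred = np.array([0.9, 0.1, 0.8])
target = np.array([1.0, 0.0, 1.0])
print(bce_loss(pred, target))      # small loss: predictions mostly agree
print(cosine_lr(0), cosine_lr(50)) # starts at lr0, decays to 0 at epoch 50
```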
Datasets used in supervised and unsupervised settings include NEU-RSDDS-AUG (1,500 train / 362 test), RSDD-TYPE1, and RSDD-TYPE2 for rail defects. Generalization is evaluated on DAGM2007, MT, and Kolektor-SDD2.
4. Quantitative Evaluation and Ablation
Performance metrics for segmentation include mean average precision (mAP), Intersection-over-Union (IoU), and Mean Absolute Error (MAE):

$$IoU = \frac{|P \cap G|}{|P \cup G|}, \qquad MAE = \frac{1}{N} \sum_{j=1}^{N} |\hat{y}_j - y_j|$$

where $P$ and $G$ are the predicted and ground-truth masks. On NEU-RSDDS-AUG, LPCANet achieves state-of-the-art results, improving over prior methods:
| Model | mAP | IoU | MAE | | | |
|---|---|---|---|---|---|---|
| CSEPNet | 94.40 | 82.22 | 8.88 | 88.43 | 92.37 | 83.10 |
| LPCANet | 94.43 | 83.08 | 7.11 | 88.57 | 92.17 | 84.58 |
| Δ | +0.03 | +0.86 | –1.77 | +0.14 | –0.20 | +1.48 |
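The two mask-level metrics above have direct definitions; a minimal sketch on a toy 4×4 example (the masks are made up for illustration):

```python
import numpy as np

def iou(pred, gt):
    """Intersection-over-Union of two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / union) if union else 1.0

def mae(pred, gt):
    """Mean absolute error between a predicted mask and the ground truth."""
    return float(np.mean(np.abs(pred.astype(float) - gt.astype(float))))

gt = np.zeros((4, 4), bool); gt[1:3, 1:3] = True       # 4-pixel defect
pred = np.zeros((4, 4), bool); pred[1:3, 1:4] = True   # over-segmented by 2 pixels
print(iou(pred, gt), mae(pred, gt))                    # 0.666... and 0.125
```

In the benchmark tables these values are reported as percentages (IoU) and scaled MAE, so higher IoU and lower MAE indicate better segmentation.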
Ablation demonstrates the importance of the CAM and SFE modules:
| Config | CAM | SFE | Params (M) | IoU | |
|---|---|---|---|---|---|
| Baseline | – | – | 9.20 | 80.12 | 81.80 |
| + CAM only | ✓ | – | 9.55 | 82.12 | 83.62 |
| + SFE only | – | ✓ | 9.75 | 81.54 | 83.01 |
| LPCANet (full) | ✓ | ✓ | 9.90 | 83.08 | 84.58 |
On non-rail datasets, LPCANet demonstrates strong generalization, often surpassing or matching the top results.
5. Design Rationale, Generalization, and Practical Utility
LPCANet’s modular design addresses competing requirements for accuracy, efficiency, and industrial deployability. The combination of MobileNetV2’s low computational cost, the LPM’s targeted depth encoding, and cross-attention fusion enables effective integration of multimodal cues within a restricted parameter and FLOP budget. The SFE’s anisotropic convolutional paths target defect structures with specific geometric orientations, common in rail inspection.
The model’s performance on out-of-domain datasets (DAGM2007, MT, Kolektor-SDD2) indicates robust generalization properties. This suggests potential for broader adoption across visual inspection domains beyond rail surfaces.
A plausible implication is that the explicit pyramid structure and absence of dilated/deformable convolutions in SFE facilitate both hardware efficiency and interpretability, which are often crucial in safety-critical industrial scenarios.
6. Limitations and Prospects for Model Compression
While LPCANet achieves notable reductions in computational requirements (9.90M params, 2.50G FLOPs), further advances are anticipated in model compression and real-time, on-device deployment. Planned directions include structured pruning (particularly targeting redundant cross-attention heads and SFE channels), knowledge distillation into even smaller architectures (e.g., MobileNetV3-based students), and 8-bit quantization for both weights and activations while minimizing accuracy loss. These strategies are expected to reduce model size below 5M parameters, enabling truly edge-resident rail inspection systems.
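Of the compression directions listed, 8-bit quantization is the most mechanical; a minimal sketch of symmetric per-tensor int8 weight quantization (a generic scheme, not the authors' planned implementation):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: w is approximated by scale * q."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.random.default_rng(0).standard_normal((64, 64)).astype(np.float32)
q, s = quantize_int8(w)
w_hat = q.astype(np.float32) * s          # dequantized reconstruction
print(q.dtype, float(np.max(np.abs(w - w_hat))))  # int8, error bounded by scale/2
```

Storing int8 weights instead of float32 cuts weight memory by 4×, which is the main lever for pushing the model toward the sub-5M-parameter edge-deployment target.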
7. Connections to Prior Work and Broader Impact
LPCANet narrows the functional and methodological gap between traditional symbolic vision pipelines and contemporary deep learning approaches by leveraging explicit cross-modal fusion and multiscale feature reasoning. By setting new benchmarks on both in-domain and general-purpose defect datasets, the architecture establishes a reference for scalable, accurate, and computationally efficient industrial inspection models, with expected impact across inspection, quality assurance, and autonomous monitoring sectors (Alex et al., 14 Jan 2026).