
Convolutional Cargo-Occupancy Classifier

Updated 1 December 2025
  • The classifier employs VGG-style CNNs with rigorous image preprocessing and geometric oversampling to accurately detect cars in X-ray cargo imagery.
  • It addresses severe class imbalance by applying two window extraction strategies, achieving high performance even under adversarial occlusion.
  • Key performance metrics include 100% TPR at 0.22% FPR, with robust deployment potential in non-intrusive inspection systems at transport hubs.

A convolutional cargo-occupancy classifier operationalizes automated detection of specific objects, such as cars, within X-ray imagery of cargo containers. Deployed as part of non-intrusive inspection systems at transport hubs, such classifiers leverage convolutional neural networks (CNNs) for the identification of occupancy patterns in complex settings, including challenging cases with partial or full occlusion. The approach centers on VGG-style deep convolutional architectures, rigorous image preprocessing, geometric oversampling to mitigate positive-class scarcity, and robust evaluation under severe class imbalance and adversarial visual clutter (Jaccard et al., 2016).

1. Dataset Composition and Image Preprocessing

The positive class comprises 79 annotated X-ray images containing 192 cars, stratified across five operational sub-categories: single cars, two-car groupings in 40-foot containers, vertically stacked cars, cars adjacent to heterogeneous cargo, and cars partially or completely occluded. The negative class contains 30,000 randomly sampled “Stream-of-Commerce” (SoC) images, subdivided into three exclusive partitions of 10,000 images each for training, validation, and test. Approximately 20% of the negative samples are empty containers, with the remainder spanning pallets, bulk materials, industrial machinery, vans, motorbikes, and other non-car cargo.

All images are 16-bit greyscale at resolutions ranging from 1290×850 to 2570×850 pixels (≈6 mm horizontal pixel size). The preprocessing pipeline consists of:

  1. Black-stripe removal: Detection and excision of columns entirely zeroed by detector faults or X-ray source misfires.
  2. Column-wise normalisation: Rescaling of each column to align with air-attenuation statistics, compensating for source intensity variation and sensor non-uniformity.
  3. Impulse-noise removal: Median filtering to replace isolated bright or dark pixels with local medians.
  4. Log-transform (optional): Application of I′(x, y) = log I(x, y) to enhance interpretability and marginally improve CNN performance.
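
A minimal NumPy/SciPy sketch of these four steps, assuming a 16-bit greyscale input; the 99th-percentile air estimate, the reference air value, and the 3×3 median kernel are illustrative choices, not values from the paper:

```python
import numpy as np
from scipy.ndimage import median_filter

def preprocess(image: np.ndarray, air_value: float = 60000.0) -> np.ndarray:
    """Four-step preprocessing of a 16-bit greyscale X-ray image."""
    img = image.astype(np.float64)

    # 1. Black-stripe removal: drop columns zeroed by detector faults
    #    or X-ray source misfires.
    img = img[:, img.max(axis=0) > 0]

    # 2. Column-wise normalisation: rescale each column so its brightest
    #    (air) pixels match a common reference attenuation level.
    col_air = np.percentile(img, 99, axis=0)   # per-column air estimate
    img *= air_value / np.maximum(col_air, 1.0)

    # 3. Impulse-noise removal: replace isolated bright/dark pixels
    #    with local medians.
    img = median_filter(img, size=3)

    # 4. Optional log-transform: I'(x, y) = log I(x, y).
    return np.log(np.maximum(img, 1.0))
```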

2. Positive-Window Oversampling Strategy

Severe class imbalance, with only 192 positive (car) instances, necessitates aggressive geometric oversampling. For each image, a hand-drawn region-of-interest (ROI) mask R denotes car locations. Windows w of fixed size are extracted and labeled positive if the proportion of overlap satisfies:

\frac{|w \cap R|}{|w|} \geq t_{ROI}

where |·| denotes pixel count. Two window configurations are employed:

  • Square: W = H = 512 px, t_ROI = 0.50 (≈140-fold oversampling)
  • Rectangular: W = 1050 px, H = 350 px (matching the mean car aspect ratio), t_ROI = 0.65 (≈50-fold oversampling)

All windows, both positive and negative, are sampled at a 32 px stride during training to facilitate GPU batch-filling.
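
The extraction rule reduces to a short sliding-window routine. Below is a sketch assuming a binary ROI mask of the same shape as the image; the paper's two configurations correspond to (win_h, win_w, t_roi) = (512, 512, 0.50) and (350, 1050, 0.65):

```python
import numpy as np

def extract_windows(image, roi_mask, win_h, win_w, stride=32, t_roi=0.5):
    """Slide a fixed-size window at the given stride; a window is positive
    when at least a fraction t_roi of its pixels fall inside the ROI."""
    samples = []
    H, W = image.shape
    area = win_h * win_w
    for y in range(0, H - win_h + 1, stride):
        for x in range(0, W - win_w + 1, stride):
            overlap = roi_mask[y:y + win_h, x:x + win_w].sum()
            label = int(overlap / area >= t_roi)   # |w ∩ R| / |w| >= t_ROI
            samples.append((image[y:y + win_h, x:x + win_w], label))
    return samples
```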

3. Convolutional Neural Network Architecture

Two VGG-style deep CNNs, a 19-layer network (16 convolutional, 3 fully connected) and a shallower 11-layer variant, are trained from scratch on log-transformed 256×256 image crops. Core architectural features include:

  • All convolutional layers: 3×3 filters, stride 1, padding 1, ReLU activation, batch normalization; no dropout.
  • Max pooling follows select convolutional blocks, halving spatial dimensions.
  • Fully connected layers: FC1 and FC2 with 4096 units, batch normalization, and ReLU; FC3 outputs 2-class softmax.
  • Comparison is provided via a pre-trained VGG-19 (ImageNet), for which single-channel cargo images are replicated across the three RGB channels and resized to 224×224.
A representative layer configuration:

| Layer | Output Size | Details |
|---|---|---|
| Input | 256×256×1 | Mean subtracted |
| Conv3-64 ×2, BN, ReLU | 256×256×64 | |
| MaxPool (2×2, s=2) | 128×128×64 | |
| Conv3-128 ×2, BN, ReLU | 128×128×128 | |
| MaxPool (2×2, s=2) | 64×64×128 | |
| Conv3-256 ×2, BN, ReLU | 64×64×256 | |
| MaxPool (2×2, s=2) | 32×32×256 | |
| Conv3-512 ×2, BN, ReLU | 32×32×512 | |
| MaxPool (2×2, s=2) | 16×16×512 | |
| Conv3-512 ×2, BN, ReLU | 16×16×512 | |
| MaxPool (2×2, s=2) | 8×8×512 | |
| FC1 (4096), BN, ReLU | 4096 | |
| FC2 (4096), BN, ReLU | 4096 | |
| FC3 (2), Softmax | 2 | |
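
For concreteness, here is a PyTorch sketch of the tabulated configuration. It follows the table (ten convolutional and three fully connected layers) rather than the deeper 19-layer network, and the class and function names are illustrative:

```python
import torch
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    # Two 3x3 convolutions (stride 1, padding 1) with BN and ReLU,
    # followed by 2x2 max pooling that halves spatial dimensions.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=1, padding=1),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.MaxPool2d(2, 2),
    )

class CargoCNN(nn.Module):
    """VGG-style network for 256x256 single-channel windows."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(1, 64),     # 256 -> 128
            conv_block(64, 128),   # 128 -> 64
            conv_block(128, 256),  # 64  -> 32
            conv_block(256, 512),  # 32  -> 16
            conv_block(512, 512),  # 16  -> 8
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(8 * 8 * 512, 4096), nn.BatchNorm1d(4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 4096), nn.BatchNorm1d(4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),  # softmax is applied in the loss
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```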

4. Training Protocol

Networks are trained via stochastic gradient descent (SGD) with Nesterov momentum (μ ≈ 0.9), weight decay λ = 5×10⁻⁴, and a batch size of 64. The learning rate is initialized at η₀ = 10⁻⁴ and reduced to η₁ = 10⁻⁵ when the validation loss plateaus. Training proceeds until no further improvement on validation data, typically 20–30 epochs.

The cost function combines softmax cross-entropy with L2 regularization:

L = -\sum_{i,c} y_{i,c} \log p_{i,c} + \lambda \|W\|_2^2

where y_{i,c} is the one-hot ground-truth label, p_{i,c} the predicted probability, and W the network weights.
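
A training-loop sketch under these settings is shown below. The plateau patience and decay factor are assumptions (the paper specifies only the two learning-rate values), and the optimizer's weight_decay argument implements the λ‖W‖² penalty:

```python
import torch

model = CargoCNN()
criterion = torch.nn.CrossEntropyLoss()   # softmax cross-entropy
# SGD with Nesterov momentum; weight_decay adds the L2 penalty on W.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9,
                            nesterov=True, weight_decay=5e-4)
# Drop the learning rate from 1e-4 to 1e-5 when validation loss plateaus
# (the patience value is an assumption).
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, factor=0.1, patience=3, min_lr=1e-5)

def train_one_epoch(loader):
    model.train()
    for windows, labels in loader:        # batches of 64 windows
        optimizer.zero_grad()
        loss = criterion(model(windows), labels)
        loss.backward()
        optimizer.step()

# After each epoch: scheduler.step(validation_loss)
```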

5. Evaluation Metrics and Performance

Performance assessment employs:

  • True Positive Rate (TPR): TPR = TP / (TP + FN)
  • False Positive Rate (FPR): FPR = FP / (FP + TN)
  • Accuracy, ROC AUC, and the H-measure (a β-weighted alternative to AUC suited to class imbalance)

Cross-validation uses leave-one-out (LOOCV) for the positives and hold-out partitions for the negatives. The decision threshold t_CAR is tuned using the negative validation set and the LOOCV positives.
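
The headline metrics and a simple threshold rule can be sketched as follows; choosing t_CAR as the lowest validation score among positives (so TPR stays at 1.0) is an illustrative assumption consistent with the reported operating point:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def tpr_fpr(y_true: np.ndarray, scores: np.ndarray, t_car: float):
    """TPR = TP/(TP+FN) and FPR = FP/(FP+TN) at threshold t_CAR."""
    pred = scores >= t_car
    pos, neg = y_true == 1, y_true == 0
    tpr = (pred & pos).sum() / pos.sum()
    fpr = (pred & neg).sum() / neg.sum()
    return tpr, fpr

def tune_threshold(y_val: np.ndarray, scores_val: np.ndarray) -> float:
    # Largest threshold that still accepts every LOOCV positive.
    return scores_val[y_val == 1].min()

# auc = roc_auc_score(y_test, scores_test)   # ROC AUC on held-out data
```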

Key results:

  • A 19-layer CNN whose FC1 features feed a random forest yields 100% car-image classification (TPR = 1.00) at FPR = 0.22%, roughly one false alarm per 454 non-car images; see the sketch after this list.
  • ROC AUC ≈ 0.997, H-measure ≈ 0.995.
  • A pre-trained VGG-19 (with an SVM on FC2 activations) gives FPR = 0.34%, indicating some transferability of natural-image features.
  • The detector maintains p_I ≥ 0.99 under strong synthetic occlusion, up to ROI attenuation δ ≈ 0.8, failing only at near-total occlusion.
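
The hybrid scorer from the first bullet might look as follows, reusing the CargoCNN sketch from Section 3; the forest size and the use of post-ReLU FC1 activations are assumptions:

```python
import torch
from sklearn.ensemble import RandomForestClassifier  # used below

def fc1_features(model, windows: torch.Tensor):
    """Return 4096-d FC1 activations (after BN and ReLU) as the feature
    representation fed to the random forest."""
    model.eval()
    with torch.no_grad():
        x = model.features(windows)
        for layer in model.classifier[:4]:   # Flatten, FC1, BN, ReLU
            x = layer(x)
    return x.numpy()

# Fit on training windows, score held-out windows:
# rf = RandomForestClassifier(n_estimators=500)        # size assumed
# rf.fit(fc1_features(model, X_train), y_train)
# scores = rf.predict_proba(fc1_features(model, X_test))[:, 1]
```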

6. Robustness, Deployment, and Generalization

Synthetic adversarial occlusion, in which car images are systematically overlaid with patches of real non-car cargo until the ROI is fully attenuated, demonstrates robustness up to extreme visual clutter (δ ≈ 0.8). Detection performance degrades only under implausibly extreme concealment, such as full coverage with dense materials.
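
A compositing sketch under a simple physical assumption: in transmission X-ray imaging, stacked materials multiply attenuations, so log-transformed images add relative to the air level. The δ scaling and the function's shape are illustrative, not the paper's exact procedure:

```python
import numpy as np

def occlude_log_image(car_log: np.ndarray, cargo_log: np.ndarray,
                      roi_mask: np.ndarray, air_log: float, delta: float):
    """Superimpose non-car cargo over the car ROI of a log-transformed
    image; delta in [0, 1] sweeps from no occlusion to the full patch.
    cargo_log must have the same shape as car_log."""
    out = car_log.copy()
    added = delta * (cargo_log - air_log)   # attenuation added by the cargo
    out[roi_mask] += added[roi_mask]        # lower (darker) values = denser
    return out
```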

Full-image inference (preprocessing, window sampling at a 64 px stride, CNN forward passes, and random forest classification) takes approximately 2.6 seconds per image on a workstation with an Intel Xeon E5-1620 CPU, a Titan X GPU, and 32 GB RAM. Optimizations such as a C++/CUDA implementation, window batching, and multi-GPU parallelism could reduce this to sub-second latency, supporting high-throughput deployment. The system is GPU-accelerated but capable of CPU-only operation at lower throughput.
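
Combining the earlier sketches gives an end-to-end inference routine. The 256 px window size, the max-over-windows image decision, and the batching are plumbing assumptions the summary does not specify:

```python
import numpy as np
import torch

def classify_image(model, rf, raw_image, win=256, stride=64, t_car=0.5):
    """Preprocess, sample windows at a 64 px stride, score each with
    FC1 features + random forest, and flag the image as containing a
    car if the highest window score reaches t_CAR."""
    img = preprocess(raw_image)                     # Section 1 pipeline
    H, W = img.shape
    crops = [img[y:y + win, x:x + win]
             for y in range(0, H - win + 1, stride)
             for x in range(0, W - win + 1, stride)]
    batch = torch.tensor(np.stack(crops), dtype=torch.float32).unsqueeze(1)
    scores = rf.predict_proba(fc1_features(model, batch))[:, 1]
    return bool(scores.max() >= t_car)
```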

The workflow is readily extensible to other object classes and tasks (e.g., pallets, machinery, container-occupancy detection), requiring only ROI annotation, window oversampling, and training with a minimum of 50–100 positive samples plus a shared negative pool.

7. Significance and Application Scope

Automated detection of cars within cargo X-ray imagery addresses operational bottlenecks in customs inspection by reducing dependence on manual image review and increasing operator throughput. The classifier achieves operationally viable performance in both ideal and adversarial settings, including recognition of cars that are stacked, positioned with other bulk cargo, or deliberately concealed. The architecture and training protocol generalize to multi-class cargo occupancy scenarios, contingent on annotation availability and positive-sample oversampling (Jaccard et al., 2016).

A plausible implication is that, given its robustness to occlusion and high throughput under GPU acceleration, this approach provides a scalable template for broader security-screening automation in cargo logistics and regulatory compliance.

References

Jaccard, N., Rogers, T. W., Morton, E. J., & Griffin, L. D. (2016). Detection of concealed cars in complex cargo X-ray imagery using deep learning.
