HitoMi-Cam: Spectral Person Detection
- HitoMi-Cam is a lightweight, shape-agnostic person detection system that leverages per-pixel spectral signatures from clothing for robust performance.
- It employs a two-tier architecture with offline spectral model training and online MLP inference on Raspberry Pi, achieving 93.5% AP at 23.2 fps in simulated SAR scenarios.
- The method complements conventional CNN detectors, minimizing false positives while remaining robust to unpredictable, extreme postures and requiring no GPU acceleration.
HitoMi-Cam is a lightweight, shape-agnostic person detection system that exploits the spectral reflectance properties of clothing, evaluated for real-time use on edge hardware without GPU acceleration. The method departs from standard convolutional neural network (CNN)-based object detectors, which typically depend on spatial or shape priors and therefore degrade when faced with postures or object geometries not represented in their training data. HitoMi-Cam instead leverages per-pixel spectral signatures, providing robust detection in search and rescue (SAR) and other scenarios characterized by unpredictable human shapes and occlusions. The approach runs at 23.2 frames per second (fps) at 253×190 pixel resolution and attains an average precision (AP) of 93.5% in simulated SAR scenarios, substantially higher than the best-performing CNN comparator (AP 53.8%). The system produces minimal false positives across all evaluated environments and serves as a complementary modality to established CNN pipelines.
1. System Architecture and Processing Pipeline
HitoMi-Cam comprises a two-tier architecture encompassing offline spectral model learning and online lightweight inference. The offline component operates on a GPU-equipped workstation, processing a hyperspectral dataset of 84 clothing textiles and 35 background materials, each sampled over 167 bands spanning the visible to near-infrared (VIS–NIR) spectrum.
Band Selection: A combinatorial search determines four optimal band centroids for discrimination: 457 nm, 565 nm, 645 nm, and 735 nm.

Classifier: The core model is a lightweight multilayer perceptron (MLP) with two hidden layers (16 and 8 ReLU-activated units, respectively) and a 49-dimensional softmax output representing clothing/background subclasses (41 effective classes). Training uses the Adam optimizer with early stopping, augmented by luminance scaling and background variation. The trained classifier is exported in ONNX format.
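As a rough illustration of the offline component, the sketch below builds the described MLP and exports it to ONNX. The layer sizes, Adam optimizer, and ONNX export come from the text; the choice of PyTorch, the file name, and the export arguments are assumptions.

```python
import torch
import torch.nn as nn

class SpectralMLP(nn.Module):
    """4 band luminances -> 16 -> 8 -> 49 class logits, as described above."""
    def __init__(self, n_bands: int = 4, n_classes: int = 49):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_bands, 16), nn.ReLU(),
            nn.Linear(16, 8), nn.ReLU(),
            nn.Linear(8, n_classes),   # softmax is applied at inference time
        )

    def forward(self, x):
        return self.net(x)

model = SpectralMLP()
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()      # operates on raw logits

# ... training loop over (4-band luminance, subclass label) pairs with
#     luminance-scaling / background augmentation and early stopping ...

# Export for on-device inference with ONNX Runtime (file name is illustrative).
dummy = torch.randn(1, 4)
torch.onnx.export(
    model, dummy, "hitomi_mlp.onnx",
    input_names=["bands"], output_names=["scores"],
    dynamic_axes={"bands": {0: "batch"}, "scores": {0: "batch"}},
)
```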
The online inference tier runs on a Raspberry Pi 5 paired with a PiTOMBO compound-eye 4-band camera, which integrates narrow bandpass filters for the aforementioned wavelengths (bandwidths: 457 ± 18, 565 ± 12, 645 ± 10.5, 735 ± 14.5 nm, field of view ~40°). The camera delivers 253×190 pixel frames (2×2 binning mode).
Processing steps:
- Raw frame acquisition
- White-balance calibration (coefficients per-scene via reference white plate)
- Extraction of per-pixel 4D luminance vectors
- ONNX Runtime-based per-pixel MLP inference yielding 49-class scores
- Clothing/background mask generation (via argmax for clothing classes)
- Noise suppression, morphological closing, connected component analysis yielding bounding-box outputs.
Pipeline Sketch:
```
load ONNX_model
while camera_streaming:
    raw4 = capture_4band()
    wb4  = white_balance(raw4, coeffs)
    mask = zeros(resolution)
    for each pixel p:
        x = wb4[p]                # 4-vector of band luminances
        s = MLP_ONNX.infer(x)     # 49-vector of class scores
        if argmax(s) ∈ clothing:
            mask[p] = 1
    mask  = remove_noise(mask)
    mask  = morphology_close(mask)
    boxes = find_connected_bboxes(mask)
    output(boxes)                 # each box reported with confidence = 1.0
```
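For concreteness, below is a minimal Python sketch of the mask-generation portion of this loop using ONNX Runtime and NumPy; the exported model path, the camera acquisition, and the set of clothing class indices are placeholders for the components described above. Batching all pixels into a single inference call avoids a slow per-pixel Python loop while remaining pixel-wise in effect.

```python
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("hitomi_mlp.onnx")       # model exported offline
input_name = sess.get_inputs()[0].name

def clothing_mask(raw4, wb_coeffs, clothing_classes):
    """raw4: (H, W, 4) raw band frame; wb_coeffs: (4,) white-balance weights."""
    h, w, _ = raw4.shape
    wb4 = raw4.astype(np.float32) * wb_coeffs        # per-channel calibration
    pixels = wb4.reshape(-1, 4)                      # one 4-vector per pixel
    scores = sess.run(None, {input_name: pixels})[0] # (H*W, 49) class scores
    labels = scores.argmax(axis=1).reshape(h, w)
    return np.isin(labels, clothing_classes).astype(np.uint8)
```

The resulting mask then passes through the noise suppression, morphological closing, and connected-component steps listed above.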
2. Physical and Signal Processing Foundations
HitoMi-Cam detection is rooted in the optical properties of textiles and their interaction with ambient illumination. Textile types (polyester, cotton, wool) exhibit distinctive reflectance spectra under outdoor lighting conditions in both visible and NIR regimes.
Measurement Model:
For each pixel $p$ and each filter $k$, the measured band value is

$$v_k(p) = \int E(\lambda; p)\, T_k(\lambda)\, Q(\lambda)\, d\lambda,$$

where $E(\lambda; p)$ is the spectral irradiance, $T_k(\lambda)$ the filter transmission, and $Q(\lambda)$ the sensor quantum efficiency. After application of empirically determined white-balance weights $w_k$, the channel luminances are $x_k(p) = w_k\, v_k(p)$.
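To make the calibration concrete, the sketch below estimates per-scene white-balance weights from a reference white plate; the rectangular plate region and the target level are assumed inputs, since the text only states that a reference plate is used per acquisition environment.

```python
import numpy as np

def estimate_wb_coeffs(raw4, plate_region, target=1.0):
    """raw4: (H, W, 4) raw band image; plate_region: (y0, y1, x0, x1) covering the plate."""
    y0, y1, x0, x1 = plate_region
    plate_mean = raw4[y0:y1, x0:x1].reshape(-1, 4).mean(axis=0)  # mean value per band
    return target / plate_mean   # w_k scales each band so the plate reads `target`
```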
Pixel-wise Classification:
Each pixel’s 4D vector is independently classified, eliminating any reliance on spatial continuity or global shape hypotheses.
3. Mathematical Formulations and Computational Algorithms
Spectral Sampling Approximation:
The continuous measurement integral is approximated by discrete sampling near each band center $\lambda_k$:

$$v_k(p) \approx \sum_j E(\lambda_j; p)\, T_k(\lambda_j)\, Q(\lambda_j)\, \Delta\lambda,$$

where $j$ indexes the discrete spectral samples and $\Delta\lambda$ is the sampling step.
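The toy NumPy sketch below illustrates this approximation; the irradiance, transmission, and quantum-efficiency curves are synthetic placeholders rather than measured data, and the band shown corresponds to the ~565 nm channel.

```python
import numpy as np

wavelengths = np.arange(400, 801, 5)                        # nm, 5 nm sampling step
delta = 5.0                                                 # nm per sample
E = np.random.rand(wavelengths.size)                        # stand-in spectral irradiance
T_k = np.exp(-((wavelengths - 565) / 12.0) ** 2)            # rough ~565 nm bandpass
Q = np.clip(1.0 - np.abs(wavelengths - 600) / 400.0, 0, 1)  # rough sensor QE curve

v_k = np.sum(E * T_k * Q) * delta                           # discrete band value
```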
MLP Classifier Structure:
The inferred class for pixel $p$ is $\hat{c}(p) = \arg\max_c \mathrm{softmax}\big(f_\theta(x(p))\big)_c$, with final decision mask $m(p) = 1$ if $\hat{c}(p)$ is a clothing subclass and $m(p) = 0$ otherwise. No spatial context is applied beyond subsequent morphological and connected-components postprocessing.
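As a concrete illustration of that postprocessing, the OpenCV sketch below turns the per-pixel clothing mask into bounding boxes; the kernel size and minimum-area threshold are illustrative choices, not values taken from the paper.

```python
import cv2
import numpy as np

def mask_to_boxes(mask, min_area=20):
    """mask: (H, W) uint8 clothing mask; returns a list of (x, y, w, h) boxes."""
    kernel = np.ones((3, 3), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)   # suppress speckle noise
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)  # close small gaps
    n, _, stats, _ = cv2.connectedComponentsWithStats(mask)
    return [tuple(stats[i, :4]) for i in range(1, n)        # label 0 is background
            if stats[i, cv2.CC_STAT_AREA] >= min_area]
```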
4. Practical Implementation on Edge Hardware
The deployed platform comprises an Asahi PiTOMBO compound-eye camera and a Raspberry Pi 5 (BCM2712, 4 cores @2.4 GHz, 8 GB RAM), operating entirely without GPU resources.
Software Stack:
- MLP inference via ONNX Runtime
- Pre/post-processing via Python and OpenCV
- White-balance coefficient extraction per acquisition environment
Performance Profile:
- Resolution: 253×190 pixels (2×2 binning)
- Mean per-frame timings:
- RAW capture & preprocessing: 17.8 ms
- MLP inference: 13.4 ms
- Morphological and bounding box extraction: 9.5 ms
- Miscellaneous (I/O, overhead): 2.4 ms
- Total: 43.1 ms (23.2 fps end-to-end; a quick sanity check of these figures follows)
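A brief arithmetic check of the quoted per-stage means confirms the end-to-end figure:

```python
# Per-stage means quoted above, in milliseconds per frame.
stage_ms = {"capture+preprocess": 17.8, "mlp_inference": 13.4,
            "morphology+bbox": 9.5, "misc": 2.4}
total_ms = sum(stage_ms.values())        # 43.1 ms per frame
fps = 1000.0 / total_ms                  # ≈ 23.2 frames per second
print(f"{total_ms:.1f} ms/frame -> {fps:.1f} fps")
```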
This operational profile establishes the suitability of HitoMi-Cam for real-time, battery-powered deployment scenarios.
5. Quantitative Performance and Comparative Assessment
Metrics:
- Detections are matched to ground truth via an Intersection-over-Union (IoU) threshold (a minimal IoU sketch follows this list)
- Average precision (AP), precision, and recall reported as standard
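For reference, a minimal IoU helper is sketched below; the (x, y, w, h) box format mirrors the pipeline output above, and the threshold value itself is left unspecified here, as in the text.

```python
def iou(box_a, box_b):
    """Intersection-over-Union for two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix0, iy0 = max(ax, bx), max(ay, by)
    ix1, iy1 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# A prediction counts as a true positive when iou(pred, gt) exceeds the chosen
# threshold; precision, recall, and AP then follow in the standard way.
```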
Empirical Results (Table 1; values are AP per scenario):
| Model | General Scene | Simulated SAR Scene | Swing Scene |
|---|---|---|---|
| HitoMi-Cam | 0.340 | 0.935 | 0.957 |
| EfficientDet-L0 | 0.780 | 0.370 | 0.733 |
| MobileNet-V1 | 0.688 | 0.409 | 0.459 |
| YOLOv5n | 0.864 | 0.506 | 0.762 |
| YOLOv5s | 0.936 | 0.520 | 0.930 |
| YOLOv5m | 0.961 | 0.524 | 0.974 |
| YOLOv5l | 0.968 | 0.538 | 0.970 |
| YOLOv5x | 0.978 | 0.536 | 0.976 |
In “General Scene” (upright, canonical human poses), YOLOv5 variants and other CNNs outperform HitoMi-Cam due to their superior outline modeling, with HitoMi-Cam registering an AP of 0.34. In contrast, for “Simulated SAR Scene” — encompassing non-canonical postures, occlusions, and ground-placed garments — HitoMi-Cam (AP 0.935) surpasses all CNN baselines (best CNN AP 0.538). In “Swing Scene” (dynamic, blurred, extreme postures), HitoMi-Cam (AP 0.957) matches or exceeds the best CNNs and outperforms lightweight models by a substantial margin.
Across all scenarios, HitoMi-Cam yields the lowest absolute false positive count, benefiting from its physical-material discrimination criterion. Processing throughput remains high (23.2 fps), whereas the CNN baselines slow substantially when run on the same CPU-only hardware.
6. Failure Modes, Limitations, and Complementary Roles
Limitations and Failure Cases:
- Reliance on clothing as a proxy for persons fails in cases of minimal or unconventional garments.
- Spectral confusion arises with certain backgrounds (e.g., dry vegetation), causing misclassification due to similar reflectance.
- Performance is contingent on broadband daylight; efficacy under nighttime or atypical lighting remains untested.
- Environmental occlusion, such as mud or dust, negatively impacts reliability.
- Non-clothing objects (colored tarps, plastics) sharing spectral similarity may register as false positives.
- Pixelwise segmentation yields bounding boxes that systematically underestimate ground-truth object extents, resulting in lower Intersection-over-Union (IoU) than tightly outlined methods.
Known Failure Scenarios:
- In standard scenes, only partial body regions are segmented, and detections of nearby individuals can merge, leading to person undercounts.
Complementary Use-Cases:
- HitoMi-Cam is not proposed as a comprehensive replacement for CNN-based detectors but as an adjunct for environments with poorly-constrained object geometry, including SAR, complex poses, and rapid inference conditions.
- Candidate deployments include front-end filtering, providing region proposals to subsequent shape-based CNNs or human operators, enhancing overall detection robustness and computational efficiency (cascaded fusion).
- The system can be fused with active IR or thermal modalities for scenarios lacking shape cues (e.g., low light).
7. Conclusion
HitoMi-Cam establishes that a purely spectral, pixelwise learning-based classifier is capable of real-time person detection under physically challenging and shape-divergent conditions, operating at 23 fps on commodity edge hardware. The method achieves low false positive rates and high AP in SAR and extreme-pose scenarios, addressing limitations of conventional CNNs that depend on spatial priors. By functioning as a material-aware prefilter or complementary channel, HitoMi-Cam expands the operational range of automated person detection systems, particularly where shape unpredictability or computational constraints are decisive factors.