HitoMi-Cam: Spectral Person Detection
- HitoMi-Cam is a lightweight, shape-agnostic person detection system that leverages per-pixel spectral signatures from clothing for robust performance.
- It employs a two-tier architecture with offline spectral model training and online MLP inference on Raspberry Pi, achieving 93.5% AP at 23.2 fps in simulated SAR scenarios.
- The method complements conventional CNN detectors, minimizing false positives while remaining robust to unpredictable, extreme postures and requiring no GPU acceleration.
HitoMi-Cam is a lightweight, shape-agnostic person detection system that exploits the spectral reflectance properties of clothing, evaluated for real-time use on edge hardware without GPU acceleration. The method departs from standard convolutional neural network (CNN)-based object detectors, which typically depend on spatial or shape priors and therefore degrade when faced with postures or object geometries not represented in their training data. HitoMi-Cam instead leverages per-pixel spectral signatures, providing robust detection in search and rescue (SAR) and other scenarios characterized by unpredictable human shapes and occlusions. The approach runs at 23.2 frames per second (fps) at 253×190 pixel resolution and attains an average precision (AP) of 93.5% in simulated SAR scenarios, substantially higher than the best-performing CNN comparator (AP 53.8%). The system produces minimal false positives across all evaluated environments and serves as a complementary modality to established CNN pipelines.
1. System Architecture and Processing Pipeline
HitoMi-Cam comprises a two-tier architecture encompassing offline spectral model learning and online lightweight inference. The offline component operates on a GPU-equipped workstation, processing a hyperspectral dataset of 84 clothing textiles and 35 background materials, each sampled over 167 bands spanning the visible to near-infrared (VIS–NIR) spectrum.
Band Selection: A combinatorial search determines four optimal band centroids for discrimination: 457 nm, 565 nm, 645 nm, and 735 nm.

Classifier: The core model is a lightweight multilayer perceptron (MLP) with two hidden layers (16 and 8 ReLU-activated units, respectively) and a 49-dimensional softmax output representing clothing/background subclasses (41 effective classes). Training uses the Adam optimizer with early stopping, augmented by luminance scaling and background variation. The trained classifier is exported in ONNX format.
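As a rough illustration of the offline component, the sketch below builds the described MLP and exports it to ONNX. The layer sizes, Adam optimizer, and ONNX export come from the text; the choice of PyTorch, the file name, and the export arguments are assumptions.

```python
import torch
import torch.nn as nn

class SpectralMLP(nn.Module):
    """4 band luminances -> 16 -> 8 -> 49 class logits, as described above."""
    def __init__(self, n_bands: int = 4, n_classes: int = 49):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_bands, 16), nn.ReLU(),
            nn.Linear(16, 8), nn.ReLU(),
            nn.Linear(8, n_classes),   # softmax is applied at inference time
        )

    def forward(self, x):
        return self.net(x)

model = SpectralMLP()
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()      # operates on raw logits

# ... training loop over (4-band luminance, subclass label) pairs with
#     luminance-scaling / background augmentation and early stopping ...

# Export for on-device inference with ONNX Runtime (file name is illustrative).
dummy = torch.randn(1, 4)
torch.onnx.export(
    model, dummy, "hitomi_mlp.onnx",
    input_names=["bands"], output_names=["scores"],
    dynamic_axes={"bands": {0: "batch"}, "scores": {0: "batch"}},
)
```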
The online inference tier runs on a Raspberry Pi 5 paired with a PiTOMBO compound-eye 4-band camera, which integrates narrow bandpass filters for the aforementioned wavelengths (bandwidths: 457 ± 18, 565 ± 12, 645 ± 10.5, 735 ± 14.5 nm, field of view ~40°). The camera delivers 253×190 pixel frames (2×2 binning mode).
Processing steps:
- Raw frame acquisition
- White-balance calibration (coefficients per-scene via reference white plate)
- Extraction of per-pixel 4D luminance vectors
- ONNX Runtime-based per-pixel MLP inference yielding 49-class scores
- Clothing/background mask generation (via argmax for clothing classes)
- Noise suppression, morphological closing, connected component analysis yielding bounding-box outputs.
Pipeline Sketch:
```
load ONNX_model
while camera_streaming:
    raw4 = capture_4band()
    wb4  = white_balance(raw4, coeffs)
    mask = zeros(resolution)
    for each pixel p:
        x = wb4[p]                # 4-vector of band luminances
        s = MLP_ONNX.infer(x)     # 49-vector of class scores
        if argmax(s) ∈ clothing:
            mask[p] = 1
    mask  = remove_noise(mask)
    mask  = morphology_close(mask)
    boxes = find_connected_bboxes(mask)
    output(boxes)                 # each box reported with confidence = 1.0
```
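For concreteness, below is a minimal Python sketch of the mask-generation portion of this loop using ONNX Runtime and NumPy; the exported model path, the camera acquisition, and the set of clothing class indices are placeholders for the components described above. Batching all pixels into a single inference call avoids a slow per-pixel Python loop while remaining pixel-wise in effect.

```python
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("hitomi_mlp.onnx")       # model exported offline
input_name = sess.get_inputs()[0].name

def clothing_mask(raw4, wb_coeffs, clothing_classes):
    """raw4: (H, W, 4) raw band frame; wb_coeffs: (4,) white-balance weights."""
    h, w, _ = raw4.shape
    wb4 = raw4.astype(np.float32) * wb_coeffs        # per-channel calibration
    pixels = wb4.reshape(-1, 4)                      # one 4-vector per pixel
    scores = sess.run(None, {input_name: pixels})[0] # (H*W, 49) class scores
    labels = scores.argmax(axis=1).reshape(h, w)
    return np.isin(labels, clothing_classes).astype(np.uint8)
```

The resulting mask then passes through the noise suppression, morphological closing, and connected-component steps listed above.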
2. Physical and Signal Processing Foundations
HitoMi-Cam detection is rooted in the optical properties of textiles and their interaction with ambient illumination. Textile types (polyester, cotton, wool) exhibit distinctive reflectance spectra under outdoor lighting conditions in both visible and NIR regimes.
Measurement Model:
For each pixel $p$ and each filter $k$, the measured band value is

$$v_k(p) = \int E(\lambda; p)\, T_k(\lambda)\, Q(\lambda)\, d\lambda,$$

where $E(\lambda; p)$ is the spectral irradiance, $T_k(\lambda)$ the filter transmission, and $Q(\lambda)$ the sensor quantum efficiency. After application of empirically determined white-balance weights $w_k$, the channel luminances are $x_k(p) = w_k\, v_k(p)$.
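To make the calibration concrete, the sketch below estimates per-scene white-balance weights from a reference white plate; the rectangular plate region and the target level are assumed inputs, since the text only states that a reference plate is used per acquisition environment.

```python
import numpy as np

def estimate_wb_coeffs(raw4, plate_region, target=1.0):
    """raw4: (H, W, 4) raw band image; plate_region: (y0, y1, x0, x1) covering the plate."""
    y0, y1, x0, x1 = plate_region
    plate_mean = raw4[y0:y1, x0:x1].reshape(-1, 4).mean(axis=0)  # mean value per band
    return target / plate_mean   # w_k scales each band so the plate reads `target`
```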
Pixel-wise Classification:
Each pixel’s 4D vector is independently classified, eliminating any reliance on spatial continuity or global shape hypotheses.
3. Mathematical Formulations and Computational Algorithms
Spectral Sampling Approximation:
The continuous measurement integral is approximated by discrete sampling near each band center $\lambda_k$:

$$v_k(p) \approx \sum_j E(\lambda_j; p)\, T_k(\lambda_j)\, Q(\lambda_j)\, \Delta\lambda,$$

where $j$ indexes the discrete spectral samples and $\Delta\lambda$ is the sampling step.
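The toy NumPy sketch below illustrates this approximation; the irradiance, transmission, and quantum-efficiency curves are synthetic placeholders rather than measured data, and the band shown corresponds to the ~565 nm channel.

```python
import numpy as np

wavelengths = np.arange(400, 801, 5)                        # nm, 5 nm sampling step
delta = 5.0                                                 # nm per sample
E = np.random.rand(wavelengths.size)                        # stand-in spectral irradiance
T_k = np.exp(-((wavelengths - 565) / 12.0) ** 2)            # rough ~565 nm bandpass
Q = np.clip(1.0 - np.abs(wavelengths - 600) / 400.0, 0, 1)  # rough sensor QE curve

v_k = np.sum(E * T_k * Q) * delta                           # discrete band value
```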
MLP Classifier Structure:
The inferred class for pixel $p$ is $\hat{c}(p) = \arg\max_c \mathrm{softmax}\big(f_\theta(x(p))\big)_c$, with final decision mask $m(p) = 1$ if $\hat{c}(p)$ is a clothing subclass and $m(p) = 0$ otherwise. No spatial context is applied beyond subsequent morphological and connected-components postprocessing.
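As a concrete illustration of that postprocessing, the OpenCV sketch below turns the per-pixel clothing mask into bounding boxes; the kernel size and minimum-area threshold are illustrative choices, not values taken from the paper.

```python
import cv2
import numpy as np

def mask_to_boxes(mask, min_area=20):
    """mask: (H, W) uint8 clothing mask; returns a list of (x, y, w, h) boxes."""
    kernel = np.ones((3, 3), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)   # suppress speckle noise
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)  # close small gaps
    n, _, stats, _ = cv2.connectedComponentsWithStats(mask)
    return [tuple(stats[i, :4]) for i in range(1, n)        # label 0 is background
            if stats[i, cv2.CC_STAT_AREA] >= min_area]
```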
4. Practical Implementation on Edge Hardware
The deployed platform comprises an Asahi PiTOMBO compound-eye camera and a Raspberry Pi 5 (BCM2712, 4 cores @2.4 GHz, 8 GB RAM), operating entirely without GPU resources.
Software Stack:
- MLP inference via ONNX Runtime
- Pre/post-processing via Python and OpenCV
- White-balance coefficient extraction per acquisition environment
Performance Profile:
- Resolution: 253×190 pixels (2×2 binning)
- Mean per-frame timings:
- RAW capture & preprocessing: 17.8 ms
- MLP inference: 13.4 ms
- Morphological and bounding box extraction: 9.5 ms
- Miscellaneous (I/O, overhead): 2.4 ms
- Total: 43.1 ms (23.2 fps end-to-end; a quick sanity check of these figures follows)
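A brief arithmetic check of the quoted per-stage means confirms the end-to-end figure:

```python
# Per-stage means quoted above, in milliseconds per frame.
stage_ms = {"capture+preprocess": 17.8, "mlp_inference": 13.4,
            "morphology+bbox": 9.5, "misc": 2.4}
total_ms = sum(stage_ms.values())        # 43.1 ms per frame
fps = 1000.0 / total_ms                  # ≈ 23.2 frames per second
print(f"{total_ms:.1f} ms/frame -> {fps:.1f} fps")
```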
This operational profile establishes the suitability of HitoMi-Cam for real-time, battery-powered deployment scenarios.
5. Quantitative Performance and Comparative Assessment
Metrics:
- Detections are matched to ground truth via an Intersection-over-Union (IoU) threshold (a minimal IoU sketch follows this list)
- Average precision (AP), precision, and recall reported as standard
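For reference, a minimal IoU helper is sketched below; the (x, y, w, h) box format mirrors the pipeline output above, and the threshold value itself is left unspecified here, as in the text.

```python
def iou(box_a, box_b):
    """Intersection-over-Union for two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix0, iy0 = max(ax, bx), max(ay, by)
    ix1, iy1 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# A prediction counts as a true positive when iou(pred, gt) exceeds the chosen
# threshold; precision, recall, and AP then follow in the standard way.
```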
Empirical Results (Table 1; values are AP per scenario):
| Model | General Scene | Simulated SAR Scene | Swing Scene |
|---|---|---|---|
| HitoMi-Cam | 0.340 | 0.935 | 0.957 |
| EfficientDet-L0 | 0.780 | 0.370 | 0.733 |
| MobileNet-V1 | 0.688 | 0.409 | 0.459 |
| YOLOv5n | 0.864 | 0.506 | 0.762 |
| YOLOv5s | 0.936 | 0.520 | 0.930 |
| YOLOv5m | 0.961 | 0.524 | 0.974 |
| YOLOv5l | 0.968 | 0.538 | 0.970 |
| YOLOv5x | 0.978 | 0.536 | 0.976 |
In “General Scene” (upright, canonical human poses), YOLOv5 variants and other CNNs outperform HitoMi-Cam due to their superior outline modeling, with HitoMi-Cam registering an AP of 0.34. In contrast, for “Simulated SAR Scene” — encompassing non-canonical postures, occlusions, and ground-placed garments — HitoMi-Cam (AP 0.935) surpasses all CNN baselines (best CNN AP 0.538). In “Swing Scene” (dynamic, blurred, extreme postures), HitoMi-Cam (AP 0.957) matches or exceeds the best CNNs and outperforms lightweight models by a substantial margin.
Across all scenarios, HitoMi-Cam yields the lowest absolute false positive count, benefiting from its physical-material discrimination criterion. Processing throughput remains high (23.2 fps), whereas the CNN baselines slow substantially when run on the same CPU-only hardware.
6. Failure Modes, Limitations, and Complementary Roles
Limitations and Failure Cases:
- Reliance on clothing as a proxy for persons fails in cases of minimal or unconventional garments.
- Spectral confusion arises with certain backgrounds (e.g., dry vegetation), causing misclassification due to similar reflectance.
- Performance is contingent on broadband daylight; efficacy under nighttime or atypical lighting remains untested.
- Environmental occlusion, such as mud or dust, negatively impacts reliability.
- Non-clothing objects (colored tarps, plastics) sharing spectral similarity may register as false positives.
- Pixelwise segmentation yields bounding boxes that systematically underestimate ground-truth object extents, resulting in lower Intersection-over-Union (IoU) than tightly outlined methods.
Known Failure Scenarios:
- In standard scenes, only partial body regions are segmented, and detections of nearby individuals can merge, leading to person undercounts.
Complementary Use-Cases:
- HitoMi-Cam is not proposed as a comprehensive replacement for CNN-based detectors but as an adjunct for environments with poorly-constrained object geometry, including SAR, complex poses, and rapid inference conditions.
- Candidate deployments include front-end filtering, providing region proposals to subsequent shape-based CNNs or human operators, enhancing overall detection robustness and computational efficiency (cascaded fusion).
- The system can be fused with active IR or thermal modalities for scenarios lacking shape cues (e.g., low light).
7. Conclusion
HitoMi-Cam establishes that a purely spectral, pixelwise learning-based classifier is capable of real-time person detection under physically challenging and shape-divergent conditions, operating at 23 fps on commodity edge hardware. The method achieves low false positive rates and high AP in SAR and extreme-pose scenarios, addressing limitations of conventional CNNs that depend on spatial priors. By functioning as a material-aware prefilter or complementary channel, HitoMi-Cam expands the operational range of automated person detection systems, particularly where shape unpredictability or computational constraints are decisive factors.