Occupancy Detection Dataset Overview

Updated 31 August 2025

Occupancy detection datasets are curated collections of measurements from diverse sensors (e.g., CO₂, thermal, radar) used to estimate occupant presence and spatial configuration in various environments.
They support algorithm development and are typically paired with preprocessing, feature extraction, and synchronization methods that improve ML model performance.
Applications span smart energy management, healthcare monitoring, surveillance, and autonomous navigation, highlighting their practical impact and research significance.

Occupancy detection datasets are empirically grounded data collections designed to support the development, validation, and benchmarking of algorithms that estimate the presence, number, or precise spatial configuration of occupants—humans, objects, or vehicles—within indoor or outdoor environments. These datasets are foundational in a wide range of applications, including smart building energy management, real-time resource allocation, privacy-preserving surveillance, health monitoring, and autonomous navigation. The diversity in sensing modalities (e.g., temperature, CO₂, acoustic, camera, radar, accelerometer, smart meter data) and scale (from single rooms to city blocks) gives rise to a broad taxonomy of occupancy datasets, each imposing distinct requirements on data annotation, preprocessing, model design, and downstream evaluation.

1. Sensor Modalities and Data Collection Paradigms

Occupancy datasets span a spectrum of sensor arrangements and spatial resolutions, strongly influencing feature extraction and algorithm selection:

Multi-Sensor Environmental Datasets: Early and influential examples such as the classroom dataset built for inductive decision-tree classification (Jain et al., 2016) combined temperature and CO₂ sensors with frequency measurements used to compute reverberation time. Sampling at the typical lecture interval (every ~50 minutes) yielded one synchronized multi-feature vector per event, with reverberation time from the Sabine equation ( $T = 0.161\, V / (S a')$, where $V$ is the room volume, $S$ the total surface area, and $a'$ the average absorption coefficient) acting as the most discriminative feature for occupancy.
Single-Sensor Time Series: CD-HOC (Arief-Ang et al., 2017) advanced occupancy estimation using only carbon dioxide sensors by leveraging trend, seasonal, and irregular decomposition models and correcting time-lag effects via NRMSE optimization.
Smart Meter and Appliance Usage Data: Residential datasets such as those in (Lee et al., 2022, Luo et al., 2022, Liang et al., 2023) repurpose high-resolution power consumption, appliance state, and environmental sensor readings (temperature, humidity) to build occupancy classifiers using both supervised (SVM, autoencoder, FCN, transformer-RNN) and unsupervised learning approaches.
Vision-Based Occupancy: Datasets such as TIDOS (Cokbas et al., 2020), which relies on low-resolution thermal sensors, and those designed for camera-based parking lot and library seat occupancy (Duong et al., 2022, Yang et al., 2023) use annotated images and video streams. Object detectors (RetinaNet, Faster R-CNN, OcpDet) operate on bounding boxes, keypoints, or foreground masks, augmented in some cases by synthetic virtual images for robust scenario coverage.
Radar and Accelerometer-Based Datasets: UWB radar and smartphone-embedded accelerometers (Möderl et al., 2023, Pahar et al., 2022) extend occupancy detection to automotive and healthcare settings, respectively, emphasizing modality-specific preprocessing, removal of static background components, and high-sensitivity detection for noisy or bursty signals.
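The Sabine relation quoted above for reverberation time can be illustrated with a short calculation. The sketch below is only a worked example under stated assumptions: the room dimensions and the two absorption coefficients are hypothetical values chosen for illustration, not figures from (Jain et al., 2016), and the function name is invented here.

```python
def sabine_reverberation_time(volume_m3, surface_area_m2, mean_absorption):
    """Sabine estimate T = 0.161 * V / (S * a') in seconds (metric units)."""
    if volume_m3 <= 0 or surface_area_m2 <= 0 or mean_absorption <= 0:
        raise ValueError("volume, surface area, and absorption must be positive")
    return 0.161 * volume_m3 / (surface_area_m2 * mean_absorption)


# Hypothetical classroom of 10 m x 8 m x 3 m (assumed, not from the source).
volume = 10 * 8 * 3                       # 240 m^3
surface = 2 * (10 * 8 + 10 * 3 + 8 * 3)   # 268 m^2

# Assumed average absorption coefficients: people and clothing absorb sound,
# so the occupied room is given the larger value.
empty_room = sabine_reverberation_time(volume, surface, 0.15)
full_room = sabine_reverberation_time(volume, surface, 0.35)

print(f"empty: {empty_room:.2f} s, occupied: {full_room:.2f} s")
```

Because $T$ scales inversely with $S a'$, any added absorption shortens the reverberation time, which is what makes a fixed threshold on $T$ usable as an occupancy feature once the room geometry is known.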

Dataset assembly typically combines ground-truth annotation (manual, sensor-based, or event-driven), careful time synchronization, and frequent calibration. Privacy is addressed by favoring modalities such as thermal imaging or nonintrusive sensors, especially in sensitive domains.
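Time synchronization between ground-truth labels and sensor streams is often the first practical step in assembling such a dataset. The following is a minimal sketch, assuming timestamps in seconds and a nearest-reading policy with a tolerance; the function name `align_nearest`, the tolerance, and the sample values are hypothetical and not taken from any of the cited datasets.

```python
from bisect import bisect_left


def align_nearest(reference_times, sensor_times, sensor_values, tolerance_s):
    """For each reference timestamp, return the sensor value closest in time,
    or None when no reading lies within tolerance_s seconds.

    sensor_times must be sorted in ascending order.
    """
    aligned = []
    for t in reference_times:
        i = bisect_left(sensor_times, t)
        # Only the readings just before and just after t can be the nearest.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(sensor_times)]
        best = min(candidates, key=lambda j: abs(sensor_times[j] - t), default=None)
        if best is None or abs(sensor_times[best] - t) > tolerance_s:
            aligned.append(None)  # gap in the sensor stream
        else:
            aligned.append(sensor_values[best])
    return aligned


# Illustrative data: label timestamps every 60 s, irregular CO2 readings.
label_times = [0, 60, 120, 180]
co2_times = [2, 58, 125, 400]
co2_ppm = [410, 455, 520, 600]

rows = align_nearest(label_times, co2_times, co2_ppm, tolerance_s=10)
print(rows)
```

Returning `None` for unmatched timestamps keeps gaps explicit, so a later step can decide whether to drop, interpolate, or flag them.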

2. Feature Engineering and Preprocessing Strategies

Feature construction in occupancy datasets is anchored in the physical relevance and discriminative capacity of recorded signals:

Acoustic Features: Reverberation time, derived from the frequency response, is computed via the Sabine equation and then thresholded, with $T < 0.45$ s serving as the key discrimination cutoff (Jain et al., 2016).
Environmental and Appliance Features: Vectorization of per-device and per-sensor time series and their aggregation (e.g., 15-minute or hourly bins) support input to ML models. Dimensionality reduction via PCA or t-SNE aids in visual validation of class separability (Lee et al., 2022).
Seasonal and Trend Decomposition: Seasonal-trend decomposition (STD) and its Loess-based variant (STL) split CO₂ or similar time series into additive components aligned with occupancy cycles; feature alignment is verified using Pearson correlation and NRMSE (Arief-Ang et al., 2017).
Signal Processing for Radar/Accelerometer: Mean subtraction, stacking of real and imaginary parts (radar), windowed power spectra, RMS, crest factor, and kurtosis provide features tuned for both brief events and long, low-activity intervals (Möderl et al., 2023, Pahar et al., 2022).
Vision Features: Learned convolutions, spatial activation masks, and attention-oriented architectures (parallel attention, 3D U-Net, MaskFormer) (Luo et al., 2022, Aung et al., 18 Dec 2024, Jia et al., 2023) address the inherent challenges of occlusion, variable scale, and real-versus-synthetic domain adaptation.
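The windowed statistics listed above for radar and accelerometer signals (mean subtraction, RMS, crest factor, kurtosis) can be sketched in a few lines. This is an illustrative implementation under assumptions made here: the window size, step, and sample signals are hypothetical, kurtosis is the non-excess fourth standardized moment, and the cited works may define or normalize these features differently.

```python
import math


def window_features(samples):
    """Mean-subtract one window, then compute RMS, crest factor, and kurtosis."""
    n = len(samples)
    if n == 0:
        raise ValueError("window must contain at least one sample")
    mean = sum(samples) / n
    centered = [x - mean for x in samples]  # removes the static offset
    rms = math.sqrt(sum(x * x for x in centered) / n)
    if rms == 0.0:
        return {"rms": 0.0, "crest_factor": 0.0, "kurtosis": 0.0}
    peak = max(abs(x) for x in centered)
    fourth_moment = sum(x ** 4 for x in centered) / n
    return {
        "rms": rms,
        "crest_factor": peak / rms,            # large for short bursts
        "kurtosis": fourth_moment / rms ** 4,  # non-excess kurtosis
    }


def sliding_windows(signal, size, step):
    """Yield consecutive windows of `size` samples, advancing by `step`."""
    for start in range(0, len(signal) - size + 1, step):
        yield signal[start:start + size]


# Illustrative windows: a steady oscillation and a single sharp burst.
quiet_stats = window_features([1.0, -1.0, 1.0, -1.0])
burst_stats = window_features([0.0] * 9 + [10.0])
print(quiet_stats)
print(burst_stats)
```

Crest factor and kurtosis both rise sharply for the burst window, which is why they complement RMS when brief events must be separated from long, low-activity intervals.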

Preprocessing extends to temporal synchronization, normalization, noise filtering (e.g., Markov random field smoothing in TIDOS (Cokbas et al., 2020)), class balancing (random oversampling for highly imbalanced classes), and the construction of auxiliary data structures (occupancy maps, mask images, object-centric RoIs).
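Random oversampling, mentioned above as a class-balancing step, duplicates minority-class rows until every class matches the size of the largest one. The sketch below is one simple way to do this with the standard library; the function name, seed, and toy feature rows are assumptions for illustration only.

```python
import random
from collections import Counter, defaultdict


def random_oversample(features, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes
    have as many rows as the largest class."""
    rng = random.Random(seed)  # fixed seed keeps the resampling reproducible
    by_class = defaultdict(list)
    for x, y in zip(features, labels):
        by_class[y].append(x)
    target = max(len(rows) for rows in by_class.values())

    out_features, out_labels = [], []
    for y, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for x in rows + extra:
            out_features.append(x)
            out_labels.append(y)
    return out_features, out_labels


# Toy example: six "empty" rows versus two "occupied" rows.
X = [[400 + i] for i in range(6)] + [[650], [700]]
y = ["empty"] * 6 + ["occupied"] * 2

X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

Oversampling is normally applied to the training split only; duplicating rows before the train/test split would leak copies of the same sample into evaluation.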

3. Annotation, Benchmarking, and Novel Dataset Types

Advancements in occupancy research have motivated the creation of several public datasets and benchmarking protocols with increasingly complex annotations:

Ego-Centric vs. Object-Centric Datasets: Traditional datasets focus on fixed egocentric occupancy grids at coarse resolution (e.g., 3D voxel grids in OpenOccupancy, Occ3D (Wang et al., 2023, Tian et al., 2023)), with voxels labeled either by traversability or semantic class (e.g., 17 classes in nuScenes-Occupancy). Recent innovations create dynamic, object-centric occupancy datasets with high resolution by aggregating temporally-aligned LiDAR points within moving bounding boxes, addressing the challenge of spatial aliasing and “jaggedness” encountered in scene-level extraction (Zheng et al., 6 Dec 2024).
Synthetic Data and Anomaly Augmentation: Modern approaches recognize the scarcity of naturally occurring out-of-distribution (OoD) events and exploit synthetic generation pipelines (e.g., Physics-Guided Anomaly Synthesis Pipeline integrating image generation and depth estimation (Zhang et al., 26 Jun 2025)) to create datasets (VAA-KITTI, VAA-KITTI-360) for OoD semantic occupancy prediction. Such datasets offer controlled evaluation on hybrid real-synthetic classes, supporting hard negative mining and robustness evaluation.
Multi-Scale and Multi-View Datasets: Datasets like MVP-Occ (Aung et al., 18 Dec 2024) leverage virtual world simulation (CARLA), rendering multiple urban scenes at high frame rates and labeling at voxel-level across five semantic classes (Free, Pedestrian, Ground, Wall, Others), as well as providing 2D ground-plane annotations for pedestrian location and occupancy.
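The contrast between a coarse scene-level grid and a finer object-centric grid can be made concrete with a small voxelization sketch. It is a simplification under stated assumptions: the box is axis-aligned (no rotation), points are already expressed in a common frame, temporal aggregation is omitted, and all coordinates, voxel sizes, and grid shapes are hypothetical rather than values from the cited datasets.

```python
import math


def voxelize(points, origin, voxel_size, grid_shape):
    """Return the set of (i, j, k) voxel indices containing at least one point.

    Points that fall outside the grid are ignored.
    """
    occupied = set()
    for point in points:
        index = tuple(
            math.floor((coord - start) / voxel_size)
            for coord, start in zip(point, origin)
        )
        if all(0 <= i < n for i, n in zip(index, grid_shape)):
            occupied.add(index)
    return occupied


# Illustrative LiDAR-like points (metres); the last one lies far from the object.
points = [
    (1.25, 0.45, 0.15),
    (1.35, 0.45, 0.25),
    (1.95, 0.85, 0.35),
    (5.00, 5.00, 5.00),
]

# Coarse scene-level grid: 0.5 m voxels anchored at the sensor origin.
coarse = voxelize(points, origin=(0.0, 0.0, 0.0), voxel_size=0.5, grid_shape=(4, 4, 4))

# Finer object-centric grid: 0.1 m voxels anchored at an assumed box corner.
fine = voxelize(points, origin=(1.0, 0.0, 0.0), voxel_size=0.1, grid_shape=(10, 10, 10))

print(len(coarse), len(fine))
```

In the coarse grid the first two points collapse into a single voxel, whereas the object-centric grid keeps them apart, which is the resolution gain the object-centric datasets aim for.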

Benchmarking now covers not only in-distribution occupancy detection but also OoD anomaly detection, evaluated with precision-recall metrics computed within spatial tolerances.

4. Model Development, Training, and Evaluation Protocols

Occupancy detection datasets serve as training corpora for supervised and unsupervised ML models, as well as performance evaluation targets. Typical pipeline elements include:

Model Architectures: Classical decision trees (ID3 in KNIME (Jain et al., 2016)), SVMs and MLPs, deep convolutional object detectors (RetinaNet, Faster R-CNN, MaskFormer), and transformer/RNN hybrids (Liang et al., 2023), with special layers for attention (e.g., parallel attention in ABODE-Net (Luo et al., 2022)) and spatial consistency (spatial estimator modules in OcpDet (Duong et al., 2022)).
Temporal Information and Memory: LSTM networks (accelerometer-based detection (Pahar et al., 2022)), causal transformers, and memory propagation strategies provide robustness to transient detection errors.
Loss Functions: Weighted cross-entropy, Lovász-Softmax (for improved IoU on rare classes), affinity-based scene geometry/semantic loss, and custom error terms (e.g., $L_\mathrm{size}$ for bounding box tightness (Duong et al., 2022)) optimize the learning process for class imbalance, spatial consistency, and fine segmentation.
Validation and Metrics: Multi-fold cross validation (e.g., 7-fold in (Jain et al., 2016)), Mean Absolute Error, windowed count-change correct classification rate ( $\mathrm{CCR_{WCC}}$ (Cokbas et al., 2020)), AUC for occupancy events (accelerometer and radar), average precision (AP), mean IoU ( $\mathrm{mIoU}$ ), panoptic quality (PQ), and OoD metrics (AuROC, region-tolerant $\mathrm{AuPRC}_r$ (Zhang et al., 26 Jun 2025)).
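Weighted cross-entropy, listed among the loss functions above, can be sketched as follows. The inverse-frequency weighting scheme and the normalization by the sum of weights are one common convention assumed here; the cited works may choose class weights differently, and the function names and toy labels are hypothetical.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n / (k * count), so rare classes weigh more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


def weighted_cross_entropy(probabilities, labels, weights):
    """Weighted mean of -log p(true class), normalized by the sum of weights.

    probabilities: one dict per sample mapping class -> predicted probability.
    """
    total, norm = 0.0, 0.0
    for probs, label in zip(probabilities, labels):
        w = weights[label]
        total += -w * math.log(max(probs[label], 1e-12))  # clamp avoids log(0)
        norm += w
    return total / norm


# Toy imbalanced label set: 8 "empty" samples versus 2 "occupied".
labels = ["empty"] * 8 + ["occupied"] * 2
weights = inverse_frequency_weights(labels)

# A model that always predicts "empty" with probability 0.9.
always_empty = [{"empty": 0.9, "occupied": 0.1}] * len(labels)
plain = weighted_cross_entropy(always_empty, labels, {"empty": 1.0, "occupied": 1.0})
weighted = weighted_cross_entropy(always_empty, labels, weights)
print(weights, plain, weighted)
```

With uniform weights the always-empty predictor looks acceptable; the weighted loss is markedly higher because its errors on the rare occupied class count for more.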

Evaluation is typically conducted at both the component and system levels (e.g., per-occupant error, event detection, and IoU of completed shapes in object-centric occupancy completion (Zheng et al., 6 Dec 2024)).
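Two of the metrics named in this section, mean IoU over semantic classes and mean absolute error on occupant counts, are simple enough to state in code. The sketch below uses flat label lists for voxels and is meant only as an illustration; skipping classes absent from both prediction and ground truth is an assumption made here, and benchmark implementations may handle that case differently.

```python
def mean_iou(predicted, truth, classes):
    """Average per-class intersection-over-union across flat label lists.

    Classes absent from both prediction and ground truth are skipped.
    """
    ious = []
    for cls in classes:
        inter = sum(1 for p, t in zip(predicted, truth) if p == cls and t == cls)
        union = sum(1 for p, t in zip(predicted, truth) if p == cls or t == cls)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")


def mean_absolute_error(predicted_counts, true_counts):
    """Average absolute difference between predicted and true occupant counts."""
    pairs = list(zip(predicted_counts, true_counts))
    return sum(abs(p - t) for p, t in pairs) / len(pairs)


# Toy voxel labels (class names chosen for illustration).
truth = ["free", "free", "wall", "pedestrian"]
pred = ["free", "wall", "wall", "pedestrian"]
miou = mean_iou(pred, truth, classes=["free", "wall", "pedestrian"])

# Toy per-interval occupant counts.
mae = mean_absolute_error([3, 0, 2], [2, 0, 4])
print(miou, mae)
```

Averaging IoU per class rather than per voxel keeps rare classes such as pedestrians from being swamped by the much larger free-space class.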

5. Applications, Generalization, and Limitations

Occupancy datasets have enabled broad applications:

Smart Building Systems: Real-time energy management via room-level occupancy (automated HVAC and lighting control), space utilization, and privacy-preserving access control (thermal or smart meter-based approaches (Luo et al., 2022, Cokbas et al., 2020)).
Healthcare Monitoring: Non-intrusive accelerometer-based bed occupancy detection for disease monitoring (e.g., TB cough (Pahar et al., 2022)) and elderly care.
Autonomous Navigation and Urban Surveillance: High-resolution, multi-view, and panoptic occupancy datasets support robust motion planning, collision avoidance, and crowd analysis in urban environments, particularly through synthetic and large-scale datasets (Aung et al., 18 Dec 2024, Wang et al., 2023, Tian et al., 2023).
Parking and Library Management: Domain-specific detectors and datasets, often leveraging transfer learning and synthetic data, have driven advances in scalable infrastructure management (Duong et al., 2022, Yang et al., 2023).
Out-of-Distribution Detection and Scene Understanding: Recent benchmarks and datasets challenge models to identify and exclude anomalies, critical for robust deployment in dynamic environments (Zhang et al., 26 Jun 2025).

Generalization hinges on accurate sensor calibration, feature selection, and model retraining for new physical or environmental conditions—especially for reverberation time, which must be recomputed for each architectural context (Jain et al., 2016). Limitations commonly include data sparsity for rare events, sensor noise and drift, annotation errors, challenges in managing domain adaptation between synthetic and real data, and computational constraints for dense 3D methods.

6. Future Directions and Research Implications

The evolution of occupancy detection datasets is characterized by increased resolution, multi-modal integration, and a deliberate expansion into complex settings, such as out-of-distribution safety, synthetic-to-real domain adaptation, and panoptic 3D scene labeling. Methodological innovations such as object-centric implicit occupancy completion (Zheng et al., 6 Dec 2024), progressive geometry-semantic fusion (Zhang et al., 26 Jun 2025), and multi-resolution benchmarking frameworks (Wang et al., 2023, Tian et al., 2023) point to a sustained trend toward richer, more generalizable data resources.

A plausible implication is that continued integration of synthetic anomaly data, self-supervised learning techniques, and hybrid object/activity-centric labeling will play a pivotal role in advancing both robustness and interpretability of occupancy detection models across domains. There remains a persistent demand for open, standardized datasets—especially those supporting real-time, privacy-preserving, and cross-modal occupancy estimation—for rigorous comparison and reproducibility in research and applied system deployment.