HomeSafeBench: Dynamic Home Safety Benchmark
- HomeSafeBench is a benchmark for evaluating embodied vision-language models in domestic safety inspections by leveraging simulated environments and multi-action exploration paradigms.
- It comprises 12,900 dynamically generated samples covering five hazard categories in residential scenarios such as kitchens, bathrooms, and living rooms, built from precise location annotations and rule-based hazard placement.
- Evaluation using precision, recall, and F1 score reveals that current models lag far behind human inspectors, highlighting key research opportunities in embodied AI.
HomeSafeBench is a large-scale benchmark developed expressly to evaluate the capabilities of embodied Vision-Language Models (VLMs) in the context of home safety inspection. Addressing fundamental limitations in prior work, HomeSafeBench provides 12,900 dynamically generated inspection samples across common residential environments, supports agent-driven, multi-perspective free exploration, and employs rigorous precision-recall-based metrics for hazard identification. The benchmark exposes severe deficiencies in current VLMs: the best-performing model reaches an F1-score of only 10.23%, far behind human inspection performance. HomeSafeBench aims to guide future research on embodied vision-language agents in safety-critical domestic inspection and related embodied AI tasks (Gao et al., 28 Sep 2025).
1. Benchmark Construction and Objectives
HomeSafeBench is designed to evaluate embodied VLMs under ecologically valid conditions that reflect the complexity and variability of real-world home safety inspections. The benchmark addresses two technical shortcomings of previous datasets: (1) reliance on text-only scene descriptions, which abstract away critical spatial and visual features, and (2) use of a single, fixed observational viewpoint, which occludes or omits hazards not visible from that perspective. HomeSafeBench uses the VirtualHome simulation engine to generate dynamic, first-person images for agents operating freely within the environment. Each inspection task is constructed as a distinct sample, with multiple safety hazards distributed throughout spaces such as kitchens, bathrooms, bedrooms, and living rooms.
2. Dataset Composition and Hazard Categories
The benchmark encapsulates 12,900 inspection samples, each consisting of a simulated environment containing potential safety hazards. Hazards are selected from five representative residential categories:
| Hazard Category | Example Scenario | Annotation Notes |
|---|---|---|
| Fire Hazard | Flammable item near heat source | Location + object attribute |
| Electric Shock | Appliance in contact with water | Sink, tub, countertop |
| Falling Object | Items placed at shelf edges or atop furniture | Height and proximity flags |
| Trip Hazard | Objects on walking surfaces | Floor-level annotations |
| Child Safety | Dangerous items accessible to children | Reachability, type |
Hazard representation is achieved via an initial annotation step that marks candidate hazard locations within each room (e.g., stove, sink, shelf). Subsequently, objects are assigned attributes according to predefined criteria (flammable, electrical, trip-risk, etc.), and samples are generated with a rule-based process that places hazard-congruent objects in annotated locations. Human annotation and cross-validation procedures ensure fidelity in spatial configurations and hazard categorization.
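A minimal sketch of how such a rule-based placement step could operate is shown below. The rule table, attribute names, and the `place_hazards` helper are illustrative assumptions keyed to the five categories above, not the authors' released pipeline.

```python
import random
from dataclasses import dataclass

# Illustrative rules keyed to the benchmark's five hazard categories:
# each category pairs a predefined object attribute with congruent sites.
HAZARD_RULES = {
    "fire":           {"attr": "flammable",  "sites": {"stove", "heater"}},
    "electric_shock": {"attr": "electrical", "sites": {"sink", "tub", "countertop"}},
    "falling_object": {"attr": "heavy",      "sites": {"shelf_edge", "cabinet_top"}},
    "trip":           {"attr": "trip_risk",  "sites": {"floor", "doorway"}},
    "child_safety":   {"attr": "dangerous",  "sites": {"low_table", "floor"}},
}

@dataclass
class Hazard:
    category: str
    obj: str
    site: str

def place_hazards(annotated_sites, object_pool, n_hazards, rng=random):
    """Pair hazard-congruent objects with annotated candidate locations."""
    hazards = []
    for cat in rng.sample(sorted(HAZARD_RULES), k=min(n_hazards, len(HAZARD_RULES))):
        rule = HAZARD_RULES[cat]
        # Objects whose predefined attribute satisfies this category's rule.
        objs = [o for o, attrs in object_pool.items() if rule["attr"] in attrs]
        sites = [s for s in annotated_sites if s in rule["sites"]]
        if objs and sites:
            hazards.append(Hazard(cat, rng.choice(objs), rng.choice(sites)))
    return hazards

# Example: a kitchen with two annotated sites and a small object pool.
pool = {"dish_towel": {"flammable"}, "toaster": {"electrical", "heavy"}}
print(place_hazards({"stove", "sink"}, pool, n_hazards=2))
```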
3. Agent Exploration Paradigm
Unlike static-view benchmarks, HomeSafeBench empowers agents with dynamic free exploration capabilities. The agent receives sequences of first-person images and executes navigation primitives such as move-forward, turn-left, turn-right, and look-up. This design enforces two requirements: agents must employ spatial reasoning and action planning to strategically traverse the environment, and they must integrate visual perception over multi-step interactions to discover occluded or partially-visible hazards. The inspection protocol mirrors real-world inspection procedures where comprehensive coverage is possible only through active exploration.
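The exploration loop can be pictured as follows. This is a schematic sketch only: the `Decision` type, the `env.render()`/`env.step()` wrapper, and the agent's `act()` interface are hypothetical stand-ins, not the benchmark's actual VirtualHome API.

```python
from typing import NamedTuple, Optional

# Navigation primitives named in the benchmark's action space.
ACTIONS = ["move_forward", "turn_left", "turn_right", "look_up"]

class Decision(NamedTuple):
    kind: str                       # "move", "report", or "stop"
    action: Optional[str] = None    # one of ACTIONS when kind == "move"
    hazard: Optional[tuple] = None  # (category, description) when reporting

def run_inspection(env, agent, max_steps=50):
    """Multi-step inspection: alternate perception, action, and reporting."""
    observation = env.render()  # first-person RGB frame from the simulator
    reported = set()
    for _ in range(max_steps):
        decision = agent.act(observation)  # VLM chooses its next decision
        if decision.kind == "report":
            reported.add(decision.hazard)  # e.g. ("fire", "towel on stove")
        elif decision.kind == "move":
            observation = env.step(decision.action)  # new viewpoint
        else:
            break  # agent judges coverage complete
    return reported
```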
A plausible implication is that the dynamic exploration paradigm leads to more representative difficulty profiles for embodied models, as hazards are intentionally positioned to be partially occluded or only visible from certain vantage points.
4. Evaluation Protocols and Metrics
HomeSafeBench quantifies model performance through set-based comparison of the hazards reported by the model ($\hat{H}$) against the ground-truth hazards ($H$) in each inspection sample. The evaluation protocol utilizes a rule-based matching system to accommodate synonym variations in object names. Three primary metrics are computed:
- Precision: $P = \frac{|\hat{H} \cap H|}{|\hat{H}|}$
- Recall: $R = \frac{|\hat{H} \cap H|}{|H|}$
- F1 Score: $F_1 = \frac{2PR}{P + R}$
Metrics are micro-averaged over the entire dataset and across room and hazard categories, supporting both per-sample and aggregate benchmarking.
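A minimal sketch of micro-averaged scoring under these definitions appears below; the `SYNONYMS` table and helper names are illustrative assumptions, not the benchmark's released scorer.

```python
# Illustrative synonym table standing in for the rule-based matcher.
SYNONYMS = {"cooktop": "stove", "bathtub": "tub"}

def normalize(name: str) -> str:
    """Resolve synonym variations in object names before set comparison."""
    key = name.lower().strip()
    return SYNONYMS.get(key, key)

def micro_scores(samples):
    """Micro-averaged P/R/F1 over (reported, ground_truth) hazard-set pairs."""
    tp = n_reported = n_gold = 0
    for reported, gold in samples:
        r = {normalize(h) for h in reported}
        g = {normalize(h) for h in gold}
        tp += len(r & g)  # correctly identified hazards
        n_reported += len(r)
        n_gold += len(g)
    precision = tp / n_reported if n_reported else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: the model finds one of two ground-truth hazards in one sample.
samples = [({"knife on low table"}, {"knife on low table", "towel near stove"})]
print(micro_scores(samples))  # (1.0, 0.5, 0.666...)
```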
5. Empirical Evaluation and Model Performance
Comprehensive benchmarking across mainstream VLMs revealed pronounced shortcomings in both hazard identification and navigation. The best model (Gemma3-12B) attained an F1 score of 10.23%. By comparison, human inspectors operating on the same dataset reached a micro-average F1 score of 75.36%. Models frequently underreported hazards, incorrectly categorized hazards, or failed to select navigation actions that provided optimal coverage. Extended multi-turn interactions led to degradation in inspection efficiency, evidenced by declining precision and recall over longer action sequences.
Room-level analysis indicated models performed better in confined environments (e.g., bathrooms) and poorly in areas with high object density (e.g., living rooms). This suggests that scene complexity poses compounded challenges to visual understanding and action planning in embodied settings.
6. Implications for Embodied Vision-Language Research
HomeSafeBench exposes critical gaps in the capabilities of embodied VLMs for safety-sensitive inspection tasks. Subpar performance in both hazard perception and exploration strategy demonstrates that current approaches do not generalize well to high-dimensional, cluttered, or dynamically interactive environments. These findings carry two major implications: (i) progress in embodied perception must address multi-perspective scene parsing, robust object-hazard association, and long-horizon navigation strategies; (ii) evaluation of VLMs for domestic robotics and safety-critical domains must be performed under conditions reflective of true physical complexity, not abstracted textual or static paradigms.
A plausible implication is that advances in navigation planning (possibly through reinforcement or continual learning) and in vision-language scene reasoning will be necessary for embodied agents to approach or surpass human-level inspection competency.
7. Future Research Directions
HomeSafeBench highlights several axes for future research:
- Development of navigation and exploration policies that optimize environment coverage and hazard discovery over multi-turn interactions, mitigating efficiency degradation.
- Enhancement of hazard identification by integrating more sophisticated vision-language reasoning architectures capable of associating observed objects with probable hazards given visual context.
- Investigation into transfer learning from simulation to real-world environments, enabling agents tested on HomeSafeBench to generalize to physical home inspections with minimal domain adaptation.
- Exploration of continual learning strategies to support persistent improvement as agents encounter new room layouts and novel hazard types over time.
These directions are directly motivated by the deficiencies revealed in the benchmark and are predicted to be central to the progression of embodied AI for domestic safety inspection (Gao et al., 28 Sep 2025).
In summary, HomeSafeBench constitutes a comprehensive, dynamic benchmark for embodied VLM evaluation in free-exploration home safety inspection. Its scale, methodological intricacy, and robust performance assessment framework establish rigorous baselines while sharply outlining the limitations of contemporary models, serving as a foundation for future embodied safety research.