HICO-DET Dataset for HOI Detection

Updated 9 October 2025
  • HICO-DET is a large-scale dataset offering detailed instance-level annotations that link human and object bounding boxes with explicit HOI labels.
  • It comprises 600 diverse HOI categories across 47,776 images, employing mAP evaluation and strict IoU criteria for performance measurement.
  • The dataset fuels innovation in HOI detection methodologies by addressing challenges like systematic generalization and class imbalance through balanced splits and synthetic augmentation.

The HICO-DET dataset is a large-scale, exhaustively annotated benchmark specifically designed for advancing the detection of human–object interactions (HOIs) in static images. It comprises structured instance-level labels that link human and object bounding boxes with explicit interaction class annotations. Developed by augmenting an earlier image-level HOI classification benchmark, HICO-DET catalyzed significant progress in fine-grained understanding of human-object activity in visual scenes and has become foundational for model development, evaluation, and systematization in HOI research.

1. Dataset Construction and Annotation Protocol

HICO-DET is built upon the original HICO classification dataset, expanding it with exhaustive instance-level HOI annotations. The dataset contains 47,776 images, split into dedicated training and test sets. The annotation scheme covers 600 distinct HOI categories, encompassing 80 common object classes and capturing a wide variety of real-world human activities.

Annotation is performed in a three-step procedure for each interaction:

  1. Drawing bounding boxes around each person involved in a relevant interaction.
  2. Drawing bounding boxes around the corresponding interacted object(s).
  3. Linking each person–object pair to an interaction class label (e.g., "riding a bicycle").

On average, each positive image annotation includes 1.67 interaction instances, and each HOI label links to approximately 2.83 distinct person/object bounding boxes, reflecting situations where individuals or objects participate in multiple interactions. The dataset collectively provides 90,641 positive HOI labels and approximately 151,276 annotated HOI instances.
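
For concreteness, the sketch below shows one way such instance-level annotations can be represented in memory; the class names, field names, and example index values are illustrative and do not reflect the official annotation file layout.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixel coordinates

@dataclass
class HOIInstance:
    """One annotated interaction: a human box, an object box, and an HOI class."""
    human_box: Box
    object_box: Box
    hoi_category: int   # index into the 600 HOI categories (illustrative value below)
    object_class: int    # index into the 80 object classes (illustrative value below)
    verb: str            # e.g. "ride"

@dataclass
class ImageAnnotation:
    """All positive HOI instances annotated for a single image."""
    filename: str
    instances: List[HOIInstance] = field(default_factory=list)

# Example: one person riding a bicycle (indices are hypothetical, not official).
ann = ImageAnnotation(
    filename="HICO_train2015_00000001.jpg",
    instances=[
        HOIInstance(human_box=(48, 30, 210, 380),
                    object_box=(60, 180, 260, 400),
                    hoi_category=87, object_class=1, verb="ride"),
    ],
)
```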

HICO-DET distinguishes itself from predecessors by both the breadth of interaction categories and the exhaustiveness of its instance-level labeling. Earlier datasets either covered a much narrower scope of HOI classes or did not comprehensively label all interactions present per scene (Chao et al., 2017).

2. Dataset Structure and Evaluation Metrics

Instances in HICO-DET are defined by paired bounding boxes for the human and object, accompanied by the interaction class. The standard criterion for a correct detection is given by:

$$\min(\mathrm{IoU}_h, \mathrm{IoU}_o) > 0.5$$

where $\mathrm{IoU}_h$ and $\mathrm{IoU}_o$ are the Intersection over Union values of the predicted human and object bounding boxes with their respective ground-truth boxes.
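
A minimal sketch of this matching rule is shown below, assuming boxes as (x1, y1, x2, y2) tuples; the per-category greedy assignment of detections to ground truth and the interaction-class check used in the full evaluation are omitted.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred_h: Box, pred_o: Box, gt_h: Box, gt_o: Box,
                     thresh: float = 0.5) -> bool:
    """HICO-DET matching rule: both the human and the object box must overlap
    their ground-truth counterparts with IoU above the threshold."""
    return min(iou(pred_h, gt_h), iou(pred_o, gt_o)) > thresh
```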

Evaluation of detection performance is primarily based on mean Average Precision (mAP), calculated under two main scenarios:

  • Default Setting: All test images, including those without the target object.
  • Known Object Setting: Only images known to contain the relevant object class.

mAP is reported over the full set of 600 HOI categories, as well as separately for "Rare" (fewer than 10 training instances) and "Non-Rare" categories.
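
The Rare/Non-Rare partition can be derived directly from training-set counts; the helper below is a small sketch that assumes a flat stream of training HOI category labels.

```python
from collections import Counter
from typing import Iterable, List, Tuple

def split_rare_nonrare(train_hoi_labels: Iterable[int],
                       rare_threshold: int = 10) -> Tuple[List[int], List[int]]:
    """Partition HOI categories into Rare (fewer than `rare_threshold` training
    instances) and Non-Rare, as in the standard HICO-DET evaluation protocol."""
    counts = Counter(train_hoi_labels)
    rare = sorted(c for c, n in counts.items() if n < rare_threshold)
    non_rare = sorted(c for c, n in counts.items() if n >= rare_threshold)
    return rare, non_rare

# Toy example: category 3 is Rare, categories 1 and 2 are Non-Rare.
# rare, non_rare = split_rare_nonrare([1] * 40 + [2] * 15 + [3] * 4)
```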

3. Influence on Model Architectures and Methodologies

HICO-DET enabled the development of novel model paradigms, notably the Human-Object Region-based Convolutional Neural Network (HO-RCNN; (Chao et al., 2017)), which introduced the "Interaction Pattern" representation. This dual-channel binary image encodes the spatial configuration of the human and object bounding boxes within a scene's attention window, providing translation invariance and capturing the spatial cues characteristic of HOIs. The Interaction Pattern stream, implemented via convolutional layers, yielded significant increases in detection mAP. Additional refinement, incorporating object detection scores, further improved precision by discounting likely false positives.
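
A simplified sketch of such a dual-channel encoding appears below; the fixed 64x64 rasterization is an assumption, and details of the original formulation (e.g., aspect-ratio handling and padding) are omitted.

```python
import numpy as np

def interaction_pattern(human_box, object_box, size=64):
    """Rasterize a dual-channel binary interaction pattern: channel 0 marks the
    human box, channel 1 the object box, both within the attention window
    (the tightest box enclosing the pair), resampled to size x size."""
    hx1, hy1, hx2, hy2 = human_box
    ox1, oy1, ox2, oy2 = object_box
    wx1, wy1 = min(hx1, ox1), min(hy1, oy1)   # attention window, top-left
    wx2, wy2 = max(hx2, ox2), max(hy2, oy2)   # attention window, bottom-right
    sx = size / max(wx2 - wx1, 1e-6)
    sy = size / max(wy2 - wy1, 1e-6)

    pattern = np.zeros((2, size, size), dtype=np.float32)

    def fill(channel, x1, y1, x2, y2):
        c1, c2 = int(round((x1 - wx1) * sx)), int(round((x2 - wx1) * sx))
        r1, r2 = int(round((y1 - wy1) * sy)), int(round((y2 - wy1) * sy))
        pattern[channel, max(r1, 0):min(r2, size), max(c1, 0):min(c2, size)] = 1.0

    fill(0, hx1, hy1, hx2, hy2)   # human channel
    fill(1, ox1, oy1, ox2, oy2)   # object channel
    return pattern
```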

The following table summarizes the dataset's key annotation and evaluation properties:

| Property | Description | Average/Total Value |
|---|---|---|
| Images | Number of annotated images | 47,776 |
| HOI categories | Distinct interaction types | 600 |
| Object classes | Unique object classes | 80 |
| HOI labels (instances) | Annotated HOI instances | ~151,276 |
| Avg. HOIs per image | Avg. interaction instances per positive image | 1.67 |
| Avg. boxes per HOI | Avg. bounding boxes per positive HOI label | 2.83 |
| Scoring metric | Detection evaluation criterion | $\min(\mathrm{IoU}_h, \mathrm{IoU}_o) > 0.5$ |

4. Systematic Generalization and Alternative Data Splits

Subsequent analyses identified systematic generalization as a major challenge for HOI models. The HICO-DET-SG split (Takemoto et al., 2023) enforces strict separation of object–interaction combinations between train and test sets: from the 600 possible HOI combinations, 540 are used during training, and the remaining 60 reserved exclusively for testing. This split challenges models to generalize to unseen combinations rather than relying on memorized co-occurrence statistics.

Empirical results show that all model architectures suffer pronounced drops in mAP when evaluated on SG splits, especially one-stage transformer-based approaches. Two-stage models that decouple instance and relation reasoning, such as STIP, demonstrate relatively stronger generalization. The split is created using an algorithmic process that ensures coverage of all objects and actions in the training set, and no overlap in the combination space between splits.
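
As an illustration, the sketch below shows one straightforward way to satisfy the stated constraints (disjoint combination sets, with every object class and interaction still covered in training) via rejection sampling; it is not claimed to be the exact procedure of Takemoto et al. (2023).

```python
import random
from typing import List, Set, Tuple

Combo = Tuple[str, str]  # (object class, interaction/verb)

def make_sg_split(all_combos: List[Combo], n_test: int = 60,
                  seed: int = 0, max_tries: int = 10_000) -> Tuple[Set[Combo], Set[Combo]]:
    """Sample a train/test split over object-interaction combinations so that
    (i) the two sets are disjoint and (ii) every object class and every
    interaction still occurs in at least one training combination."""
    rng = random.Random(seed)
    all_objects = {o for o, _ in all_combos}
    all_verbs = {v for _, v in all_combos}
    for _ in range(max_tries):
        test = set(rng.sample(all_combos, n_test))
        train = set(all_combos) - test
        if {o for o, _ in train} == all_objects and {v for _, v in train} == all_verbs:
            return train, test
    raise RuntimeError("No valid split found; increase max_tries or relax constraints.")
```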

5. Class Imbalance, Synthetic Augmentation, and B-RIGHT

HICO-DET exhibits severe long-tailed class imbalance, with some HOI categories represented by thousands of images, while others appear sparsely. This imbalance distorts AP metrics and confounds fair model comparisons. The B-RIGHT (Jang et al., 28 Jan 2025) extension addresses these limitations by algorithmically constructing a balanced dataset where each of 351 selected HOI categories has exactly 50 instances in training and 10 in testing. Additionally, a zero-shot test split provides uniform coverage of entirely novel HOI classes.

B-RIGHT employs a balancing algorithm that iteratively adds and removes images with full awareness of multi-label distributions. To supplement rare categories, it uses retrieval-augmented generation: sourcing image-text prompts from a vision-language model (VLM) based on template queries, generating synthetic images with a diffusion model (SDXL-Lightning), and then filtering candidates through open-world detection and LLM-based verification. This ensures that only high-quality, correctly instantiated synthetic images are retained for each HOI class.
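
For intuition, a deliberately simplified sketch of the balancing step is given below. It is a greedy stand-in under assumed inputs (an image-to-categories mapping), not the actual B-RIGHT procedure, and it will generally undershoot exact per-category quotas; in the full pipeline, categories that cannot be filled from real images are the ones supplemented with filtered synthetic images.

```python
from collections import Counter
from typing import Dict, List, Set

def greedy_balance(image_labels: Dict[str, Set[int]], quota: int = 50) -> List[str]:
    """Greedy selection of training images: keep an image only if every HOI
    category it contains is still below the per-category quota.
    `image_labels` maps an image id to the set of HOI categories it contains."""
    counts: Counter = Counter()
    selected: List[str] = []
    for img, cats in image_labels.items():
        if cats and all(counts[c] < quota for c in cats):
            selected.append(img)
            counts.update(cats)
    return selected
```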

Re-evaluation of existing models on B-RIGHT reveals substantial reductions in AP score variance and altered performance rankings. Architectures that previously performed strongly in imbalanced regimes lost standing, while those with robust, balanced representations—such as decoupled PViC and UPT—rose to prominence.

6. Role in Model Development, Analysis, and Future Directions

HICO-DET has functioned as a central resource for benchmarking, analysis, and methodological innovation in HOI detection. Core impacts include:

  • Enabling statistical significance testing of architectural choices (e.g., the use of interaction patterns; paired t-tests demonstrated improvements at $p < 0.05$ (Chao et al., 2017)).
  • Supporting hierarchical, two-stage frameworks leveraging interactiveness filtering, which improved precision and rare-category recognition (Li et al., 2018).
  • Serving as the foundation for weakly supervised and transferable knowledge paradigms, where HOI context (human pose + verb semantics) powers object localization in rare/unseen categories (Kim et al., 2019).
  • Facilitating systematic generalization and benchmarking of compositional reasoning with the SG split (Takemoto et al., 2023).
  • Revealing the impact of class imbalance and prompting the design of balanced datasets and synthetic augmentation pipelines (Jang et al., 28 Jan 2025).

Ongoing research aims to improve systematic generalization via modular architectures, vocabulary expansion, diverse pretraining (including vision-language models and scene graph generation), and multimodal contextual reasoning.

7. Common Misconceptions and Discussion of Limitations

  • Memorization vs. Generalization: Empirical evidence from the SG splits demonstrates that high mAP scores on the original HICO-DET split are not indicative of a model’s ability to systematically generalize. Many models primarily exploit memorized object–interaction co-occurrences.
  • Metric Reliability: Evaluation metrics such as AP may be inflated or deflated by the extreme class imbalance in the original dataset. Balanced datasets such as B-RIGHT yield more reliable and interpretable scores.
  • Bounding Box Exhaustiveness: While HICO-DET annotations are extensive, recall errors can occur due to proposal limitations in candidate generation, with a top-10 proposal recall of only ~46.75% (Chao et al., 2017). This places an upper bound on achievable mAP and motivates improved proposal or query mechanisms.
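
For reference, a rough sketch of how such a top-K proposal recall can be computed is given below; the box format, IoU threshold, and input structure are assumptions rather than the original evaluation code.

```python
def recall_at_k(gt_boxes, proposals, k=10, iou_thresh=0.5):
    """Fraction of ground-truth boxes matched (IoU > iou_thresh) by at least one
    of the top-k proposals for the same image. `gt_boxes` and `proposals` are
    lists over images of lists of (x1, y1, x2, y2) boxes, with proposals assumed
    to be sorted by detector confidence."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    matched, total = 0, 0
    for gts, props in zip(gt_boxes, proposals):
        for gt in gts:
            total += 1
            if any(iou(gt, p) > iou_thresh for p in props[:k]):
                matched += 1
    return matched / total if total else 0.0
```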

A plausible implication is that future progress in HOI detection will depend on both enhanced compositional reasoning and the adoption of more principled balanced benchmarks to ensure accurate, reproducible measurement of generalization capabilities across the HOI category space.
